Excellent! Yes, it is working now on Ubuntu, on CentOS 6.6 (built on Ubuntu), and on OS X. With one worker thread, memory consumption is very low and stable, around 300 MB. Beautiful!
With the earlier fix, there was actually a disturbing diff between the expected and the new output, indicating a really subtle memory bug, but that is now fixed on origin/devel too. I will concentrate on runtime next and experiment with multiple threads.
