Thanks for the ideas. I did not get much info from running with `--gc:boehm`. When I have more time, I'll try `--gc:markAndSweep`, and also `--tlsEmulation:on`.
> It's often best to limit the number of threads to the number of cores on a processor...

I don't know how to do that; Nim's threadpool does not offer a way. The previous solution used Python/multiprocessing + Cython. You can argue that it avoided NUMA problems via multiprocessing, but it did not suffer memory growth. Possibly calloc/free uses mmap/munmap for large blocks.

Interestingly, my little Nim example on OSX sped up after total memory usage quickly stabilized, so maybe its allocator is better optimized for large, repeated alloc/free cycles. But I don't know whether the Linux slow-down comes from the large allocations, the many small allocations, or the basic processing.

> If you're allocating lots of very large blocks of memory, fragmentation is going to hurt you sooner or later. The only solution for that would be a compacting garbage collector. I'm not sure what you're allocating in your actual code.

Yeah, that's what I think is going on. I would love to be able to view the free-list lengths when some flag is set, so I wouldn't have to wonder. Would such a change be likely to be accepted on GitHub?
