First, some notes:

  1. Linux defaults to `--tlsEmulation:off`, while macOS defaults to 
`--tlsEmulation:on`. I'm not sure how that would affect the behavior you 
mention, though.
  2. Any computer with 48 cores is going to be a NUMA architecture, so if your 
code runs very slowly, that might be the reason. Tweaking allocation-heavy 
multithreaded code to run properly on a NUMA architecture can be tricky. It's 
often best to limit the number of threads to the number of cores on a single 
processor and use `numactl` to make sure all allocations stay local. Use 
multiprocessing if you want higher scalability.
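
     To illustrate the `numactl` suggestion above, a hypothetical invocation
     might look like this (`./myapp` is a placeholder for your binary; node 0
     is just an example, check your layout first):

         # Show the machine's NUMA node layout
         numactl --hardware

         # Pin both the threads and their allocations to node 0,
         # so memory accesses stay node-local
         numactl --cpunodebind=0 --membind=0 ./myapp
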
  3. If you're allocating lots of very large blocks of memory, fragmentation is 
going to hurt you sooner or later. The only real solution for that would be a 
compacting garbage collector. I'm not sure what you're allocating in your 
actual code. If you're reading files into memory, the `memfiles` module might 
help (reading large files in a memory-constrained situation is generally 
problematic, as they occupy both OS cache space and the space where they 
finally end up).
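
     A minimal sketch of the `memfiles` suggestion, assuming you currently
     read whole files into GC'd buffers (`"data.bin"` is a placeholder):

         import memfiles

         # Map the file instead of copying it into GC'd memory; the mapping
         # lives outside the Nim heap, so it can't fragment it.
         var mf = memfiles.open("data.bin")
         try:
           # mf.mem is a raw pointer to the mapped region, mf.size its length
           let p = cast[ptr UncheckedArray[byte]](mf.mem)
           var total = 0
           for i in 0 ..< mf.size:
             total += int(p[i])
           echo total
         finally:
           mf.close()
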



As for single-threaded free memory growing indefinitely, that is very puzzling. 
Have you tried running with `--gc:markAndSweep` and `--gc:boehm` as 
alternatives? The Boehm GC is a bit more wasteful with memory, but it may help 
narrow down the problem.
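
Concretely, that means rebuilding with each collector and comparing memory 
behavior (`main.nim` is a placeholder for your entry point):

    nim c -d:release --gc:markAndSweep main.nim
    nim c -d:release --gc:boehm main.nim

If the growth disappears under one of them, that points at the default GC 
rather than your code.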
