First, some notes:

1. Linux defaults to `--tlsemulation:off`, macOS defaults to `--tlsemulation:on`. I'm not sure how that would affect the behavior you describe, though.
2. Any machine with 48 cores is going to have a NUMA architecture, so if your code runs very slowly, that might be the reason. Tweaking allocation-heavy multithreaded code to run well on a NUMA machine can be tricky. It's often best to limit the number of threads to the number of cores on a single processor and use `numactl` to keep all memory accesses node-local. Use multiprocessing if you need higher scalability.
3. If you're allocating lots of very large blocks of memory, fragmentation is going to hurt you sooner or later; the only real solution for that is a compacting garbage collector. I'm not sure what you're allocating in your actual code, but if you're reading files into memory, the `memfiles` module might help (reading large files in a memory-constrained situation is generally problematic, since the data occupies both OS cache space and the space where it finally ends up).
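For point 2, a minimal sketch of what I mean by pinning to one node — `./myapp` and the node/thread counts are placeholders for your own setup:

```shell
# Inspect the NUMA layout first: how many nodes, which CPUs and
# how much memory belong to each.
numactl --hardware

# Run the program with its threads and its allocations confined
# to node 0, so no thread pays remote-memory latency.
numactl --cpunodebind=0 --membind=0 ./myapp
```

With the thread count capped at one node's core count, the allocator never has to serve cross-node requests, which is usually where the NUMA slowdown comes from.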
As for free memory growing indefinitely even in the single-threaded case, that is very puzzling. Have you tried running with `--gc:markandsweep` and `--gc:boehm` as alternatives? The Boehm GC is somewhat more wasteful with memory, but it may help narrow the problem down.
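Swapping the GC is just a compile switch — assuming your program is `main.nim`, something like:

```shell
# Build with the mark-and-sweep collector instead of the default
# deferred refcounting GC.
nim c -d:release --gc:markandsweep main.nim

# Build with the Boehm conservative GC (requires libgc to be
# installed on the system).
nim c -d:release --gc:boehm main.nim
```

If the leak-like growth disappears under one of these, that points at the specific collector rather than at your allocation pattern.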
