hi, I noticed that in some simplistic benchmarks (e.g. https://github.com/numpy/numpy/issues/4310) a lot of time is spent in the kernel zeroing pages. This is because under Linux glibc always allocates large memory blocks with mmap. As these pages can come from other processes, the kernel must zero them for security reasons.
For memory that stays within the numpy process this zeroing is unnecessary, and it is possibly a large overhead for the many temporaries numpy creates. glibc can be tuned to change the threshold at which it starts using mmap, but that would be a platform-specific fix.

I was thinking about adding a thread-local cache of pointers to allocated memory blocks. When an array is created it tries to get its memory from the cache, and when it is deallocated it returns the memory to the cache. The threshold and the cached block sizes could be adaptive, depending on the application workload. For simplistic, temporary-heavy benchmarks this eliminates the time spent in the kernel (the "sys" time reported by time). But I don't know how relevant this is for real-world applications. Have you noticed large amounts of time spent in the kernel in your apps?

I also found this paper, which describes pretty much exactly what I'm proposing: pyhpc.org/workshop/papers/Doubling.pdf
Does anyone know why their changes were never incorporated into numpy? I couldn't find a reference in the list archive.

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion