hi,
I noticed that during some simplistic benchmarks (e.g.
https://github.com/numpy/numpy/issues/4310) a lot of time is spent in
the kernel zeroing pages.
This is because under Linux glibc will always allocate large memory
blocks with mmap. As these pages may previously have belonged to other
processes, the kernel must zero them for security reasons.

For memory reused within the numpy process this is unnecessary and
possibly a large overhead for the many temporaries numpy creates.
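
To illustrate, here is a toy benchmark in the spirit of that issue (the
array size and iteration count are arbitrary choices of mine); each
temporary is large enough that glibc serves it with mmap, so the kernel
has to hand out freshly zeroed pages on every iteration:

import numpy as np

a = np.ones(10 * 1024 * 1024)   # ~80 MB of float64, well above the mmap threshold
for _ in range(100):
    b = a + a + a               # each iteration creates and frees large temporaries

Running this under time(1) shows a large "sys" component, which is the
kernel zeroing the pages handed back by mmap.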

The behavior of glibc can be tuned to change the threshold at which it
starts using mmap, but that would be a platform-specific fix.
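
For reference, the threshold can be changed at runtime through mallopt
(or the MALLOC_MMAP_THRESHOLD_ environment variable); the following
ctypes snippet is purely illustrative and glibc-only, i.e. exactly the
kind of platform-specific tuning I mean rather than a proposed fix:

import ctypes

libc = ctypes.CDLL("libc.so.6")      # Linux/glibc only
M_MMAP_THRESHOLD = -3                # constant from glibc's <malloc.h>
# serve blocks of up to 32 MiB from the heap instead of mmap, so freed
# blocks can be reused without the kernel zeroing fresh pages
libc.mallopt(M_MMAP_THRESHOLD, 32 * 1024 * 1024)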

I was thinking about adding a thread-local cache of pointers to
allocated memory blocks.
When an array is created it tries to get its memory from the cache, and
when it is deallocated the memory is returned to the cache.
The threshold and cached memory block sizes could be adaptive depending
on the application workload.
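
To make the idea concrete, here is a toy, pure-Python sketch of such a
cache; the real change would live in numpy's C allocation routines, and
the size limits below are made-up placeholders for whatever adaptive
policy ends up being used:

import threading

_local = threading.local()

MAX_CACHED_PER_SIZE = 4              # made-up policy knobs, see above
MAX_CACHED_BLOCK = 64 * 1024 * 1024

def cached_alloc(nbytes):
    # allocation path: reuse a previously freed block of the same size if
    # one is cached for this thread, otherwise fall back to a fresh block
    cache = getattr(_local, "cache", None)
    if cache is None:
        cache = _local.cache = {}
    blocks = cache.get(nbytes)
    if blocks:
        return blocks.pop()
    return bytearray(nbytes)

def cached_free(buf):
    # deallocation path: keep the block for later reuse instead of
    # returning it to the system, within the configured limits
    if len(buf) > MAX_CACHED_BLOCK:
        return
    cache = getattr(_local, "cache", None)
    if cache is None:
        cache = _local.cache = {}
    blocks = cache.setdefault(len(buf), [])
    if len(blocks) < MAX_CACHED_PER_SIZE:
        blocks.append(buf)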

For simplistic, temporary-heavy benchmarks this eliminates the time
spent in the kernel (the system time reported by time).

But I don't know how relevant this is for real world applications.
Have you noticed large amounts of time spent in the kernel in your apps?

I also found this paper, which describes pretty much exactly what I'm
proposing:
pyhpc.org/workshop/papers/Doubling.pdf

Does someone know why their changes were never incorporated into numpy?
I couldn't find a reference in the list archive.