On Wed, Jan 19, 2022 at 7:48 PM Francesc Alted <fal...@gmail.com> wrote:

> On Wed, Jan 19, 2022 at 6:58 PM Stanley Seibert <sseib...@anaconda.com>
> wrote:
>
>> Given that this seems to be Linux only, is this related to how glibc does
>> large allocations (>128kB) using mmap()?
>>
>> https://stackoverflow.com/a/33131385
>>
>
> That's a good point. As MMAP_THRESHOLD is 128 KB, and the size of `z` is
> almost 4 MB, mmap machinery is probably getting involved here.  Also, as
> pages acquired via anonymous mmap are not actually allocated until you
> access them the first time, that would explain why the first access is
> slow.  What puzzles me is that the timeit loops access `z` data 3*10000
> times, which is plenty of time for doing the allocation (it should only
> be needed in the first iteration).
>

I think I have more evidence that what is happening here has to do with how
the malloc mechanism works in Linux.  I find the following explanation to be
really good:

https://sourceware.org/glibc/wiki/MallocInternals

In addition, this excerpt from the mallopt manpage (
https://man7.org/linux/man-pages/man3/mallopt.3.html) is very relevant:

              Note: Nowadays, glibc uses a dynamic mmap threshold by
              default.  The initial value of the threshold is 128*1024,
              but when blocks larger than the current threshold and less
              than or equal to DEFAULT_MMAP_THRESHOLD_MAX are freed, the
              threshold is adjusted upward to the size of the freed
              block.  When dynamic mmap thresholding is in effect, the
              threshold for trimming the heap is also dynamically
              adjusted to be twice the dynamic mmap threshold.  Dynamic
              adjustment of the mmap threshold is disabled if any of the
              M_TRIM_THRESHOLD, M_TOP_PAD, M_MMAP_THRESHOLD, or
              M_MMAP_MAX parameters is set.

This description matches closely what is happening here: after `z` is freed
(replaced by another random array in the second part of the calculation),
the dynamic mmap threshold kicks in: the mmap threshold is raised to the
size of the freed block (~4 MB in this case) and the heap trim threshold to
twice that.  So for the second part the program break (i.e. where the heap
ends) is increased instead of calling mmap(), which is faster because the
pages, once faulted in, are reused without having to be zeroed again.
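
For reference, the mmap-numpy.py script I am timing here is essentially
Warren's original benchmark; roughly (modulo details) it boils down to the
plain-NumPy counterpart of the numexpr script quoted at the end of this
message:

"""
import timeit
import numpy as np


def generate_sample(n, rng):
    # 2*n Gaussian floats viewed as n complex128 values (~4 MB for n = 250000)
    return rng.normal(scale=1000, size=2*n).view(np.complex128)


print(f'numpy version {np.__version__}')
print()

rng = np.random.default_rng()
n = 250000
timeit_reps = 10000

expr = 'z.real**2 + z.imag**2'

# First part: time the expression on a freshly generated array.
z = generate_sample(n, rng)
for _ in range(3):
    t = timeit.timeit(expr, globals=globals(), number=timeit_reps)
    print(f"{1e6*t/timeit_reps:9.4f} microseconds")
print()

# Second part: replace z with another freshly generated array and repeat.
z = generate_sample(n, rng)
for _ in range(3):
    t = timeit.timeit(expr, globals=globals(), number=timeit_reps)
    print(f"{1e6*t/timeit_reps:9.4f} microseconds")
print()
"""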

Interestingly, the M_MMAP_THRESHOLD for the system malloc can be set via
the MALLOC_MMAP_THRESHOLD_ environment variable.  For example, the original
times are:

$ python mmap-numpy.py
numpy version 1.20.3

 635.4752 microseconds
 635.8906 microseconds
 636.0661 microseconds

 144.7238 microseconds
 143.9147 microseconds
 144.0621 microseconds

but if we force malloc to always use mmap:

$ MALLOC_MMAP_THRESHOLD_=0 python mmap-numpy.py
numpy version 1.20.3

 628.8890 microseconds
 628.0965 microseconds
 628.7590 microseconds

 640.9369 microseconds
 641.5104 microseconds
 642.4027 microseconds

so the first and second parts execute at the same (slow) speed.  And if we
set the threshold to exactly 4 MB:

$ MALLOC_MMAP_THRESHOLD_=4194304 python mmap-numpy.py
numpy version 1.20.3

 630.7381 microseconds
 631.3634 microseconds
 632.2200 microseconds

 382.6925 microseconds
 380.1790 microseconds
 380.0340 microseconds

we see how performance improves for the second part (although not as much
as without setting the threshold manually; the manual setting probably
prevents other optimizations from kicking in).
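
By the way, for those who prefer not to rely on environment variables, the
same knob should be settable from inside the process via mallopt(3).  This
is just an untested sketch using ctypes (Linux/glibc only; M_MMAP_THRESHOLD
is -3 in glibc's <malloc.h>), and it has to run before the big arrays are
allocated:

"""
import ctypes

# From glibc's <malloc.h>: M_MMAP_THRESHOLD == -3.
# mallopt() returns 1 on success and 0 on error.
M_MMAP_THRESHOLD = -3

libc = ctypes.CDLL("libc.so.6")
ok = libc.mallopt(M_MMAP_THRESHOLD, 4 * 1024 * 1024)
print("mallopt(M_MMAP_THRESHOLD, 4 MiB) succeeded:", ok == 1)

# ... create the NumPy arrays after this point ...
"""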

As a final check, if we use another malloc implementation, such as the
excellent mimalloc (https://github.com/microsoft/mimalloc), we get really
good performance for both parts:

$ LD_PRELOAD=/usr/local/lib/libmimalloc.so  python mmap-numpy.py
numpy version 1.20.3

 147.5968 microseconds
 146.9028 microseconds
 147.1794 microseconds

 148.0905 microseconds
 147.7667 microseconds
 147.5180 microseconds

However, as this avoids the mmap() calls, this approach probably uses
more memory, especially when large arrays need to be handled.
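
If someone wants to check that memory-usage hypothesis, a quick (if crude)
way is to compare the peak RSS reported by getrusage() with and without the
LD_PRELOAD; just a sketch (on Linux, ru_maxrss is reported in KiB):

"""
import resource
import numpy as np


def generate_sample(n, rng):
    return rng.normal(scale=1000, size=2*n).view(np.complex128)


rng = np.random.default_rng()
for _ in range(2):
    z = generate_sample(250000, rng)
    w = z.real**2 + z.imag**2

# On Linux, ru_maxrss is the peak resident set size in kilobytes.
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak_kb / 1024:.1f} MiB")
"""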

All in all, this is a testament to how much memory handling can affect
performance on modern computers.  Perhaps it is time to test different
memory allocation strategies in NumPy and come up with suggestions for
users.
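
In the meantime, the buffer-reuse trick I mentioned earlier in the thread
works as a user-level workaround: allocate the array once and refill it in
place, so the pages stay faulted-in across both parts.  Something along
these lines:

"""
import numpy as np


def generate_sample(n, rng):
    return rng.normal(scale=1000, size=2*n).view(np.complex128)


rng = np.random.default_rng()
n = 250000

# Allocate once; refilling in place reuses the same pages instead of
# triggering a fresh (mmap-backed) allocation for the second part.
z = np.empty(n, dtype=np.complex128)
z[:] = generate_sample(n, rng)
# ... time the first part here ...
z[:] = generate_sample(n, rng)
# ... time the second part here ...
"""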

Francesc



>
>
>>
>>
>> On Wed, Jan 19, 2022 at 9:06 AM Sebastian Berg <
>> sebast...@sipsolutions.net> wrote:
>>
>>> On Wed, 2022-01-19 at 11:49 +0100, Francesc Alted wrote:
>>> > On Wed, Jan 19, 2022 at 7:33 AM Stefan van der Walt
>>> > <stef...@berkeley.edu>
>>> > wrote:
>>> >
>>> > > On Tue, Jan 18, 2022, at 21:55, Warren Weckesser wrote:
>>> > > > expr = 'z.real**2 + z.imag**2'
>>> > > >
>>> > > > z = generate_sample(n, rng)
>>> > >
>>> > > 🤔 If I duplicate the `z = ...` line, I get the fast result
>>> > > throughout.
>>> > > If, however, I use `generate_sample(1, rng)` (or any other value
>>> > > than `n`),
>>> > > it does not improve matters.
>>> > >
>>> > > Could this be a memory caching issue?
>>> > >
>>>
>>> Yes, it is a caching issue for sure.  We have seen similar random
>>> fluctuations before.
>>> You can prove that it is a page-fault issue by running it with
>>> `perf stat`.  I did this twice, once with the second loop removed
>>> (page-faults only):
>>>
>>> 28333629      page-faults               #  936.234 K/sec
>>> 28362718      page-faults               #    1.147 M/sec
>>>
>>> The number of page faults is huge.  Running only the second one (or
>>> running the first one only once, rather), gave me:
>>>
>>> 15024      page-faults               #    1.837 K/sec
>>>
>>> So that is the *reason*.  I had tried before to figure out why the page
>>> faults differ so much, or whether we can do something about it.  But I
>>> never had any reasonable lead on it.
>>>
>>> In general, these fluctuations are pretty random, in the sense that
>>> unrelated code changes and recompilation can swap the behaviour easily.
>>> As Andras noted, he sees the opposite.
>>>
>>> I would love to have an idea if there is a way to figure out why the
>>> page-faults are so imbalanced between the two.
>>>
>>> (I have not looked at CPU cache misses this time, but since page-faults
>>> happen, I assume that should not matter?)
>>>
>>> Cheers,
>>>
>>> Sebastian
>>>
>>>
>>> >
>>> > I can also reproduce that, but only on my Linux boxes.  My MacMini
>>> > does not
>>> > notice the difference.
>>> >
>>> > Interestingly enough, you don't even need an additional call to
>>> > `generate_sample(n, rng)`. If one uses `z = np.empty(...)` and then
>>> > does an assignment, like:
>>> >
>>> > z = np.empty(n, dtype=np.complex128)
>>> > z[:] = generate_sample(n, rng)
>>> >
>>> > then everything runs at the same speed:
>>> >
>>> > numpy version 1.20.3
>>> >
>>> >  142.3667 microseconds
>>> >  142.3717 microseconds
>>> >  142.3781 microseconds
>>> >
>>> >  142.7593 microseconds
>>> >  142.3579 microseconds
>>> >  142.3231 microseconds
>>> >
>>> > As another data point, when doing the same operation with numexpr
>>> > I am not seeing any difference either, not even on Linux:
>>> >
>>> > numpy version 1.20.3
>>> > numexpr version 2.8.1
>>> >
>>> >   95.6513 microseconds
>>> >   88.1804 microseconds
>>> >   97.1322 microseconds
>>> >
>>> >  105.0833 microseconds
>>> >  100.5555 microseconds
>>> >  100.5654 microseconds
>>> >
>>> > [it is actually a bit the other way around: the second iteration
>>> > seems a hair faster]
>>> > See the numexpr script below.
>>> >
>>> > I am totally puzzled here.
>>> >
>>> > """
>>> > import timeit
>>> > import numpy as np
>>> > import numexpr as ne
>>> >
>>> >
>>> > def generate_sample(n, rng):
>>> >     return rng.normal(scale=1000, size=2*n).view(np.complex128)
>>> >
>>> >
>>> > print(f'numpy version {np.__version__}')
>>> > print(f'numexpr version {ne.__version__}')
>>> > print()
>>> >
>>> > rng = np.random.default_rng()
>>> > n = 250000
>>> > timeit_reps = 10000
>>> >
>>> > expr = 'ne.evaluate("zreal**2 + zimag**2")'
>>> >
>>> > z = generate_sample(n, rng)
>>> > zreal = z.real
>>> > zimag = z.imag
>>> > for _ in range(3):
>>> >     t = timeit.timeit(expr, globals=globals(), number=timeit_reps)
>>> >     print(f"{1e6*t/timeit_reps:9.4f} microseconds")
>>> > print()
>>> >
>>> > z = generate_sample(n, rng)
>>> > zreal = z.real
>>> > zimag = z.imag
>>> > for _ in range(3):
>>> >     t = timeit.timeit(expr, globals=globals(), number=timeit_reps)
>>> >     print(f"{1e6*t/timeit_reps:9.4f} microseconds")
>>> > print()
>>> > """
>>> >
>>>
>>>
>>
>
>
> --
> Francesc Alted
>


-- 
Francesc Alted
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com
