On 1/20/22, Francesc Alted <fal...@gmail.com> wrote:
> On Wed, Jan 19, 2022 at 7:48 PM Francesc Alted <fal...@gmail.com> wrote:
>
>> On Wed, Jan 19, 2022 at 6:58 PM Stanley Seibert <sseib...@anaconda.com>
>> wrote:
>>
>>> Given that this seems to be Linux only, is this related to how glibc
>>> does large allocations (>128kB) using mmap()?
>>>
>>> https://stackoverflow.com/a/33131385
>>
>> That's a good point. As MMAP_THRESHOLD is 128 KB, and the size of `z` is
>> almost 4 MB, the mmap machinery is probably getting involved here. Also,
>> as pages acquired via anonymous mmap are not actually allocated until
>> you access them for the first time, that would explain why the first
>> access is slow. What puzzles me is that the timeit loops access the `z`
>> data 3*10000 times, which is plenty of opportunity for the allocation to
>> happen (it should require just a single iteration).
>
> I think I have more evidence that what is happening here has to do with
> how the malloc mechanism works in Linux. I find the following explanation
> to be really good:
>
> https://sourceware.org/glibc/wiki/MallocInternals
>
> In addition, this excerpt of the mallopt manpage
> (https://man7.org/linux/man-pages/man3/mallopt.3.html) is very
> significant:
>
>     Note: Nowadays, glibc uses a dynamic mmap threshold by
>     default.  The initial value of the threshold is 128*1024,
>     but when blocks larger than the current threshold and less
>     than or equal to DEFAULT_MMAP_THRESHOLD_MAX are freed, the
>     threshold is adjusted upward to the size of the freed
>     block.  When dynamic mmap thresholding is in effect, the
>     threshold for trimming the heap is also dynamically
>     adjusted to be twice the dynamic mmap threshold.  Dynamic
>     adjustment of the mmap threshold is disabled if any of the
>     M_TRIM_THRESHOLD, M_TOP_PAD, M_MMAP_THRESHOLD, or
>     M_MMAP_MAX parameters is set.
>
> This description matches closely what is happening here: after `z` is
> freed (replaced by another random array in the second part of the
> calculation), the dynamic mmap threshold mechanism kicks in and the
> thresholds are raised (the mmap threshold to the size of the freed block,
> ~4 MB in this case, and the trim threshold to twice that). So for the
> second part, the program break (i.e. where the heap ends) is increased
> instead, which is faster because this memory does not need to be zeroed
> before use.
>
> Interestingly, the M_MMAP_THRESHOLD for the system malloc can be set via
> the MALLOC_MMAP_THRESHOLD_ environment variable. For example, the
> original times are:
>
> $ python mmap-numpy.py
> numpy version 1.20.3
>
>  635.4752 microseconds
>  635.8906 microseconds
>  636.0661 microseconds
>
>  144.7238 microseconds
>  143.9147 microseconds
>  144.0621 microseconds
>
> but if we force mmap to always be used:
>
> $ MALLOC_MMAP_THRESHOLD_=0 python mmap-numpy.py
> numpy version 1.20.3
>
>  628.8890 microseconds
>  628.0965 microseconds
>  628.7590 microseconds
>
>  640.9369 microseconds
>  641.5104 microseconds
>  642.4027 microseconds
>
> the first and second parts execute at the same (slow) speed. And if we
> set the threshold to be exactly 4 MB:
>
> $ MALLOC_MMAP_THRESHOLD_=4194304 python mmap-numpy.py
> numpy version 1.20.3
>
>  630.7381 microseconds
>  631.3634 microseconds
>  632.2200 microseconds
>
>  382.6925 microseconds
>  380.1790 microseconds
>  380.0340 microseconds
>
> we see how performance improves for the second part (although not as much
> as without specifying the threshold manually; this manual setting
> probably prevents other optimizations from kicking in).
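(As an aside, the same two knobs can in principle be set from inside the process via mallopt(3), the call documented in the manpage excerpt quoted above. The following is only a rough, glibc-specific sketch using ctypes; the parameter constants are taken from glibc's <malloc.h>, and the 4 MB values are picked purely for illustration.)

```
import ctypes

# Parameter constants from glibc's <malloc.h> (glibc-specific, not portable).
M_TRIM_THRESHOLD = -1
M_MMAP_THRESHOLD = -3

libc = ctypes.CDLL("libc.so.6")  # Linux/glibc only.

# int mallopt(int param, int value); returns 1 on success, 0 on error.
# Setting either parameter disables glibc's dynamic threshold adjustment,
# just like the MALLOC_*_ environment variables do.
assert libc.mallopt(M_MMAP_THRESHOLD, 4 * 1024 * 1024) == 1
assert libc.mallopt(M_TRIM_THRESHOLD, 4 * 1024 * 1024) == 1
```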
> As a final check, if we use other malloc systems, like the excellent
> mimalloc (https://github.com/microsoft/mimalloc), we can get really good
> performance for both parts:
>
> $ LD_PRELOAD=/usr/local/lib/libmimalloc.so python mmap-numpy.py
> numpy version 1.20.3
>
>  147.5968 microseconds
>  146.9028 microseconds
>  147.1794 microseconds
>
>  148.0905 microseconds
>  147.7667 microseconds
>  147.5180 microseconds
>
> However, as this avoids the mmap() calls, this approach probably uses
> more memory, especially when large arrays need to be handled.
>
> All in all, this is a testament to how much memory handling can affect
> performance on modern computers. Perhaps it is time to test different
> memory allocation strategies in NumPy and come up with suggestions for
> users.
>
> Francesc
>
>>> On Wed, Jan 19, 2022 at 9:06 AM Sebastian Berg
>>> <sebast...@sipsolutions.net> wrote:
>>>
>>>> On Wed, 2022-01-19 at 11:49 +0100, Francesc Alted wrote:
>>>> > On Wed, Jan 19, 2022 at 7:33 AM Stefan van der Walt
>>>> > <stef...@berkeley.edu> wrote:
>>>> >
>>>> > > On Tue, Jan 18, 2022, at 21:55, Warren Weckesser wrote:
>>>> > > > expr = 'z.real**2 + z.imag**2'
>>>> > > >
>>>> > > > z = generate_sample(n, rng)
>>>> > >
>>>> > > 🤔 If I duplicate the `z = ...` line, I get the fast result
>>>> > > throughout. If, however, I use `generate_sample(1, rng)` (or any
>>>> > > other value than `n`), it does not improve matters.
>>>> > >
>>>> > > Could this be a memory caching issue?
>>>> >
>>>>
>>>> Yes, it is a caching issue for sure. We have seen similar random
>>>> fluctuations before. You can prove that it is a cache page-fault issue
>>>> by running it with `perf --stat`. I did this twice, once with the
>>>> second loop removed (page-faults only):
>>>>
>>>>     28333629 page-faults  # 936.234 K/sec
>>>>     28362718 page-faults  # 1.147 M/sec
>>>>
>>>> The number of page faults here is huge. Running only the second one
>>>> (or rather, running the first one only once) gave me:
>>>>
>>>>     15024 page-faults  # 1.837 K/sec
>>>>
>>>> So that is the *reason*. I had tried before to figure out why the page
>>>> faults differ so much, or whether we can do something about it, but I
>>>> never had any reasonable lead on it.
>>>>
>>>> In general, these fluctuations are pretty random, in the sense that
>>>> unrelated code changes and recompilation can easily swap the
>>>> behaviour, as Andras noted when he reported seeing the opposite.
>>>>
>>>> I would love to have an idea of how to figure out why the page faults
>>>> are so imbalanced between the two.
>>>>
>>>> (I have not looked at CPU cache misses this time, but since page
>>>> faults happen, I assume that should not matter?)
>>>>
>>>> Cheers,
>>>>
>>>> Sebastian
>>>>
>>>> > I can also reproduce that, but only on my Linux boxes. My MacMini
>>>> > does not notice the difference.
>>>> >
>>>> > Interestingly enough, you don't even need an additional call to
>>>> > `generate_sample(n, rng)`.
>>>> > If one uses `z = np.empty(...)` and then does an assignment, like:
>>>> >
>>>> > z = np.empty(n, dtype=np.complex128)
>>>> > z[:] = generate_sample(n, rng)
>>>> >
>>>> > then everything runs at the same speed:
>>>> >
>>>> > numpy version 1.20.3
>>>> >
>>>> >  142.3667 microseconds
>>>> >  142.3717 microseconds
>>>> >  142.3781 microseconds
>>>> >
>>>> >  142.7593 microseconds
>>>> >  142.3579 microseconds
>>>> >  142.3231 microseconds
>>>> >
>>>> > As another data point, when doing the same operation with numexpr I
>>>> > am not seeing any difference either, not even on Linux:
>>>> >
>>>> > numpy version 1.20.3
>>>> > numexpr version 2.8.1
>>>> >
>>>> >   95.6513 microseconds
>>>> >   88.1804 microseconds
>>>> >   97.1322 microseconds
>>>> >
>>>> >  105.0833 microseconds
>>>> >  100.5555 microseconds
>>>> >  100.5654 microseconds
>>>> >
>>>> > [if anything, it is a bit the other way around: the second iteration
>>>> > seems a hair faster]
>>>> >
>>>> > See the numexpr script below.
>>>> >
>>>> > I am totally puzzled here.
>>>> >
>>>> > """
>>>> > import timeit
>>>> > import numpy as np
>>>> > import numexpr as ne
>>>> >
>>>> >
>>>> > def generate_sample(n, rng):
>>>> >     return rng.normal(scale=1000, size=2*n).view(np.complex128)
>>>> >
>>>> >
>>>> > print(f'numpy version {np.__version__}')
>>>> > print(f'numexpr version {ne.__version__}')
>>>> > print()
>>>> >
>>>> > rng = np.random.default_rng()
>>>> > n = 250000
>>>> > timeit_reps = 10000
>>>> >
>>>> > expr = 'ne.evaluate("zreal**2 + zimag**2")'
>>>> >
>>>> > z = generate_sample(n, rng)
>>>> > zreal = z.real
>>>> > zimag = z.imag
>>>> > for _ in range(3):
>>>> >     t = timeit.timeit(expr, globals=globals(), number=timeit_reps)
>>>> >     print(f"{1e6*t/timeit_reps:9.4f} microseconds")
>>>> > print()
>>>> >
>>>> > z = generate_sample(n, rng)
>>>> > zreal = z.real
>>>> > zimag = z.imag
>>>> > for _ in range(3):
>>>> >     t = timeit.timeit(expr, globals=globals(), number=timeit_reps)
>>>> >     print(f"{1e6*t/timeit_reps:9.4f} microseconds")
>>>> > print()
>>>> > """
>>
>> --
>> Francesc Alted
>
> --
> Francesc Alted
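(For reference, the mmap-numpy.py / timing_question.py benchmark being timed throughout this thread presumably looks roughly like the following minimal sketch, reconstructed from the expression and the generate_sample() function quoted above; the exact file contents are not shown in the thread and may differ.)

```
import timeit

import numpy as np


def generate_sample(n, rng):
    return rng.normal(scale=1000, size=2*n).view(np.complex128)


print(f'numpy version {np.__version__}')
print()

rng = np.random.default_rng()
n = 250000
timeit_reps = 10000

expr = 'z.real**2 + z.imag**2'

# First z: time the expression in three batches.
z = generate_sample(n, rng)
for _ in range(3):
    t = timeit.timeit(expr, globals=globals(), number=timeit_reps)
    print(f"{1e6*t/timeit_reps:9.4f} microseconds")
print()

# Second z: replace the array and time the same expression again.
z = generate_sample(n, rng)
for _ in range(3):
    t = timeit.timeit(expr, globals=globals(), number=timeit_reps)
    print(f"{1e6*t/timeit_reps:9.4f} microseconds")
print()
```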
Thanks Sebastian for pointing out that the issue is (minor) page faults, and
thanks Francesc for providing the links and the analysis of the situation.
After reading those links and experimenting with malloc in a couple of C
programs, it is clear now what is happening. The key factors to be aware of
are:

* What the values of M_MMAP_THRESHOLD and M_TRIM_THRESHOLD are and how they
  are updated dynamically.

* Minor page faults are triggered by new allocations from the system,
  whether via mmap() or via sbrk(). This means avoiding mmap is not enough
  to avoid potential problems with minor page faults.

In my test program, the pattern of allocating and freeing the 4M array
followed by repeatedly allocating and freeing 2M temporaries (in the first
batch of timeit() calls) triggers, in effect, a thrashing of the top of the
heap, in which memory is repeatedly allocated from and then returned to the
system via sbrk().

Here's a detailed timeline of a simplified version of what the test was
doing, if anyone is interested in the gory details. First note that the
default value of M_MMAP_THRESHOLD is 128*1024; for allocations above this
value, mmap will be used. The simplified pseudocode of the test program (not
necessarily including all the temporaries that might be generated in the
actual Python code) is:

```
z = malloc(4000000)   # More than M_MMAP_THRESHOLD, so mmap is used.
z[...] = ...          # Triggers 976 minor page faults (+ or -).

# The test program computes z.real**2 + z.imag**2 10000 times in the
# timeit call.  This requires temporary arrays; call them t1 and t2.
# The first time through the loop, we have...

t1 = malloc(2000000)  # More than M_MMAP_THRESHOLD, so mmap is used.
t1[...] = ...         # Triggers 488 minor page faults.
t2 = malloc(2000000)  # Uses mmap.
t2[...] = ...         # Another 488 minor page faults.

# The dynamic updating of the malloc parameters occurs in the call to
# free().  These allocations used mmap, so the call to free() returns
# the memory to the system.

free(t1)              # Dynamic update of the thresholds: now
                      # M_MMAP_THRESHOLD=2000000, M_TRIM_THRESHOLD=4000000.
free(t2)

# The second iteration of the timeit function:

t1 = malloc(2000000)  # Less than M_MMAP_THRESHOLD; heap is used.
t1[...] = ...         # Triggers 488 minor page faults.
t2 = malloc(2000000)  # Uses heap.
t2[...] = ...         # Another 488 minor page faults.

# The allocations of t1 and t2 have caused the heap to grow via
# call(s) to sbrk().

free(t1)
free(t2)

# When free(t2) is called, there is a 4000000 byte block at the end of
# the heap whose size equals M_TRIM_THRESHOLD, so it is returned to the
# system via sbrk().

# The remaining 9998 iterations are the same...

t1 = malloc(2000000)  # Less than M_MMAP_THRESHOLD; heap is used.
t1[...] = ...         # Triggers 488 minor page faults.
t2 = malloc(2000000)  # Uses heap.
t2[...] = ...         # Another 488 minor page faults.

# Once again, the allocations of t1 and t2 have caused the heap to grow
# via call(s) to sbrk().

free(t1)
free(t2)

# And once again, there is now a 4000000 byte block at the end of the
# heap whose size equals M_TRIM_THRESHOLD, so it is returned to the
# system via sbrk().

# The important point here is that in each iteration after the first,
# malloc() allocates memory from the system and returns it to the system
# via sbrk().  Writing to that newly allocated memory the first time
# triggers the minor page faults.

# After running timeit with the first z, we free z and then reallocate
# another 4000000 bytes for z.
free(z)               # This free() triggers another dynamic update, so now
                      # M_MMAP_THRESHOLD=4000000 and M_TRIM_THRESHOLD=8000000.

z = malloc(4000000)   # Uses heap.
z[...] = ...          # 976 minor page faults.

# First timeit iteration with the new z:

t1 = malloc(2000000)  # Less than M_MMAP_THRESHOLD; heap is used.
t1[...] = ...         # Triggers 488 minor page faults.
t2 = malloc(2000000)  # Uses heap.
t2[...] = ...         # Another 488 minor page faults.

free(t1)              # Thresholds are not updated in this case.
free(t2)

# The above calls to free() do not result in the heap being trimmed,
# because M_TRIM_THRESHOLD is now 8000000, and freeing the two 2M blocks
# does not create a big enough block at the top of the heap to be trimmed.

# Remaining 9999 timeit iterations...

# Because the heap was not trimmed, the following calls to malloc() for
# t1 and t2 will not require the heap to be grown via sbrk().

t1 = malloc(2000000)  # Less than M_MMAP_THRESHOLD; heap is used.
t1[...] = ...         # No minor page faults (yay!).
t2 = malloc(2000000)  # Uses heap.
t2[...] = ...         # No minor page faults.

free(t1)              # Thresholds are not updated in this case.
free(t2)

# As before, since M_TRIM_THRESHOLD is now 8000000, the calls to free()
# above do not trigger a trim of the heap.
```

When the timeit loops are run with the second z array, minor page faults
occur only when z is initialized and in the first iteration of the timeit
loop (the first time the temporary arrays are created and written to). In
the subsequent 9999 iterations, the memory for the temporaries is available
on the heap, so no more minor page faults are generated. This is why the
timeit results with the second z array are much better.

If we set M_MMAP_THRESHOLD to 0 via the MALLOC_MMAP_THRESHOLD_ environment
variable, we force all allocations to use mmap, and the corresponding free()
returns the memory to the system, so the timings are equally slow for the
first and second z arrays.

If we set M_MMAP_THRESHOLD to a value large enough that we never use mmap,
we get results like this (also noted by Francesc):

```
$ MALLOC_MMAP_THRESHOLD_=20000000 python timing_question.py
numpy version 1.20.3

 639.5976 microseconds
 641.1489 microseconds
 640.8228 microseconds

 385.0931 microseconds
 384.8406 microseconds
 384.0398 microseconds
```

In this case, setting the mmap threshold disables the dynamic updating of
all the thresholds, so M_TRIM_THRESHOLD remains 128*1024. This means that
each time free() is called, there will almost certainly be enough memory at
the end of the heap to exceed M_TRIM_THRESHOLD, so each timeit iteration
results in memory being allocated from the system and returned to the
system via sbrk(), and that leads to minor page faults when writing to the
temporary arrays in each iteration. That is, we still get thrashing at the
top of the heap. (Apparently not all the calls to free() lead to trimming
the heap, so the performance is better with the second z array.)
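As a quick way to check this analysis without reaching for perf, here is a
minimal sketch (Linux only; it assumes resource.ru_minflt and
/proc/self/maps are available) that reports how many minor page faults each
timeit batch triggers and whether each z buffer ends up in the [heap]
segment or in an anonymous mapping:

```
import resource
import timeit

import numpy as np


def generate_sample(n, rng):
    return rng.normal(scale=1000, size=2*n).view(np.complex128)


def minor_faults():
    # Cumulative count of minor (soft) page faults for this process.
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt


def region_of(arr):
    # Return the /proc/self/maps entry containing arr's data pointer,
    # e.g. "[heap]" for sbrk memory or an anonymous mapping for mmap memory.
    addr = arr.__array_interface__['data'][0]
    with open('/proc/self/maps') as f:
        for line in f:
            lo, hi = (int(x, 16) for x in line.split()[0].split('-'))
            if lo <= addr < hi:
                return line.rstrip()


rng = np.random.default_rng()
n = 250000
timeit_reps = 10000

for label in ('first z', 'second z'):
    z = generate_sample(n, rng)
    print(f"{label} lives in: {region_of(z)}")
    before = minor_faults()
    t = timeit.timeit('z.real**2 + z.imag**2', globals=globals(),
                      number=timeit_reps)
    print(f"{1e6*t/timeit_reps:9.4f} microseconds, "
          f"{minor_faults() - before} minor page faults")
    print()
```

With the default malloc tuning, one would expect a large fault count and an
anonymous mapping for the first z, and a small count and a [heap] address
for the second, matching the timeline above.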
To get good performance for both the first and second z arrays, we can
manually set both the mmap threshold and the trim threshold:

```
$ MALLOC_MMAP_THRESHOLD_=20000000 MALLOC_TRIM_THRESHOLD_=20000000 python timing_question.py
numpy version 1.20.3

 140.9165 microseconds
 148.2753 microseconds
 150.8209 microseconds

 152.6369 microseconds
 152.5829 microseconds
 152.4377 microseconds
```

Or disable trimming completely:

```
$ MALLOC_MMAP_THRESHOLD_=20000000 MALLOC_TRIM_THRESHOLD_=-1 python timing_question.py
numpy version 1.20.3

 144.9151 microseconds
 147.7243 microseconds
 153.9987 microseconds

 155.9898 microseconds
 156.0203 microseconds
 155.9820 microseconds
```

We can also avoid the issue by using an alternative memory management
library such as mimalloc (noted by Francesc) or jemalloc.

Warren