On Mon, Jan 24, 2022 at 1:42 PM Francesc Alted <fal...@gmail.com> wrote:
> On Mon, Jan 24, 2022 at 11:01 AM Francesc Alted <fal...@gmail.com> wrote:
>
>> On Mon, Jan 24, 2022 at 2:15 AM Warren Weckesser <
>> warren.weckes...@gmail.com> wrote:
>>
>>> Thanks Sebastian for pointing out that the issue is (minor) page
>>> faults, and thanks Francesc for providing the links and the analysis
>>> of the situation. After reading those links and experimenting with
>>> malloc in a couple of C programs, it is clear now what is happening.
>>> The key factors to be aware of are:
>>>
>>> * What the values of M_MMAP_THRESHOLD and M_TRIM_THRESHOLD are and how
>>> they are updated dynamically.
>>> * Minor page faults are triggered by new allocations from the system,
>>> whether via mmap() or via sbrk(). This means avoiding mmap is not
>>> enough to avoid potential problems with minor page faults.
>>>
>> <snip>
>>
>> Thanks for going down the rabbit hole. It is now much clearer what's
>> going on.
>>
>>> If we set M_MMAP_THRESHOLD to a value large enough that we never use
>>> mmap, we get results like this (also noted by Francesc):
>>>
>>> ```
>>> $ MALLOC_MMAP_THRESHOLD_=20000000 python timing_question.py
>>> numpy version 1.20.3
>>>
>>> 639.5976 microseconds
>>> 641.1489 microseconds
>>> 640.8228 microseconds
>>>
>>> 385.0931 microseconds
>>> 384.8406 microseconds
>>> 384.0398 microseconds
>>> ```
>>>
>>> In this case, setting the mmap threshold disables the dynamic updating
>>> of all the thresholds, so M_TRIM_THRESHOLD remains 128*1024. This
>>> means that each time free() is called, there will almost certainly be
>>> enough memory at the end of the heap to exceed M_TRIM_THRESHOLD, so
>>> each iteration in timeit results in memory being allocated from the
>>> system and returned to the system via sbrk(), and that leads to minor
>>> page faults when writing to the temporary arrays in each iteration.
>>> That is, we still get thrashing at the top of the heap. (Apparently
>>> not all the calls of free() lead to trimming the heap, so the
>>> performance is better with the second z array.)
>>>
>> <snip>
>>
>> This is still something that I don't get. I suppose this is very minor,
>> but it would be nice to see the reason for this.
>>
> FWIW, after some more reflection upon this, I think that a possible
> explanation for the above is that in both cases the temporaries are
> returned using mmap(), and hence both incur the zeroing overhead. If
> that is correct, this offers a new opportunity to better calculate
> where the overhead comes from.
>
> Although I suspect that Warren's box and mine (AMD Ryzen 9 5950X
> 16-Core) are pretty similar, in order to compare apples with apples,
> I'm going to use my own measurements for this case:
>
> $ MALLOC_MMAP_THRESHOLD_=20000000 python prova2.py
> numpy version 1.21.2
>
> 610.0853 microseconds
> 610.7483 microseconds
> 611.1118 microseconds
>
> 369.2480 microseconds
> 369.6340 microseconds
> 369.8093 microseconds
>
> So, in this case, the cost of a minor page fault is ~240 ns
> ((610us - 369us) / 1000 pages). Now, the times for the original bench
> on my box:
>
> $ python prova2.py
> numpy version 1.21.2
>
> 598.2740 microseconds
> 609.1857 microseconds
> 609.0713 microseconds
>
> 137.5792 microseconds
> 136.7661 microseconds
> 136.8159 microseconds
>
> Here the temporaries for the second z do not use mmap() nor incur page
> faults, so they are ((369us - 137us) / 1000 pages) = 233 ns/page
> faster, which should be related to zeroing. IIRC, for my system,
> memset() can go up to 28 GB/s, so that accounts for 146 ns of the
> 233 ns. That leaves an additional 87 ns/page which, as it is very close
> to the memory latency of my box (around 78 ns, see
> https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-dive-review-5950x-5900x-5800x-and-5700x-tested/5),
> I think (wild guessing again) is somehow related to memory management
> unit (MMU) operation (but not totally sure).
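For anyone who wants to redo the arithmetic above, here is a quick
back-of-the-envelope sketch. It only re-derives the per-page figures from
the numbers quoted in the thread; the 4 KiB page size, the ~1000 pages
touched per iteration and the ~28 GB/s memset() bandwidth are the
assumptions stated above, not new measurements:

```python
# Back-of-the-envelope check of the per-page figures quoted above.
PAGE = 4096                      # bytes per (small) page, assumed
N_PAGES = 1000                   # pages touched per iteration (from the thread)

slow_mmap_off = 610e-6           # s/iter, MALLOC_MMAP_THRESHOLD_=20000000, first z
fast_mmap_off = 369e-6           # s/iter, MALLOC_MMAP_THRESHOLD_=20000000, second z
fast_default  = 137e-6           # s/iter, default malloc, second z

per_page_a = (slow_mmap_off - fast_mmap_off) / N_PAGES   # ~241 ns/page
per_page_b = (fast_mmap_off - fast_default) / N_PAGES    # ~232 ns/page
zeroing    = PAGE / 28e9                                 # ~146 ns to memset one page
residual   = per_page_b - zeroing                        # ~86 ns, near the ~78 ns DRAM latency

print(f"{per_page_a * 1e9:.0f} ns/page, {per_page_b * 1e9:.0f} ns/page, "
      f"zeroing {zeroing * 1e9:.0f} ns/page, residual {residual * 1e9:.0f} ns/page")
```

(The quoted 233/87 ns come from rounding 369 - 137 to 233 us; the sketch
prints ~232/~86 ns, which is the same thing within the noise of the runs.)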
For the record, while lying in bed last night I could not get rid of the
feeling that my analysis above was wrong. Indeed, after reviewing it this
morning, I see that yesterday I got the results reversed: the ~240 ns/page
figure is actually for memset() (and possibly other costs), whereas the
~233 ns/page figure is for the page fault instead. Fortunately both figures
are close enough that the outcome of the analysis remains essentially the
same. At any rate, I am guilty of being too naive in trying to understand
(and measure) where the latencies come from: memory allocation, and in
particular virtual memory, are complex beasts that cannot be completely
understood without spending half of your life on them (just for Linux, see
e.g. https://rigtorp.se/virtual-memory/).
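In case it is useful to anyone else poking at this, here is a tiny sketch
of how to watch the minor-fault counter directly from Python instead of
inferring it from timings. The original timing_question.py / prova2.py
scripts are not shown in this thread, so the array size below (1000 pages
of 4 KiB worth of float64) is only a stand-in for whatever they actually
allocate:

```python
# Count minor page faults per iteration for a temporary-creating expression.
import resource

import numpy as np

x = np.ones(512_000)   # 512000 * 8 bytes = 4,096,000 bytes = 1000 pages of 4 KiB
y = np.ones(512_000)

def minflt():
    # ru_minflt is the cumulative count of minor (soft) page faults for this process
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt

n = 1000
before = minflt()
for _ in range(n):
    z = x + y          # allocates (and frees) a fresh temporary each time around
after = minflt()
print(f"{(after - before) / n:.1f} minor faults per iteration")
```

Running it as-is and then with MALLOC_MMAP_THRESHOLD_=20000000 set (as in
the runs quoted above) should make the difference show up directly in the
fault counts rather than only in the timings.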
--
Francesc Alted