On Mon, Jan 24, 2022 at 2:15 AM Warren Weckesser <warren.weckes...@gmail.com>
wrote:

> Thanks Sebastian for pointing out that the issue is (minor) page
> faults, and thanks Francesc for providing the links and the analysis
> of the situation.  After reading those links and experimenting with
> malloc in a couple C programs, it is clear now what is happening.  The
> key factors to be aware of are:


> * What the values of M_MMAP_THRESHOLD and M_TRIM_THRESHOLD are and how
> they are updated dynamically.
> * Minor page faults are triggered by new allocations from the system,
> whether via mmap or via sbrk(). This means avoiding mmap is not enough
> to avoid potential problems with minor page faults.
>
<snip>

Thanks for going down the rabbit hole.  It is now much clearer what's
going on.
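
For anyone who wants to see the faults directly, here is a minimal sketch
(Linux/Unix only; the ~8 MB array size is an arbitrary choice of mine, not
the one from timing_question.py) that watches the minor page fault counter
around a fresh allocation:

```
import resource
import numpy as np

def minflt():
    # ru_minflt is the number of minor page faults of this process so far.
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt

z = np.empty(1_000_000)      # allocate ~8 MB without touching the pages

before = minflt()
z[:] = 1.0                   # first write: every new page must be faulted in
after_first = minflt()

z[:] = 2.0                   # pages already mapped: few or no new faults
after_second = minflt()

print("faults on first write: ", after_first - before)
print("faults on second write:", after_second - after_first)
```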

> If we set M_MMAP_THRESHOLD to a value large enough that we never use
> mmap, we get results like this (also noted by Francesc):
>
> ```
> $ MALLOC_MMAP_THRESHOLD_=20000000 python timing_question.py
> numpy version 1.20.3
>
>  639.5976 microseconds
>  641.1489 microseconds
>  640.8228 microseconds
>
>  385.0931 microseconds
>  384.8406 microseconds
>  384.0398 microseconds
> ```
>
> In this case, setting the mmap threshold disables the dynamic updating
> of all the thresholds, so M_TRIM_THRESHOLD remains 128*1024.  This
> means each time free() is called, there will almost certainly be
> enough memory at the end of the heap that exceeds M_TRIM_THRESHOLD and
> so each iteration in timeit results in memory being allocated from the
> system and returned to the system via sbrk(), and that leads to minor
> page faults when writing to the temporary arrays in each iteration.
> That is, we still get thrashing at the top of the heap.  (Apparently
> not all the calls of free() lead to trimming the heap, so the
> performance is better with the second z array.)
>

<snip>

This last bit is still something that I don't get.  I suppose it is a very
minor point, but it would be nice to understand the reason for it.
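
As a side note, the thresholds do not have to be set through the
environment; they can also be raised from inside the process with
mallopt().  A minimal ctypes sketch (glibc on Linux only; the 32 MB value
is just an arbitrary choice for illustration):

```
import ctypes

libc = ctypes.CDLL("libc.so.6", use_errno=True)

# Parameter constants from glibc's <malloc.h>.
M_TRIM_THRESHOLD = -1
M_MMAP_THRESHOLD = -3

# mallopt() returns 1 on success and 0 on failure.  Setting either parameter
# explicitly also disables glibc's dynamic adjustment of the thresholds.
assert libc.mallopt(M_MMAP_THRESHOLD, 32 * 1024 * 1024) == 1
assert libc.mallopt(M_TRIM_THRESHOLD, 32 * 1024 * 1024) == 1

# From here on, blocks below 32 MB stay on the heap, and free() will not
# trim the top of the heap unless more than 32 MB is free there.
```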

At any rate, with what we know now, one can estimate the origin of the
overhead for the first z array. Going back to the original timing:

 602.1964 microseconds
 601.0663 microseconds
 601.0728 microseconds

 139.5745 microseconds
 139.1559 microseconds
 138.9791 microseconds

The 460 microseconds (600 - 140) are the overhead of dealing with the
temporaries.  That means that, as there are around 1000 page faults, each
page fault accounts for around 460 nanoseconds.  Taking into account that
the memory latency for this machine is around 80 nanoseconds, most of the
overhead must come from some other source.  My (wild) guess is that it is
the zeroing of the pages that the OS has to do for each anonymous mmap()
backing the temporaries.
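
One way to put a rough number on that guess is to time a first and a second
write pass over a plain anonymous mapping: the difference between the two
passes is roughly the per-page cost of the fault plus the zeroing.  A quick
sketch (the 64 MB size is arbitrary, and the Python loop adds some overhead
of its own):

```
import mmap
import time

N = 64 * 1024 * 1024
PAGE = 4096

buf = mmap.mmap(-1, N)               # anonymous, demand-zero mapping

t0 = time.perf_counter()
for off in range(0, N, PAGE):
    buf[off] = 1                     # first touch: minor fault + page zeroing
t1 = time.perf_counter()
for off in range(0, N, PAGE):
    buf[off] = 2                     # pages already mapped: no new faults
t2 = time.perf_counter()

pages = N // PAGE
print("first touch : %.0f ns/page" % ((t1 - t0) / pages * 1e9))
print("second touch: %.0f ns/page" % ((t2 - t1) / pages * 1e9))
print("difference  : %.0f ns/page" % ((t1 - t0 - (t2 - t1)) / pages * 1e9))
```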

All in all, hats off to Warren for this entertaining piece of investigation!
--
Francesc Alted