On Mon, Jan 24, 2022 at 11:01 AM Francesc Alted <fal...@gmail.com> wrote:
> On Mon, Jan 24, 2022 at 2:15 AM Warren Weckesser <warren.weckes...@gmail.com> wrote:
>
>> Thanks Sebastian for pointing out that the issue is (minor) page
>> faults, and thanks Francesc for providing the links and the analysis
>> of the situation. After reading those links and experimenting with
>> malloc in a couple of C programs, it is now clear what is happening.
>> The key factors to be aware of are:
>>
>> * What the values of M_MMAP_THRESHOLD and M_TRIM_THRESHOLD are, and
>>   how they are updated dynamically.
>> * Minor page faults are triggered by new allocations from the system,
>>   whether via mmap() or via sbrk(). This means that avoiding mmap is
>>   not enough to avoid potential problems with minor page faults.
>
> <snip>
>
> Thanks for going down the rabbit hole. It is now much clearer what's
> going on.
>
>> If we set M_MMAP_THRESHOLD to a value large enough that mmap is never
>> used, we get results like this (also noted by Francesc):
>>
>> ```
>> $ MALLOC_MMAP_THRESHOLD_=20000000 python timing_question.py
>> numpy version 1.20.3
>>
>> 639.5976 microseconds
>> 641.1489 microseconds
>> 640.8228 microseconds
>>
>> 385.0931 microseconds
>> 384.8406 microseconds
>> 384.0398 microseconds
>> ```
>>
>> In this case, setting the mmap threshold disables the dynamic updating
>> of all the thresholds, so M_TRIM_THRESHOLD remains 128*1024. This
>> means that each time free() is called, there will almost certainly be
>> enough free memory at the end of the heap to exceed M_TRIM_THRESHOLD,
>> so each iteration in timeit results in memory being allocated from the
>> system and returned to the system via sbrk(), and that leads to minor
>> page faults when writing to the temporary arrays in each iteration.
>> That is, we still get thrashing at the top of the heap. (Apparently
>> not all calls to free() lead to trimming the heap, so the performance
>> is better with the second z array.)
>
> <snip>
>
> This is still something that I don't get. I suppose it is very minor,
> but it would be nice to see the reason for it.

FWIW, after some more reflection on this, I think a possible explanation
for the above is that in both cases the temporaries are returned using
mmap(), and hence both incur the zeroing overhead. If that is correct,
it offers a new opportunity to better pin down where the overhead comes
from.

Although I suspect that Warren's box and mine (AMD Ryzen 9 5950X
16-Core) are pretty similar, in order to compare apples with apples I am
going to use my own measurements for this case:

$ MALLOC_MMAP_THRESHOLD_=20000000 python prova2.py
numpy version 1.21.2

610.0853 microseconds
610.7483 microseconds
611.1118 microseconds

369.2480 microseconds
369.6340 microseconds
369.8093 microseconds

So, in this case, the cost of a minor page fault is ~240 ns
((610us - 369us) / 1000 pages).

Now, the times for the original benchmark on my box:

$ python prova2.py
numpy version 1.21.2

598.2740 microseconds
609.1857 microseconds
609.0713 microseconds

137.5792 microseconds
136.7661 microseconds
136.8159 microseconds

Here the temporaries for the second z neither use mmap nor incur page
faults, so they are ((369us - 137us) / 1000 pages) = 233 ns/page faster;
this difference should be the cost of zeroing. IIRC, memset() on my
system can reach up to 28 GB/s, which accounts for 146 ns of the 233 ns.
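For anyone who wants to redo this arithmetic, here is a quick
back-of-the-envelope check in Python. It is only a sketch: it assumes
the standard 4 KiB page size of x86-64 Linux and uses the timings quoted
above; the small differences from the 233 ns and 87 ns figures in the
text are just rounding.

```
# Per-page costs derived from the quoted timings (microseconds for
# ~1000 pages' worth of temporaries per run).
PAGE_BYTES = 4096   # assumed page size (x86-64 Linux default)
PAGES = 1000

fault_ns = (610 - 369) / PAGES * 1000    # us difference -> ns per page
print(f"minor page fault: ~{fault_ns:.0f} ns/page")    # ~241 ns

zero_ns = (369 - 137) / PAGES * 1000     # us difference -> ns per page
print(f"zeroing overhead: ~{zero_ns:.0f} ns/page")     # ~232 ns

# Share explained by raw memset() bandwidth (~28 GB/s on this box):
memset_ns = PAGE_BYTES / 28e9 * 1e9
print(f"memset at 28 GB/s: ~{memset_ns:.0f} ns/page")  # ~146 ns

# The remainder, close to this machine's ~78 ns DRAM latency:
print(f"unexplained: ~{zero_ns - memset_ns:.0f} ns/page")  # ~86 ns
```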
That leaves an additional ~87 ns/page which, as it is very close to the
memory latency of my box (around 78 ns, see
https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-dive-review-5950x-5900x-5800x-and-5700x-tested/5),
I am guessing (wildly, again) is somehow related to the operation of the
memory management unit (MMU), but I am not totally sure. I am pretty
sure we do not yet understand everything completely (e.g. I have left
CPU caches out of the analysis), but hey, going down these rabbit holes
is a lot of fun indeed.

--
Francesc Alted