On Sun, 2021-03-14 at 17:15 +1100, Juan Nunez-Iglesias wrote:
> Hi Pierre,
> 
> If you’re able to compile NumPy locally and you have reliable
> benchmarks, you can write a script that tests the runtime of your
> benchmark and reports it as a test pass/fail. You can then use “git
> bisect run” to automatically find the commit that caused the issue.
> That will help narrow down the discussion before it gets completely
> derailed a second time. 😂
> 
> https://lwn.net/Articles/317154/
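
(For concreteness, a pass/fail wrapper for `git bisect run` along those
lines could be as small as the sketch below; the benchmark body, array
size, and threshold are placeholders, not a real NumPy benchmark.)

    # bisect_bench.py -- exit 0 ("good") if the benchmark is fast enough,
    # non-zero ("bad") otherwise, so that
    #     git bisect run python bisect_bench.py
    # can home in on the first slow commit.
    import sys
    import timeit

    import numpy as np

    def run_benchmark():
        # Stand-in for whatever benchmark actually shows the regression;
        # returns seconds per call.
        arr = np.empty((1000, 1000))
        return min(timeit.repeat("arr + arr", globals={"arr": arr},
                                 number=100, repeat=5)) / 100

    THRESHOLD = 2e-3  # seconds; pick this from a known-good commit
    sys.exit(0 if run_benchmark() < THRESHOLD else 1)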


Let me share this partial benchmark result for a branch I just worked
on in NumPy:

       before           after         ratio
     [c5de5b5c]       [2d9e11ea]
     <main>           <splitup-faster-argparsing>
+     2.12±0.01μs      3.69±0.02μs     1.74  bench_io.Copy.time_cont_assign('float32')
+     22.6±0.08μs       36.0±0.2μs     1.59  bench_io.CopyTo.time_copyto_sparse
+      49.4±0.8μs       55.2±0.1μs     1.12  bench_io.CopyTo.time_copyto_8_sparse
-     7.40±0.06μs      4.11±0.01μs     0.56  bench_io.CopyTo.time_copyto_dense
-     6.99±0.05μs         3.77±0μs     0.54  bench_io.Copy.time_cont_assign('float64')
-     6.94±0.02μs      3.73±0.01μs     0.54  bench_io.Copy.time_cont_assign('complex64')


That looks weird!  The benchmark sometimes speeds up by a factor of
almost 2, and sometimes the (de facto) same code slows down by just as
much?  (Focus on `time_cont_assign` with float64 vs. float32.)

Even better: I know 100% that no related code is touched!  The core of
that benchmark is just:

     array[...] = 1

and I did not even come close to any code related to that operation.
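
For reference, a minimal stand-alone version of that pattern looks
roughly like this (the array shape and repeat counts are my guesses,
not necessarily what the ASV benchmark uses):

    import timeit

    import numpy as np

    # Time a full-array assignment into a contiguous array for the three
    # dtypes from the table above.
    for typename in ("float32", "float64", "complex64"):
        arr = np.empty((500, 50), dtype=typename)
        best = min(timeit.repeat("arr[...] = 1", globals={"arr": arr},
                                 number=10_000, repeat=5)) / 10_000
        print(f"{typename}: {best * 1e6:.2f} us per assignment")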


I have, as before, tried quite a few things (though not as many
compiler-flag experiments as in Victor Stinner's blog post), such as
enabling/disabling huge pages, disabling address-space randomization,
and disabling the NumPy small-array cache.

Note that the results are *stable*, as in: On this branch, I get
extremely reliable results for the benchmark [1]!


As you noticed, I have also seen these (or similar) changes "toggle",
e.g. when copying the array multiple times.  And I have dug down into
profiling one instance at the instruction level with `perf`, so I know
for a fact that it is memory access speed.  (Which is a no-brainer
here; the operations are obviously memory- or even cache-speed bound.)


The point I was hoping to make is: it's complicated, and I am not
holding my breath that you can find an answer without digging much
deeper.
The blog post from Victor Stinner gave me the thought that profile-
guided optimization *might* be a way to avoid some random fluctuations,
but I have not checked whether the inner loop of the code actually
compiles to different machine code.


I would hope that someone comes along and "just knows" what is going
on.  But I don't know where to ask or what to google for.

My best bets right now (they may be terrible!) are:

* Profile-guided optimization might help (as in, stabilize the
compiler output against *random* changes in the code), and is probably
involved in some way or another.  But Victor Stinner was timing Python
itself, which may not have any massively memory-bound operations (which
are the "big" things here).

* Maybe try to make the NumPy allocator align all its allocations to
much larger boundaries, such as the CPU cache-line size (see the sketch
after this list).  But I think I tried to check whether alignment
matters, and it didn't seem to.  Also, the arrays feel large enough
that it shouldn't matter?

* CPU L1/L2 caching uses a lot of fancy heuristics these days.  Maybe
to really understand what's going on, you would have to drill into what
the CPU caches are doing here?
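
For the alignment bet above: one way to poke at it without touching the
allocator is to over-allocate a byte buffer and slice into it.  A rough
sketch (the helper is made up for illustration, it is not NumPy API):

    import numpy as np

    def make_array(shape, dtype, alignment=64, offset=0):
        # Over-allocate, then slice so that the data pointer sits `offset`
        # bytes past an `alignment`-byte boundary (offset=0 -> aligned).
        dtype = np.dtype(dtype)
        nbytes = int(np.prod(shape)) * dtype.itemsize
        buf = np.empty(nbytes + alignment + offset, dtype=np.uint8)
        start = (-buf.ctypes.data) % alignment + offset
        return buf[start:start + nbytes].view(dtype).reshape(shape)

    aligned = make_array((1000, 1000), np.float64)             # cache-line aligned
    shifted = make_array((1000, 1000), np.float64, offset=8)   # deliberately off
    print(aligned.ctypes.data % 64, shifted.ctypes.data % 64)  # -> 0 8

Timing the same operation on `aligned` vs. `shifted` should show fairly
quickly whether allocation alignment is part of the story.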


The only thing I do know for sure right now is that this is a rabbit
hole I would love to understand, but I don't really want to spend days
on it just to get nowhere.

Cheers,

Sebastian



[1] The run above is without address-space randomization; it feels
even more stable than the others.  But that doesn't matter much, since
we average in any case, so ASLR is probably useless here and maybe even
detrimental.


> 
> Juan. 
> 
> > On 13 Mar 2021, at 10:34 am, PIERRE AUGIER <
> > pierre.aug...@univ-grenoble-alpes.fr> wrote:
> > 
> > Hi,
> > 
> > I tried to compile Numpy with `pip install numpy==1.20.1 --no-
> > binary numpy --force-reinstall` and I can reproduce the regression.
> > 
> > Good news, I was able to reproduce the difference with only Numpy
> > 1.20.1. 
> > 
> > Arrays prepared with (`df` is a Pandas dataframe)
> > 
> > arr = df.values.copy()
> > 
> > or 
> > 
> > arr = np.ascontiguousarray(df.values)
> > 
> > lead to "slow" execution while arrays prepared with
> > 
> > arr = np.copy(df.values)
> > 
> > lead to faster execution.
> > 
> > arr.copy() and np.copy(arr) do not give the same result, with arr
> > obtained from a Pandas dataframe with arr = df.values. It's strange
> > because type(df.values) gives <class 'numpy.ndarray'>, so I would
> > expect arr.copy() and np.copy(arr) to give exactly the same result.
> > 
> > Note that I think I'm doing quite serious and reproducible
> > benchmarks. I also checked that this regression is reproducible on
> > another computer.
> > 
> > Cheers,
> > 
> > Pierre
> > 
> > ----- Original Message -----
> > > From: "Sebastian Berg" <sebast...@sipsolutions.net>
> > > To: "numpy-discussion" <numpy-discussion@python.org>
> > > Sent: Friday, 12 March 2021 22:50:24
> > > Subject: Re: [Numpy-discussion] Looking for a difference between
> > > NumPy 1.19.5 and 1.20 explaining a perf regression with
> > > Pythran
> > 
> > > > On Fri, 2021-03-12 at 21:36 +0100, PIERRE AUGIER wrote:
> > > > Hi,
> > > > 
> > > > I'm looking for a difference between NumPy 1.19.5 and 1.20 which
> > > > could explain a performance regression (~15 %) with Pythran.
> > > > 
> > > > I observe this regression with the script
> > > > https://github.com/paugier/nbabel/blob/master/py/bench.py
> > > > 
> > > > Pythran reimplements NumPy, so it is not about NumPy code for
> > > > computation. However, Pythran of course uses the native array
> > > > contained in a NumPy array. I'm quite sure that something has
> > > > changed between NumPy 1.19.5 and 1.20 (or between the
> > > > corresponding wheels?) since I don't get the same performance
> > > > with NumPy 1.20. I checked that the values in the arrays are the
> > > > same and that the flags characterizing the arrays are also the
> > > > same.
> > > > 
> > > > Good news, I'm now able to obtain the performance difference
> > > > just with NumPy 1.19.5. In this code, I load the data with
> > > > Pandas and need to prepare contiguous NumPy arrays to give them
> > > > to Pythran. With NumPy 1.19.5, if I use np.copy I get better
> > > > performance than with np.ascontiguousarray. With NumPy 1.20,
> > > > both functions create arrays giving the same performance with
> > > > Pythran (again, less good than with NumPy 1.19.5).
> > > > 
> > > > Note that this code is very efficient (more than 100 times
> > > > faster than using NumPy), so I guess that things like alignment
> > > > or memory location can lead to such a difference.
> > > > 
> > > > More details in this issue
> > > > https://github.com/serge-sans-paille/pythran/issues/1735
> > > > 
> > > > Any help to understand what has changed would be greatly
> > > > appreciated!
> > > > 
> > > 
> > > If you want to really dig into this, it would be good to do
> > > profiling to find out where the differences are.
> > > 
> > > Without that, I don't have much appetite to investigate
> > > personally. The
> > > reason is that fluctuations of ~30% (or even much more) when
> > > running
> > > the NumPy benchmarks are very common.
> > > 
> > > I am not aware of an immediate change in NumPy, especially since
> > > you are talking about Pythran, and only the memory space or the
> > > interface code should matter.
> > > As to the interface code... I would expect it to be quite a bit
> > > faster,
> > > not slower.
> > > There was no change around data allocation, so at best what you
> > > are
> > > seeing is a different pattern in how the "small array cache" ends
> > > up
> > > being used.
> > > 
> > > 
> > > Unfortunately, getting stable benchmarks that reflect code
> > > changes
> > > exactly is tough...  Here is a nice blog post from Victor Stinner
> > > where
> > > he had to go as far as using "profile guided compilation" to
> > > avoid
> > > fluctuations:
> > > 
> > > https://vstinner.github.io/journey-to-stable-benchmark-deadcode.html
> > > 
> > > I somewhat hope that this is also the reason for the huge
> > > fluctuations
> > > we see in the NumPy benchmarks due to absolutely unrelated code
> > > changes.
> > > But I did not have the energy to try it (and a probably fixed bug
> > > in
> > > gcc makes it a bit harder right now).
> > > 
> > > Cheers,
> > > 
> > > Sebastian
> > > 
> > > 
> > > 
> > > 
> > > > Cheers,
> > > > Pierre
