Good to know it is not only on my PC. I have done a fair bit of work trying to find a more efficient sum.
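For context, here is a minimal sketch of the `dotsum` trick discussed in this thread. The thread does not show its implementation, so this is an assumption: summing by taking a BLAS dot product with a vector of ones, which dispatches to an optimized (and possibly multithreaded) BLAS kernel, unlike `np.sum` / `np.add.reduce`.

```python
import numpy as np

def dotsum(a):
    """Sum all elements via a BLAS dot product with a vector of ones.

    Hypothetical reconstruction of the `dotsum` helper benchmarked in this
    thread: np.dot on 1-D float arrays calls into BLAS (OpenBLAS/MKL), which
    is vectorized and may be multithreaded, whereas np.add.reduce is not.
    """
    flat = np.ravel(a)
    return flat.dot(np.ones_like(flat))

a = np.random.default_rng(0).random((1000, 1000))
print(np.isclose(dotsum(a), a.sum()))  # the two sums agree to float tolerance
```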
The only faster option that I have found was PyTorch (although, thinking about it now, maybe that was because it was using MKL; I don't remember). MKL is faster, but I use OpenBLAS.

The Scipp library is parallelized, and its performance becomes similar to `dotsum` for large arrays, but it is slower than both numpy and `dotsum` for sizes below roughly ~200k elements.

Apart from these I ran out of options and simply implemented my own sum, which uses either `np.sum` or `dotsum` depending on which is faster. This chart shows the point where dotsum becomes faster than np.sum:
https://gcdnb.pbrd.co/images/j8n3EsRz5g5v.png?o=1

I am not sure how much (and for how many people) this improvement is needed, but I found several Stack Overflow posts about it when I was looking into this. It is definitely needed by me, though.

Theoretically, if implemented with the same optimisations, sum should be ~2x faster than dotsum (it only has to stream one operand and skips the multiplies). Or am I missing something?

Regards,
DG

> On 16 Feb 2024, at 04:54, Homeier, Derek <dhom...@gwdg.de> wrote:
>
>> On 16 Feb 2024, at 2:48 am, Marten van Kerkwijk <m...@astro.utoronto.ca> wrote:
>>
>>> In [45]: %timeit np.add.reduce(a, axis=None)
>>> 42.8 µs ± 2.44 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>>
>>> In [43]: %timeit dotsum(a)
>>> 26.1 µs ± 718 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>>
>>> But theoretically, sum should be faster than dot product by a fair bit.
>>>
>>> Isn't parallelisation implemented for it?
>>
>> I cannot reproduce that:
>>
>> In [3]: %timeit np.add.reduce(a, axis=None)
>> 19.7 µs ± 184 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>>
>> In [4]: %timeit dotsum(a)
>> 47.2 µs ± 360 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>
>> But almost certainly it is indeed due to optimizations, since .dot uses BLAS, which is highly optimized (at least on some platforms, clearly better on yours than on mine!).
>>
>> I thought .sum() was optimized too, but perhaps less so?
>
> I can confirm at least that it does not seem to use multithreading: with the conda-installed numpy+BLAS I almost exactly reproduce your numbers, whereas linked against my own OpenBLAS build:
>
> In [3]: %timeit np.add.reduce(a, axis=None)
> 19 µs ± 111 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>
> # OMP_NUM_THREADS=1
> In [4]: %timeit dots(a)
> 20.5 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>
> # OMP_NUM_THREADS=8
> In [4]: %timeit dots(a)
> 9.84 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>
> add.reduce shows no difference between the two and always remains at <= 100 % CPU usage.
> dotsum still scales better with larger matrices, e.g. ~4x for 1000x1000.
>
> Cheers,
> Derek
> _______________________________________________
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com
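The hand-rolled sum described above (dispatching to `np.sum` for small arrays and the BLAS dot-with-ones path for large ones) could look roughly like this. The crossover threshold and the names here are hypothetical; the chart puts the crossover somewhere around ~200k elements, but it is machine- and BLAS-dependent, so it should really be measured on the target system.

```python
import numpy as np

# Assumed crossover point where the BLAS path starts to win; the thread's
# chart suggests somewhere around ~200k elements, but this varies by machine.
_CROSSOVER = 200_000

def fast_sum(a):
    """Sum using np.sum for small arrays and a BLAS dot product for large ones.

    A sketch of the hybrid sum described in the thread, not the author's
    actual implementation.
    """
    a = np.asarray(a)
    if a.size < _CROSSOVER:
        return a.sum()
    flat = a.ravel()
    # The "dotsum" path: BLAS streams both operands but is vectorized and
    # may be multithreaded (controlled by e.g. OMP_NUM_THREADS for OpenBLAS).
    return flat.dot(np.ones_like(flat))
```

Note that the dot path allocates a temporary ones array of the same size, so it trades memory traffic for BLAS throughput; for huge arrays a chunked variant reusing one fixed-size ones buffer might be preferable.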