Good to know it is not only on my PC.

I have done a fair bit of work trying to find a more efficient sum.

The only faster option I have found was PyTorch (although, thinking about it 
now, that may have been because it was using MKL; I don't remember).

MKL is faster, but I use OpenBLAS.

The Scipp library is parallelized, and its performance becomes similar to 
`dotsum` for large arrays, but it is slower than numpy or `dotsum` for sizes 
below roughly ~200k elements.
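For context, the `dotsum` used in the timings in this thread is not shown 
anywhere; presumably it is something like the following (a minimal sketch, 
assuming the trick is a BLAS dot product against a vector of ones):

```python
import numpy as np

def dotsum(a):
    """Sum all elements via a BLAS dot product with a ones vector.

    np.dot dispatches to the linked BLAS (OpenBLAS, MKL, ...), which is
    SIMD-vectorized and, for large inputs, typically multithreaded.
    """
    a = np.asarray(a, dtype=np.float64).ravel()
    return a.dot(np.ones_like(a))
```

Note the trade-off: this allocates a temporary ones array the same size as 
the input, so it buys speed at the cost of extra memory traffic.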

Apart from these I ran out of options and simply implemented my own sum, 
which uses either `np.sum` or `dotsum` depending on which is faster for the 
given size.
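A minimal sketch of such a dispatching sum (the ~200k crossover point is an 
assumption here and is machine- and BLAS-dependent, so in practice it should 
be measured on the target system):

```python
import numpy as np

# Hypothetical crossover point; benchmark per machine/BLAS to pick it.
_CROSSOVER = 200_000

def fast_sum(a):
    """np.sum for small arrays; BLAS dot with a ones vector for large ones."""
    a = np.asarray(a, dtype=np.float64).ravel()
    if a.size < _CROSSOVER:
        return np.sum(a)
    # Large arrays: a BLAS dot product is vectorized and multithreaded.
    return a.dot(np.ones_like(a))
```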

Here is a chart showing the point where `dotsum` becomes faster than 
`np.sum`:
https://gcdnb.pbrd.co/images/j8n3EsRz5g5v.png?o=1

I am not sure how much (and for how many people) this improvement is needed 
or essential, but I found several Stack Overflow posts about it while I was 
looking into this. It is definitely useful to me, though.

Theoretically, if implemented with the same optimisations, sum should be ~2x 
faster than dotsum, since a dot product performs a multiply as well as an add 
per element. Or am I missing something?

Regards,
DG


> On 16 Feb 2024, at 04:54, Homeier, Derek <dhom...@gwdg.de> wrote:
> 
> 
> 
>> On 16 Feb 2024, at 2:48 am, Marten van Kerkwijk <m...@astro.utoronto.ca> 
>> wrote:
>> 
>>> In [45]: %timeit np.add.reduce(a, axis=None)
>>> 42.8 µs ± 2.44 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> 
>>> In [43]: %timeit dotsum(a)
>>> 26.1 µs ± 718 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> 
>>> But theoretically, sum should be faster than a dot product by a fair bit.
>>> 
>>> Isn’t parallelisation implemented for it?
>> 
>> I cannot reproduce that:
>> 
>> In [3]: %timeit np.add.reduce(a, axis=None)
>> 19.7 µs ± 184 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>> 
>> In [4]: %timeit dotsum(a)
>> 47.2 µs ± 360 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>> 
>> But almost certainly it is indeed due to optimizations, since .dot uses
>> BLAS which is highly optimized (at least on some platforms, clearly
>> better on yours than on mine!).
>> 
>> I thought .sum() was optimized too, but perhaps less so?
> 
> 
> I can confirm that, at least, it does not seem to use multithreading: with 
> the conda-installed numpy+BLAS I almost exactly reproduce your numbers, 
> whereas linked against my own OpenBLAS build
> 
> In [3]: %timeit np.add.reduce(a, axis=None)
> 19 µs ± 111 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
> 
> # OMP_NUM_THREADS=1
> In [4]: %timeit dots(a)
> 20.5 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
> 
> # OMP_NUM_THREADS=8
> In [4]: %timeit dots(a)
> 9.84 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
> 
> add.reduce shows no difference between the two and always remains at <= 100 % 
> CPU usage.
> dotsum still scales better with larger matrices, e.g. ~4x for 1000x1000.
> 
> Cheers,
>                                                       Derek
> _______________________________________________
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: dom.grigo...@gmail.com
