This is exactly what numexpr is meant for: https://numexpr.readthedocs.io/projects/NumExpr3/en/latest/
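For the d = a * b + c case discussed below, a minimal sketch of how numexpr is used (assuming numexpr is installed; array sizes taken from the question):

```python
import numpy as np
import numexpr as ne  # third-party: pip install numexpr

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)
c = np.random.rand(10_000_000)

# numexpr compiles the whole expression and evaluates it in cache-sized
# blocks across multiple threads, so the a*b intermediate is never
# written out to main memory as a full-size temporary array.
d = ne.evaluate("a * b + c")

assert np.allclose(d, a * b + c)
```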
In particular, see these benchmarks (made around 10 years ago, but they should still apply):
https://numexpr.readthedocs.io/projects/NumExpr3/en/latest/intro.html#expected-performance

Cheers

On Fri, Sep 16, 2022 at 9:57 AM 腾刘 <27rabbi...@gmail.com> wrote:

> Thanks a lot for answering this question, but I still have some
> uncertainties.
>
> I'm trying to improve time efficiency as much as possible, so memory
> allocation is not my main worry; in my opinion it won't cost much.
> Instead, memory access is my central concern, because of the cache-miss
> penalty.
>
> In your snippet, there are 4 accesses to the whole arrays:
>
>   - the access to a (in a *= b)
>   - the access to b (in a *= b)
>   - the access to c (in c += a)
>   - the access to a (in c += a)
>
> This is better than d = a * b + c, but I really need a new array d to
> hold the final result, because I don't want to spoil the data in
> array c either.
>
> So if we replace c += a with d = a + c, there are 5 whole-array
> accesses in total.
>
> However, under optimal conditions, as achieved by a C++ implementation
> (d[i] = a[i] * b[i] + c[i]), we only need 4 whole-array accesses.
>
> On a modern CPU the cost of this kind of simple arithmetic is
> negligible compared to memory access, so I guess I still need a
> better way.
>
> Thanks again for your reply!
>
> On Fri, Sep 16, 2022 at 15:38, Kevin Sheppard
> <kevin.k.shepp...@gmail.com> wrote:
>
>> You can use in-place operators where appropriate to avoid memory
>> allocation:
>>
>>     a *= b
>>     c += a
>>
>> Kevin
>>
>> *From:* 腾刘 <27rabbi...@gmail.com>
>> *Sent:* Friday, September 16, 2022 8:35 AM
>> *To:* Discussion of Numerical Python <numpy-discussion@python.org>
>> *Subject:* [Numpy-discussion] How to avoid this memory copy overhead
>> in d=a*b+c?
>>
>> Hello everyone, I'm here again to ask a naive question about NumPy
>> performance.
>> As far as I know, NumPy's vectorized operators are very effective
>> because they use SIMD instructions and multiple threads, compared to
>> index-style programming (a "for" loop assigning each element by its
>> index into the array).
>>
>> I was wondering how fast NumPy could be, so I did some experiments.
>> Take this simple task as an example:
>>
>>     a = np.random.rand(10_000_000)
>>     b = np.random.rand(10_000_000)
>>     c = a + b
>>
>> To check the performance, I also wrote a simple multi-threaded C++
>> implementation of adding two arrays (compiled with -O3 -mavx2). I
>> found that the C++ implementation is slightly faster than NumPy
>> (running each 100 times to get reasonably convincing statistics).
>>
>> *Here comes the first question: where does this efficiency gap come
>> from?* I guess it is because NumPy needs to load the shared object,
>> find the ufunc wrapper, and only then execute the underlying
>> computation. Am I right? Am I missing something here?
>>
>> Then I did another experiment with the statement d = a * b + c, where
>> a, b, c and d are all NumPy arrays. I implemented the same logic in
>> C++ and found that C++ is 2 times faster than NumPy on average (also
>> executed 100 times each).
>>
>> I guess this is because in Python we first calculate
>>
>>     temporary_var = a * b
>>
>> and then
>>
>>     d = temporary_var + c
>>
>> so we incur an unnecessary memory-transfer overhead: since each array
>> is very large, NumPy has to write temporary_var out to memory and
>> then read it back into cache.
>>
>> In C++, however, we can just write d[i] = a[i] * b[i] + c[i] and
>> never create a temporary array, avoiding the memory-transfer penalty.
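One plain-NumPy way to avoid the temporary while leaving a, b and c untouched (not spelled out in the thread, but a standard technique) is the out= argument of ufuncs. A sketch; note it still makes two passes over memory rather than the single fused pass of the C++ loop:

```python
import numpy as np

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)
c = np.random.rand(10_000_000)

# Preallocate the result once and reuse it as the scratch buffer:
# no hidden temporary array is allocated, and a, b, c are preserved.
d = np.empty_like(a)
np.multiply(a, b, out=d)  # d = a * b, written directly into d
d += c                    # d = a * b + c, second pass over d and c

assert np.allclose(d, a * b + c)
```

By the question's own access counting this is still 5 whole-array passes (a, b, d, then d, c) against the fused loop's 4, which is why a blocking evaluator like numexpr is the usual answer when that last pass matters.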
>> *So the other question is whether there is a way to avoid this kind
>> of overhead.* I've learned that in NumPy we can create our own ufunc
>> with *frompyfunc*, but it seems there is no SIMD optimization nor
>> multi-threading there, since it is 100 times slower than the
>> *d = a * b + c* way.
>>
>> _______________________________________________
>> NumPy-Discussion mailing list -- numpy-discussion@python.org
>> To unsubscribe send an email to numpy-discussion-le...@python.org
>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
>> Member address: 27rabbi...@gmail.com

-- 
Francesc Alted
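To illustrate the frompyfunc observation above: frompyfunc wraps a Python callable and invokes it once per element, so no SIMD or multi-threading is possible, which is consistent with the ~100x slowdown reported. A minimal sketch (the three-argument lambda is illustrative, not from the thread):

```python
import numpy as np

a = np.random.rand(1_000)
b = np.random.rand(1_000)
c = np.random.rand(1_000)

# frompyfunc(func, nin, nout): the Python-level lambda runs once per
# element, so the loop body cannot be vectorized or parallelized.
fma = np.frompyfunc(lambda x, y, z: x * y + z, 3, 1)

# frompyfunc ufuncs return object-dtype arrays; convert back to float64.
d = fma(a, b, c).astype(np.float64)

assert np.allclose(d, a * b + c)
```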