Thanks a million! I will check these thoroughly.

Kevin Sheppard <kevin.k.shepp...@gmail.com> wrote on Fri, Sep 16, 2022 at 16:11:
> Have a look at numexpr (https://github.com/pydata/numexpr). It can
> achieve large speedups in ops like this, at the cost of having to write
> expensive operations as strings, e.g., d = ne.evaluate('a * b + c'). You
> could also write a gufunc in numba that would be memory- and
> access-efficient.
>
> Kevin
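For concreteness, a minimal sketch of the numexpr route suggested above, assuming numexpr is installed and imported as ne (evaluate() also accepts an optional out= array, so the result can be written into a preallocated buffer):

    import numexpr as ne
    import numpy as np

    a = np.random.rand(10_000_000)
    b = np.random.rand(10_000_000)
    c = np.random.rand(10_000_000)

    # numexpr compiles the string expression and evaluates it blockwise,
    # so the a * b intermediate never materializes as a full-size array.
    d = ne.evaluate('a * b + c')

    # Reuse an existing buffer to skip the output allocation as well.
    ne.evaluate('a * b + c', out=d)

numexpr also multithreads the blockwise evaluation by default, which is why it can beat the two-pass d = a * b + c on large arrays.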
> On Fri, Sep 16, 2022 at 8:53 AM 腾刘 <27rabbi...@gmail.com> wrote:
>
>> Thanks a lot for answering this question, but I still have some
>> uncertainty.
>>
>> I'm trying to improve the time efficiency as much as possible, so memory
>> allocation is not my main worry; in my opinion it won't cost much.
>> Instead, memory access is my central concern, because of the cache-miss
>> penalty.
>>
>> In your snippet there are 4 accesses to whole arrays:
>>
>> the access to a (in a *= b)
>> the access to b (in a *= b)
>> the access to c (in c += a)
>> the access to a (in c += a)
>>
>> This is better than d = a * b + c, but I really need a new array d to
>> hold the final result, because I don't want to spoil the data in array c
>> either.
>>
>> So let's replace c += a with d = a + c; this way there are 5 accesses to
>> whole arrays in total.
>>
>> However, under optimal conditions, achievable in a C++ implementation
>> (d[i] = a[i] * b[i] + c[i]), we only need four accesses to whole arrays.
>>
>> On a modern CPU the cost of this kind of simple arithmetic is negligible
>> compared to memory access, so I guess I still need a better way.
>>
>> So much thanks again for your reply!
>>
>> Kevin Sheppard <kevin.k.shepp...@gmail.com> wrote on Fri, Sep 16, 2022 at 15:38:
>>
>>> You can use in-place operators where appropriate to avoid memory
>>> allocation:
>>>
>>> a *= b
>>> c += a
>>>
>>> Kevin
>>>
>>> From: 腾刘 <27rabbi...@gmail.com>
>>> Sent: Friday, September 16, 2022 8:35 AM
>>> To: Discussion of Numerical Python <numpy-discussion@python.org>
>>> Subject: [Numpy-discussion] How to avoid this memory copy overhead in
>>> d=a*b+c?
>>>
>>> Hello everyone, I'm here again to ask a naive question about NumPy
>>> performance.
>>>
>>> As far as I know, NumPy's vectorized operations are very effective
>>> because they utilize SIMD instructions and multiple threads, compared to
>>> index-style programming (using a for loop and assigning each element by
>>> its index).
>>>
>>> I was wondering how fast NumPy can be, so I did some experiments. Take
>>> this simple task as an example:
>>>
>>> a = np.random.rand(10_000_000)
>>> b = np.random.rand(10_000_000)
>>> c = a + b
>>>
>>> To check the performance, I wrote a simple C++ implementation of adding
>>> two arrays, also multi-threaded (compiled with -O3 -mavx2). I found that
>>> the C++ implementation is slightly faster than NumPy (running each 100
>>> times to get a reasonably convincing statistic).
>>>
>>> *Here comes the first question: where does this efficiency gap come
>>> from?* I guess it is because NumPy has to load the shared object, look
>>> up the ufunc wrapper, and only then execute the underlying computation.
>>> Am I right? Am I missing something here?
>>>
>>> Then I did another experiment with the statement d = a * b + c, where
>>> a, b, c and d are all NumPy arrays. I also implemented this logic in C++
>>> and found that the C++ version is about 2 times faster than NumPy on
>>> average (again executed 100 times each).
>>>
>>> I guess this is because in Python we first calculate
>>>
>>> temporary_var = a * b
>>>
>>> and then
>>>
>>> d = temporary_var + c
>>>
>>> so we have an unnecessary memory-transfer overhead. Since each array is
>>> very large, NumPy has to write temporary_var out to memory and then read
>>> it back into cache.
>>>
>>> In C++, however, we can just write d[i] = a[i] * b[i] + c[i], creating
>>> no temporary array and paying no memory-transfer penalty.
>>>
>>> *So the second question is: is there a way to avoid this kind of
>>> overhead?* I've learned that in NumPy we can create our own ufunc with
>>> *frompyfunc*, but it seems to use neither SIMD nor multiple threads,
>>> since it is about 100 times slower than the plain *d = a * b + c*.
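Kevin's other suggestion, a gufunc in numba, addresses exactly the frompyfunc problem above: the inner loop is compiled rather than calling back into Python per element. A minimal sketch, assuming numba is installed (mul_add is just an illustrative name; target='parallel' asks numba to split the elementwise work across threads):

    import numpy as np
    from numba import guvectorize, float64

    # Compiled elementwise kernel: a single fused pass, so the a * b
    # intermediate never exists as a full-size temporary array.
    @guvectorize([(float64, float64, float64, float64[:])], '(),(),()->()',
                 target='parallel')
    def mul_add(a, b, c, out):
        out[0] = a * b + c

    a = np.random.rand(10_000_000)
    b = np.random.rand(10_000_000)
    c = np.random.rand(10_000_000)

    d = mul_add(a, b, c)      # allocates the output array d
    mul_add(a, b, c, d)       # or write into a preallocated output

In the whole-array-access counting used earlier in the thread, this is the four-access pattern of the C++ loop: a, b, and c are each read once and d is written once.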
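Staying within plain NumPy, the ufunc out= argument offers a middle ground related to the a *= b / c += a exchange above: it keeps the in-place access pattern but leaves a and c intact. A sketch:

    import numpy as np

    a = np.random.rand(10_000_000)
    b = np.random.rand(10_000_000)
    c = np.random.rand(10_000_000)

    d = np.empty_like(a)        # allocate the output once; reusable across calls
    np.multiply(a, b, out=d)    # d <- a * b, no temporary array
    d += c                      # d <- d + c, c untouched

This is still five whole-array accesses (a, b, d, then d, c) rather than the four of the fused loop, so for a true single-pass d[i] = a[i] * b[i] + c[i] one of the fused approaches (numexpr, or a compiled gufunc) is needed.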