Thanks a million! I will check these thoroughly.

Kevin Sheppard <kevin.k.shepp...@gmail.com> wrote on Fri, Sep 16, 2022 at 16:11:
> Have a look at numexpr (https://github.com/pydata/numexpr). It can
> achieve large speedups in ops like this, at the cost of having to write
> expensive operations as strings, e.g., d = ne.evaluate('a * b + c'). You
> could also write a gufunc in numba that would be memory- and
> access-efficient.
>
> Kevin
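For concreteness, a minimal sketch of the numexpr route suggested above, assuming numexpr is installed and imported as ne (evaluate() also accepts an optional out= array, so the result can be written into a preallocated buffer):

    import numexpr as ne
    import numpy as np

    a = np.random.rand(10_000_000)
    b = np.random.rand(10_000_000)
    c = np.random.rand(10_000_000)

    # numexpr compiles the string expression and evaluates it blockwise,
    # so the a * b intermediate never materializes as a full-size array.
    d = ne.evaluate('a * b + c')

    # Reuse an existing buffer to skip the output allocation as well.
    ne.evaluate('a * b + c', out=d)

numexpr also multithreads the blockwise evaluation by default, which is why it can beat the two-pass d = a * b + c on large arrays.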
> On Fri, Sep 16, 2022 at 8:53 AM 腾刘 <27rabbi...@gmail.com> wrote:
>
>> Thanks a lot for answering this question, but I still have some
>> uncertainty.
>>
>> I'm trying to improve the time efficiency as much as possible, so memory
>> allocation is not my main worry; in my opinion it won't cost much.
>> Instead, memory access is my central concern, because of the cache-miss
>> penalty.
>>
>> In your snippet there are 4 accesses to whole arrays:
>>
>> the access to a (in a *= b)
>> the access to b (in a *= b)
>> the access to c (in c += a)
>> the access to a (in c += a)
>>
>> This is better than d = a * b + c, but I really need a new array d to
>> hold the final result, because I don't want to spoil the data in array c
>> either.
>>
>> So let's replace c += a with d = a + c; this way there are 5 accesses to
>> whole arrays in total.
>>
>> However, under optimal conditions, achievable in a C++ implementation
>> (d[i] = a[i] * b[i] + c[i]), we only need four accesses to whole arrays.
>>
>> On a modern CPU the cost of this kind of simple arithmetic is negligible
>> compared to memory access, so I guess I still need a better way.
>>
>> So much thanks again for your reply!
>>
>> Kevin Sheppard <kevin.k.shepp...@gmail.com> wrote on Fri, Sep 16, 2022 at 15:38:
>>
>>> You can use in-place operators where appropriate to avoid memory
>>> allocation:
>>>
>>> a *= b
>>> c += a
>>>
>>> Kevin
>>>
>>> From: 腾刘 <27rabbi...@gmail.com>
>>> Sent: Friday, September 16, 2022 8:35 AM
>>> To: Discussion of Numerical Python <numpy-discussion@python.org>
>>> Subject: [Numpy-discussion] How to avoid this memory copy overhead in
>>> d=a*b+c?
>>>
>>> Hello everyone, I'm here again to ask a naive question about NumPy
>>> performance.
>>>
>>> As far as I know, NumPy's vectorized operations are very effective
>>> because they utilize SIMD instructions and multiple threads, compared to
>>> index-style programming (using a for loop and assigning each element by
>>> its index).
>>>
>>> I was wondering how fast NumPy can be, so I did some experiments. Take
>>> this simple task as an example:
>>>
>>> a = np.random.rand(10_000_000)
>>> b = np.random.rand(10_000_000)
>>> c = a + b
>>>
>>> To check the performance, I wrote a simple C++ implementation of adding
>>> two arrays, also multi-threaded (compiled with -O3 -mavx2). I found that
>>> the C++ implementation is slightly faster than NumPy (running each 100
>>> times to get a reasonably convincing statistic).
>>>
>>> *Here comes the first question: where does this efficiency gap come
>>> from?* I guess it is because NumPy has to load the shared object, look
>>> up the ufunc wrapper, and only then execute the underlying computation.
>>> Am I right? Am I missing something here?
>>>
>>> Then I did another experiment with the statement d = a * b + c, where
>>> a, b, c and d are all NumPy arrays. I also implemented this logic in C++
>>> and found that the C++ version is about 2 times faster than NumPy on
>>> average (again executed 100 times each).
>>>
>>> I guess this is because in Python we first calculate
>>>
>>> temporary_var = a * b
>>>
>>> and then
>>>
>>> d = temporary_var + c
>>>
>>> so we have an unnecessary memory-transfer overhead. Since each array is
>>> very large, NumPy has to write temporary_var out to memory and then read
>>> it back into cache.
>>>
>>> In C++, however, we can just write d[i] = a[i] * b[i] + c[i], creating
>>> no temporary array and paying no memory-transfer penalty.
>>>
>>> *So the second question is: is there a way to avoid this kind of
>>> overhead?* I've learned that in NumPy we can create our own ufunc with
>>> *frompyfunc*, but it seems to use neither SIMD nor multiple threads,
>>> since it is about 100 times slower than the plain *d = a * b + c*.
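Kevin's other suggestion, a gufunc in numba, addresses exactly the frompyfunc problem above: the inner loop is compiled rather than calling back into Python per element. A minimal sketch, assuming numba is installed (mul_add is just an illustrative name; target='parallel' asks numba to split the elementwise work across threads):

    import numpy as np
    from numba import guvectorize, float64

    # Compiled elementwise kernel: a single fused pass, so the a * b
    # intermediate never exists as a full-size temporary array.
    @guvectorize([(float64, float64, float64, float64[:])], '(),(),()->()',
                 target='parallel')
    def mul_add(a, b, c, out):
        out[0] = a * b + c

    a = np.random.rand(10_000_000)
    b = np.random.rand(10_000_000)
    c = np.random.rand(10_000_000)

    d = mul_add(a, b, c)      # allocates the output array d
    mul_add(a, b, c, d)       # or write into a preallocated output

In the whole-array-access counting used earlier in the thread, this is the four-access pattern of the C++ loop: a, b, and c are each read once and d is written once.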
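Staying within plain NumPy, the ufunc out= argument offers a middle ground related to the a *= b / c += a exchange above: it keeps the in-place access pattern but leaves a and c intact. A sketch:

    import numpy as np

    a = np.random.rand(10_000_000)
    b = np.random.rand(10_000_000)
    c = np.random.rand(10_000_000)

    d = np.empty_like(a)        # allocate the output once; reusable across calls
    np.multiply(a, b, out=d)    # d <- a * b, no temporary array
    d += c                      # d <- d + c, c untouched

This is still five whole-array accesses (a, b, d, then d, c) rather than the four of the fused loop, so for a true single-pass d[i] = a[i] * b[i] + c[i] one of the fused approaches (numexpr, or a compiled gufunc) is needed.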