This is exactly what numexpr is meant for: https://numexpr.readthedocs.io/projects/NumExpr3/en/latest/
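For the d = a * b + c case discussed below, a minimal sketch of how numexpr is used (assuming numexpr is installed; array sizes taken from the question):

```python
import numpy as np
import numexpr as ne  # third-party: pip install numexpr

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)
c = np.random.rand(10_000_000)

# numexpr compiles the whole expression and evaluates it in cache-sized
# blocks across multiple threads, so the a*b intermediate is never
# written out to main memory as a full-size temporary array.
d = ne.evaluate("a * b + c")

assert np.allclose(d, a * b + c)
```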
In particular, see these benchmarks (made around 10 years ago, but they should still apply):
https://numexpr.readthedocs.io/projects/NumExpr3/en/latest/intro.html#expected-performance

Cheers

On Fri, Sep 16, 2022 at 9:57 AM 腾刘 <27rabbi...@gmail.com> wrote:

> Thanks a lot for answering this question, but I still have some
> uncertainties.
>
> I'm trying to improve time efficiency as much as possible, so memory
> allocation is not my main worry; in my opinion it won't cost much.
> Instead, memory access is my central concern, because of the cache-miss
> penalty.
>
> In your snippet, there are 4 accesses to the whole arrays:
>
>   - the access to a (in a *= b)
>   - the access to b (in a *= b)
>   - the access to c (in c += a)
>   - the access to a (in c += a)
>
> This is better than d = a * b + c, but I really need a new array d to
> hold the final result, because I don't want to spoil the data in
> array c either.
>
> So if we replace c += a with d = a + c, there are 5 whole-array
> accesses in total.
>
> However, under optimal conditions, as achieved by a C++ implementation
> (d[i] = a[i] * b[i] + c[i]), we only need 4 whole-array accesses.
>
> On a modern CPU the cost of this kind of simple arithmetic is
> negligible compared to memory access, so I guess I still need a
> better way.
>
> Thanks again for your reply!
>
> On Fri, Sep 16, 2022 at 15:38, Kevin Sheppard
> <kevin.k.shepp...@gmail.com> wrote:
>
>> You can use in-place operators where appropriate to avoid memory
>> allocation:
>>
>>     a *= b
>>     c += a
>>
>> Kevin
>>
>> *From:* 腾刘 <27rabbi...@gmail.com>
>> *Sent:* Friday, September 16, 2022 8:35 AM
>> *To:* Discussion of Numerical Python <numpy-discussion@python.org>
>> *Subject:* [Numpy-discussion] How to avoid this memory copy overhead
>> in d=a*b+c?
>>
>> Hello everyone, I'm here again to ask a naive question about NumPy
>> performance.
>> As far as I know, NumPy's vectorized operators are very effective
>> because they use SIMD instructions and multiple threads, compared to
>> index-style programming (a "for" loop assigning each element by its
>> index into the array).
>>
>> I was wondering how fast NumPy could be, so I did some experiments.
>> Take this simple task as an example:
>>
>>     a = np.random.rand(10_000_000)
>>     b = np.random.rand(10_000_000)
>>     c = a + b
>>
>> To check the performance, I also wrote a simple multi-threaded C++
>> implementation of adding two arrays (compiled with -O3 -mavx2). I
>> found that the C++ implementation is slightly faster than NumPy
>> (running each 100 times to get reasonably convincing statistics).
>>
>> *Here comes the first question: where does this efficiency gap come
>> from?* I guess it is because NumPy needs to load the shared object,
>> find the ufunc wrapper, and only then execute the underlying
>> computation. Am I right? Am I missing something here?
>>
>> Then I did another experiment with the statement d = a * b + c, where
>> a, b, c and d are all NumPy arrays. I implemented the same logic in
>> C++ and found that C++ is 2 times faster than NumPy on average (also
>> executed 100 times each).
>>
>> I guess this is because in Python we first calculate
>>
>>     temporary_var = a * b
>>
>> and then
>>
>>     d = temporary_var + c
>>
>> so we incur an unnecessary memory-transfer overhead: since each array
>> is very large, NumPy has to write temporary_var out to memory and
>> then read it back into cache.
>>
>> In C++, however, we can just write d[i] = a[i] * b[i] + c[i] and
>> never create a temporary array, avoiding the memory-transfer penalty.
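One plain-NumPy way to avoid the temporary while leaving a, b and c untouched (not spelled out in the thread, but a standard technique) is the out= argument of ufuncs. A sketch; note it still makes two passes over memory rather than the single fused pass of the C++ loop:

```python
import numpy as np

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)
c = np.random.rand(10_000_000)

# Preallocate the result once and reuse it as the scratch buffer:
# no hidden temporary array is allocated, and a, b, c are preserved.
d = np.empty_like(a)
np.multiply(a, b, out=d)  # d = a * b, written directly into d
d += c                    # d = a * b + c, second pass over d and c

assert np.allclose(d, a * b + c)
```

By the question's own access counting this is still 5 whole-array passes (a, b, d, then d, c) against the fused loop's 4, which is why a blocking evaluator like numexpr is the usual answer when that last pass matters.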
>> *So the other question is whether there is a way to avoid this kind
>> of overhead.* I've learned that in NumPy we can create our own ufunc
>> with *frompyfunc*, but it seems there is no SIMD optimization nor
>> multi-threading there, since it is 100 times slower than the
>> *d = a * b + c* way.
>>
>> _______________________________________________
>> NumPy-Discussion mailing list -- numpy-discussion@python.org
>> To unsubscribe send an email to numpy-discussion-le...@python.org
>> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
>> Member address: 27rabbi...@gmail.com

-- 
Francesc Alted
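To illustrate the frompyfunc observation above: frompyfunc wraps a Python callable and invokes it once per element, so no SIMD or multi-threading is possible, which is consistent with the ~100x slowdown reported. A minimal sketch (the three-argument lambda is illustrative, not from the thread):

```python
import numpy as np

a = np.random.rand(1_000)
b = np.random.rand(1_000)
c = np.random.rand(1_000)

# frompyfunc(func, nin, nout): the Python-level lambda runs once per
# element, so the loop body cannot be vectorized or parallelized.
fma = np.frompyfunc(lambda x, y, z: x * y + z, 3, 1)

# frompyfunc ufuncs return object-dtype arrays; convert back to float64.
d = fma(a, b, c).astype(np.float64)

assert np.allclose(d, a * b + c)
```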