You can use in-place operators where appropriate to avoid the extra memory allocation:

 

a *= b

c += a
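For the d = a * b + c case from the quoted message, a minimal sketch of the in-place version (note that it overwrites a and c, so copy them first if you still need their original values):

```python
import numpy as np

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)
c = np.random.rand(10_000_000)

# Reference result, computed with a temporary, for comparison.
expected = a * b + c

# In-place version: no full-size temporary is allocated, but the
# original contents of a and c are destroyed.
a *= b      # a now holds a * b
c += a      # c now holds a * b + c
d = c       # d is just another name for the c buffer

assert np.allclose(d, expected)
```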

 

Kevin

 

 

From:
Sent: Friday, September 16, 2022 8:35 AM
To: Discussion of Numerical Python
Subject: [Numpy-discussion] How to avoid this memory copy overhead in d=a*b+c?

 

Hello everyone, I'm here again to ask a naive question about NumPy performance.

 

As far as I know, NumPy's vectorized operations are very efficient compared to index-style programming (using a "for" loop and assigning each element by its index), because they can use SIMD instructions and, for some operations, multiple threads.

 

I'm wondering how fast NumPy can be, so I did some experiments. Take this simple task as an example:

    a = np.random.rand(10_000_000)

    b = np.random.rand(10_000_000)

    c = a + b
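A minimal timing sketch of the gap between a Python-level loop and the vectorized add (a smaller array is used here to keep the loop affordable; absolute numbers will vary by machine):

```python
import numpy as np
from timeit import timeit

n = 100_000
a = np.random.rand(n)
b = np.random.rand(n)

def loop_add(a, b):
    # Index-style version: one interpreted Python iteration per element.
    out = np.empty_like(a)
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out

t_loop = timeit(lambda: loop_add(a, b), number=3)
t_vec = timeit(lambda: a + b, number=3)
print(f"loop: {t_loop:.4f}s  vectorized: {t_vec:.4f}s")

assert np.allclose(loop_add(a, b), a + b)
```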

 

To check the performance, I wrote a simple multi-threaded C++ implementation of adding two arrays (compiled with -O3 -mavx2). I found that the C++ implementation is slightly faster than NumPy (running each 100 times to get reasonably convincing statistics).

 

Here comes the first question: where does this efficiency gap come from?

I guess this is because NumPy needs to load the shared object, find the ufunc wrapper, and only then execute the underlying computation. Am I right? Am I missing something here?
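One way to see the fixed per-call dispatch cost is to time the same add at different sizes; below some size the per-element cost is dominated by overhead rather than the computation itself (a sketch; absolute numbers vary by machine):

```python
import numpy as np
from timeit import timeit

per_element_ns = {}
for n in (10, 10_000, 10_000_000):
    a = np.random.rand(n)
    b = np.random.rand(n)
    # Average seconds per call of a single vectorized add.
    t = timeit(lambda: a + b, number=100) / 100
    per_element_ns[n] = t / n * 1e9
    print(f"n={n:>10}: {t * 1e6:9.2f} us/call, "
          f"{per_element_ns[n]:8.3f} ns/element")
```

At tiny sizes the ns/element figure is inflated by the constant dispatch overhead; at 10 million elements that overhead is negligible and memory bandwidth dominates.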

 

Then I did another experiment for the statement d = a * b + c, where a, b, c, and d are all NumPy arrays. I also implemented this logic in C++ and found that C++ is about 2 times faster than NumPy on average (also executed 100 times each).

 

I guess this is because in Python we first calculate:

    temporary_var = a * b

and then:

    d = temporary_var + c

so we have unnecessary memory-transfer overhead: since each array is very large, NumPy has to write temporary_var out to memory and then read it back into cache.
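The two-step evaluation can be made explicit with the ufunc out= argument, which reuses a single output buffer instead of allocating a separate temporary (a sketch; np.multiply and np.add are the ufuncs behind * and +):

```python
import numpy as np

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)
c = np.random.rand(10_000_000)

# Naive version: "a * b" allocates a full-size temporary first.
d_naive = a * b + c

# out= version: allocate d once, then write the second pass into it.
d = np.multiply(a, b)    # d = a * b (one allocation)
np.add(d, c, out=d)      # d = d + c, reusing the same buffer

assert np.allclose(d, d_naive)
```

Unlike the `a *= b` approach, this leaves the input arrays untouched.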

 

However, in C++ we can just write d[i] = a[i] * b[i] + c[i], so no temporary array is created and there is no memory-transfer penalty.

 

So my other question is: is there a way to avoid this kind of overhead? I've learned that in NumPy we can create our own ufunc with np.frompyfunc, but it seems there is no SIMD optimization or multi-threading there, since it is about 100 times slower than the plain "d = a * b + c" way.
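For reference, the frompyfunc route mentioned above looks roughly like this (fma_py is a hypothetical name for illustration). The resulting ufunc calls a Python function once per element and returns an object-dtype array, which is why it cannot benefit from SIMD or type-specialized inner loops:

```python
import numpy as np

a = np.random.rand(1_000)
b = np.random.rand(1_000)
c = np.random.rand(1_000)

# Wrap a pure-Python callable as a ufunc: 3 inputs, 1 output.
fma_py = np.frompyfunc(lambda x, y, z: x * y + z, 3, 1)

d = fma_py(a, b, c)

# The result has object dtype: each element is a boxed Python float,
# so every operation went through the interpreter.
assert d.dtype == object
assert np.allclose(d.astype(np.float64), a * b + c)
```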

 

 

 

_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/