On 06/16/2011 02:05 PM, Brandt Belson wrote:
Hi all,
Thanks for the replies. As mentioned, I'm parallelizing so that I can
take many inner products simultaneously (which I agree is
embarrassingly parallel). The library I'm writing asks the user to
supply a function that takes two objects and returns their inner
product. After all the discussion, though, this seems too simplistic
an approach. Instead, I plan to write this part of the
library as if the inner product function supplied by the user uses all
available cores (with numpy and/or numexpr built with MKL or LAPACK).
As for using Fortran or C with OpenMP, this probably isn't worth the
time it would take, both for me and the user.
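For concreteness, here is a minimal sketch of the kind of interface described above. It is not the library's actual code: the names `example_inner_product` and `inner_product_matrix` are made up for illustration, and the loop is deliberately serial on the assumption that any parallelism comes from the user-supplied function itself (e.g. a multithreaded BLAS).

```python
import numpy as np

def example_inner_product(a, b):
    """Hypothetical user-supplied function: inner product of two arrays."""
    # np.vdot flattens both arrays and sums the elementwise products
    # (conjugating the first argument for complex input).
    return np.vdot(a, b)

def inner_product_matrix(rows, cols, inner_product):
    """Return M with M[i, j] = inner_product(rows[i], cols[j])."""
    result = np.empty((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            result[i, j] = inner_product(r, c)
    return result

# Small illustrative data set; the sizes are arbitrary.
rng = np.random.default_rng(0)
arrays = [rng.standard_normal((30, 10)) for _ in range(4)]
gram = inner_product_matrix(arrays, arrays, example_inner_product)
```

With this shape, the library never needs to know how the inner product is computed, only that it takes two objects and returns a number.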
I've tried increasing the array sizes and found the same trends, so
the slowdown isn't only because the arrays are too small to see the
benefit of multiprocessing. I wrote the code to be easy for anyone to
experiment with, so feel free to play around with what is included in
the profiling, the sizes of arrays, functions used, etc.
I also tried using handythread.foreach with arraySize = (3000,1000),
and found the following:
No shared memory, numpy array multiplication took 1.57585811615 seconds
Shared memory, numpy array multiplication took 1.25499510765 seconds
This is definitely an improvement over multiprocessing, but without
knowing any better, I was hoping to see a roughly 8x speedup on my
8-core workstation.
Based on what Chris sent, it seems there is some large overhead caused
by multiprocessing pickling numpy arrays. To test what Robin mentioned:
> If you are on Linux or Mac then fork works nicely so you have read
> only shared memory you just have to put it in a module before the fork
> (so before pool = Pool() ) and then all the subprocesses can access it
> without any pickling required. ie
> myutil.data = listofdata
> p = multiprocessing.Pool(8)
> def mymapfunc(i):
>     return mydatafunc(myutil.data[i])
>
> p.map(mymapfunc, range(len(myutil.data)))
I tried creating the arrayList in the myutil module and using
multiprocessing to find the inner products of myutil.arrayList.
However, this was still slower than not using multiprocessing, so I
believe there is still some large overhead. Here are the results:
No shared memory, numpy array multiplication took 1.55906510353 seconds
Shared memory, numpy array multiplication took 9.82426381111 seconds
Shared memory, myutil.arrayList numpy array multiplication took 8.77094507217 seconds
I'm attaching this code.
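For readers without the attachment, the fork-based setup can be sketched roughly as follows. This is an illustration, not the attached code: `array_list`, `inner_with_first`, and `parallel_inner_products` are hypothetical names, the array sizes are arbitrary, and it assumes a platform where the "fork" start method is available (Linux or Mac).

```python
import multiprocessing as mp
import numpy as np

# Module-level list, filled in before the pool is created so that forked
# workers inherit it instead of receiving pickled copies.
array_list = []

def inner_with_first(i):
    """Worker: inner product of array 0 with array i.

    Only the integer index and the float result cross the process boundary.
    """
    return float(np.vdot(array_list[0], array_list[i]))

def parallel_inner_products(num_procs=2):
    ctx = mp.get_context("fork")  # assumption: fork is available (not Windows)
    with ctx.Pool(num_procs) as pool:
        return pool.map(inner_with_first, range(len(array_list)))

# The data must exist before the pool forks its workers.
rng = np.random.default_rng(0)
array_list.extend(rng.standard_normal((100, 50)) for _ in range(4))
results = parallel_inner_products()
```

Passing only indices keeps the arrays themselves out of the pickling path, which is the point of Robin's suggestion.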
I'm going to work around this numpy/multiprocessing behavior with
numpy/numexpr built with MKL or LAPACK. It would be good to know
exactly what's causing this though. It would be nice if there was a
way to get the ideal speedup via multiprocessing, regardless of the
internal workings of the single-threaded inner product function, as
this was the behavior I expected. I imagine other people might come
across similar situations, but again I'm going to try to get around
this by letting MKL or LAPACK make use of all available cores.
Thanks again,
Brandt
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
I think this is not being benchmarked correctly, because there should be
a noticeable difference when a different number of threads is selected.
But really you should read these sources:
http://www.scipy.org/ParallelProgramming
http://stackoverflow.com/questions/5260068/multithreaded-blas-in-python-numpy
Also numpy has extra things going on like checks and copies that
probably make using np.inner() slower. Thus, your 'numpy_inner_product'
is probably as efficient as you can get without extreme measures like
Cython.
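One rough way to check the thread-count effect is to time the same product in fresh interpreters with the thread count capped through environment variables. This is only a sketch: `time_matmul` and the matrix size are made up for illustration, and which of the variables is actually honored depends on the BLAS that numpy was built against.

```python
import os
import subprocess
import sys

# Child script: time one matrix product and print the elapsed seconds.
CHILD = """
import time
import numpy as np
a = np.ones(({n}, {n}))
start = time.perf_counter()
a.dot(a)
print(time.perf_counter() - start)
"""

def time_matmul(num_threads, n=500):
    """Run the product in a fresh interpreter with the thread count capped."""
    env = dict(os.environ)
    # Set all three; the BLAS in use decides which one matters.
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        env[var] = str(num_threads)
    out = subprocess.run(
        [sys.executable, "-c", CHILD.format(n=n)],
        env=env, capture_output=True, text=True, check=True,
    )
    return float(out.stdout)

timings = {t: time_matmul(t) for t in (1, 2)}
```

The variables are set before numpy is imported in the child, since the thread pool is typically sized when the BLAS library loads.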
Bruce