On 06/16/2011 11:44 AM, Christopher Barker wrote:
NOTE: I'm only taking part in this discussion because it's interesting
and I hope to learn something. I do hope the OP chimes back in to
clarify his needs, but in the meantime...
Bruce Southey wrote:
Remember that is what the OP wanted to do, not me.
Actually, I don't think that's what the OP wanted -- I think we have a
conflict between the need for concrete examples, and the desire to
find a generic solution, so I think this is what the OP wants:
How to best multiprocess a _generic_ operation that needs to be
performed on a lot of arrays. Something like:
    output = []
    for a in a_bunch_of_arrays:
        output.append(a_function(a))
More specifically, a_function() is an inner product, *defined by the
user*.
So there is no way to optimize the inner product itself (that will be
up to the user), nor any way to generally convert the bunch_of_arrays
to a single array with a single higher-dimensional operation.
In testing his approach, the OP used a numpy multiply, and a simple,
loop-through-the-elements multiply, and found that with his
multiprocessing calls, the simple loop was a fair bit faster with two
processors, but that the numpy one was slower with two processors. Of
course, the looping method was much, much slower than the numpy one
in any case.
So Sturla's comments are probably right on:
Sturla Molden wrote:
"innerProductList = pool.map(myutil.numpy_inner_product, arrayList)"
1. Here we potentially have a case of false sharing and/or mutex
contention, as the work is too fine grained. pool.map does not do
any load balancing. If pool.map is to scale nicely, each work item
must take a substantial amount of time. I suspect this is the main
issue.
2. There is also the question of when the process pool is spawned.
Though I haven't checked, I suspect it happens prior to calling
pool.map. But if it does not, this is a factor as well, particularly
on Windows (less so on Linux and Apple).
It didn't work well on my Mac, so it's either not an issue, or not
Windows-specific, anyway.
3. "arrayList" is serialised by pickling, which has a significant
overhead. It's not shared memory either, as the OP's code implies,
but the main thing is the slowness of cPickle.
I'll bet this is a big issue, and one I'm curious about how to
address. I have another problem where I need to multi-process, and I'd
love to know a way to pass data to the other process and back
*without* going through pickle. Maybe memmapped files?
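A rough sketch of how the memmapped-file idea might look, not a benchmarked recipe. The array lives in a file created with numpy.memmap, and each work item carries only the path, shape, dtype and a row index, so the array data itself is never pickled. The names create_shared_array and worker_sum_of_squares, the file name and the array shape are all hypothetical. The tasks are run serially here to keep the sketch self-contained; with a process pool the same tuples would be handed to pool.map.

```python
import os
import tempfile

import numpy as np


def create_shared_array(path, shape, dtype=np.float64):
    """Create a file-backed array that other processes can open by path."""
    return np.memmap(path, dtype=dtype, mode="w+", shape=shape)


def worker_sum_of_squares(task):
    """Hypothetical worker: receives a small description, not the data."""
    path, shape, dtype, row = task
    # Each worker maps the same file read-only; no array is pickled.
    data = np.memmap(path, dtype=dtype, mode="r", shape=shape)
    return float(np.sum(data[row] * data[row]))


# Illustrative sizes and file name only.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "arrays.dat")
shape = (4, 1000)

shared = create_shared_array(path, shape)
shared[:] = np.arange(4 * 1000, dtype=np.float64).reshape(shape)
shared.flush()  # make sure the data is on disk before others open it

# Only these small tuples would cross the process boundary.
tasks = [(path, shape, "float64", row) for row in range(shape[0])]

# Serial stand-in for pool.map(worker_sum_of_squares, tasks).
results = [worker_sum_of_squares(task) for task in tasks]
print(results)
```

The results still come back through pickle, which is cheap when they are scalars; large outputs would need a second writable memmap.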
"IPs = N.array(innerProductList)"
4. numpy.array is a very slow function. The benchmark should
preferably not include this overhead.
I re-ran, moving that out of the timing loop, and, indeed, it helped a
lot, but it still takes longer with the multi-processing.
I suspect that the overhead of pickling, etc. is overwhelming the
operation itself. That and the load balancing issue that I don't
understand!
To test this, I did a little experiment -- creating a "fake"
operation, one that simply returns an element from the input array --
so it should take next to no time, and we can time the overhead of the
pickling, etc:
$ python shared_mem.py
Using 2 processes
No shared memory, numpy array multiplication took 0.124427080154 seconds
Shared memory, numpy array multiplication took 0.586215019226 seconds
No shared memory, fake array multiplication took 0.000391006469727 seconds
Shared memory, fake array multiplication took 0.54935503006 seconds
No shared memory, my array multiplication took 23.5055780411 seconds
Shared memory, my array multiplication took 13.0932741165 seconds
Bingo!
The overhead of the multi-processing takes about 0.54 seconds, which
explains the slowdown for the numpy method. Not so mysterious after
all.
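To get a feel for how much of that overhead could be serialization alone, here is a minimal sketch that times a pickle round trip of a list of arrays next to a "fake" operation that does no real work. It runs in a single process, so it leaves out process start-up and pipe transfer, both of which multiprocessing adds on top. The array count and shape are arbitrary assumptions, and fake_operation, time_call and pickle_round_trip are hypothetical helper names, not taken from the OP's script.

```python
import pickle
import time

import numpy as np


def fake_operation(a):
    """Stand-in for the "fake" operation: return one element, no real work."""
    return a[0, 0]


def time_call(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def pickle_round_trip(obj):
    """Serialize and deserialize, roughly what a worker hand-off costs."""
    return pickle.loads(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


# Arbitrary sizes, chosen only for illustration.
arrays = [np.ones((300, 200)) for _ in range(50)]

_, compute_time = time_call(lambda: [fake_operation(a) for a in arrays])
copies, pickle_time = time_call(pickle_round_trip, arrays)

print(f"fake operation:    {compute_time:.6f} s")
print(f"pickle round trip: {pickle_time:.6f} s")
```

If the second number dwarfs the first, the work items are too cheap to be worth shipping to another process.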
Bruce Southey wrote:
But if everything is *single-threaded* and thread-safe, then you just
create a function and use Anne's very useful handythread.py
(http://www.scipy.org/Cookbook/Multithreading).
This may be worth a try -- though the GIL could well get in the way.
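A minimal sketch of the threaded variant, using concurrent.futures from the standard library as a stand-in for handythread.py, whose API is not shown in this thread. Threads share memory, so nothing is pickled. NumPy releases the GIL during many array operations, so the threads may overlap for large enough arrays, but whether that pays off for this workload is untested here. The helper names, array shapes and worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def inner_product(pair):
    """Element-wise inner product, as the OP defines it: sum(a * b)."""
    a, b = pair
    return float(np.sum(a * b))


def threaded_map(func, items, max_workers=2):
    """Apply func to each item using a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(func, items))


# Arbitrary test data for illustration.
rng = np.random.default_rng(0)
pairs = [(rng.random((300, 200)), rng.random((300, 200))) for _ in range(8)]

serial = [inner_product(p) for p in pairs]
threaded = threaded_map(inner_product, pairs)
print(np.allclose(serial, threaded))
```

A user-defined inner product written as a pure-Python loop would hold the GIL throughout, so this approach would not be expected to help in that case.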
By the way, if the arrays are sufficiently small, there is a lot of
overhead involved such that there is more time in communication than
computation.
Yup -- clearly the case here. I wonder if it's just array size though
-- won't cPickle time scale with array size? So it may not be size
per se, but rather how much computation you need for a given size array.
-Chris
[I've enclosed the OP's slightly altered code]
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Please see:
http://mail.scipy.org/pipermail/numpy-discussion/2011-June/056766.html
"I'm doing element wise multiplication, basically innerProduct =
numpy.sum(array1*array2) where array1 and array2 are, in general,
multidimensional."
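For multidimensional inputs, the expression quoted above is not the same thing as np.inner(), which is worth keeping in mind when reading any timings that use np.inner(). A small sketch with arbitrary shapes: numpy.sum(array1*array2) gives a scalar, np.vdot gives the same scalar after flattening, and np.inner on 2-D arrays returns the full matrix of row-by-row inner products, which is far more work.

```python
import numpy as np

# Arbitrary small shapes, for illustration only.
rng = np.random.default_rng(0)
array1 = rng.random((30, 20))
array2 = rng.random((30, 20))

# The OP's definition: element-wise multiply, then sum everything.
elementwise = np.sum(array1 * array2)

# vdot flattens both inputs first, so it yields the same scalar.
flattened = np.vdot(array1, array2)

# inner contracts over the last axis only: result has shape (30, 30).
pairwise = np.inner(array1, array2)

print(elementwise, flattened, pairwise.shape)
# The OP's scalar is the trace of the pairwise matrix.
print(np.isclose(np.trace(pairwise), elementwise))
```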
Thanks for the code as I forgot that it was sent.
I think there is something weird about these timings (probably because
they use time and not timeit): the shared timings should not be
constant across the number of processors. The numpy multiplication
approach shows rather constant differences, but np.inner() clearly
differs with the number of processors used. So I think that numpy may be
using multiple threads here.
It is far more evident with large arrays:
arraySize = (3000,200)
numArrays = 50
Using 1 processes
No shared memory, numpy array multiplication took 0.279149055481 seconds
Shared memory, numpy array multiplication took 1.87239384651 seconds
No shared memory, inner array multiplication took 14.9514381886 seconds
Shared memory, inner array multiplication took 17.0087819099 seconds
Using 4 processes
No shared memory, numpy array multiplication took 0.279071807861 seconds
Shared memory, numpy array multiplication took 1.48242783546 seconds
No shared memory, inner array multiplication took 15.1401138306 seconds
Shared memory, inner array multiplication took 5.2479391098 seconds
Using 8 processes
No shared memory, numpy array multiplication took 0.281194925308 seconds
Shared memory, numpy array multiplication took 1.44942212105 seconds
No shared memory, inner array multiplication took 15.3794519901 seconds
Shared memory, inner array multiplication took 3.51714301109 seconds
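On the time-versus-timeit point, a minimal sketch of how a timing could be taken with timeit.repeat instead of a single time.time() difference. Taking the minimum over several repeats is the usual way to reduce noise from other activity on the machine. The pure-Python loop below stands in for the OP's loop-through-the-elements multiply; its name and the input sizes are assumptions.

```python
import timeit


def loop_inner_product(a, b):
    """Pure-Python inner product: multiply element by element and sum."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


# Arbitrary inputs for illustration.
a = [float(i) for i in range(1000)]
b = [2.0] * 1000

number = 100  # calls per measurement
timings = timeit.repeat(lambda: loop_inner_product(a, b),
                        repeat=5, number=number)

# Best observed time per call; the minimum is least affected by noise.
best_per_call = min(timings) / number
print(f"best of 5: {best_per_call:.3e} s per call")
```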
Bruce