2009/12/14 Eugeniy Mikhailov <[email protected]>:
>> > But here is a problem: suppose you have a quite large input set (let's
>> > say of size N) with which you have to evaluate a function in a
>> > loop (let's say M times). Now you need to reserve N*M amount of
>> > memory, while with a parfor loop you would need no more than M*(number
>> > of cpus/cores). In other words, 'parcellfun' seems to be memory
>> > hungry.
>>
>> Can you give a more specific example? Big inputs are usually generated
>> from smaller ones; so you can simply extend the parallelized part to work
>> with the smaller inputs. I see
>
> Well, unfortunately, generating that input takes quite long, mainly because
> I did not spend much time trying to vectorize the code. But globals can
> indeed be passed to parcellfun. Apparently I just had a weird typo
> somewhere.
>
The general rule is that vectorization helps if your loops have
relatively cheap bodies. If there's a big matrix multiplication or
factorization somewhere, vectorization will probably win you
little to nothing.
Conversely, sometimes people have unnecessarily loopy
code that can be vectorized for some 20x speed-up, i.e. more than
parallelization can typically offer.
There are other advantages to vectorization that are not often mentioned. As
soon as you master it somewhat, you'll find that fully
vectorized code is extremely easy to debug, even without a debugger,
because it becomes just a linear forward sequence of transformations,
each of which can be checked individually.
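For instance, a loopy computation like the following (a hypothetical sketch,
not from the original discussion) can often be replaced by a single
vectorized expression:

```octave
% Loopy version: square each element of x
y = zeros (size (x));
for i = 1:numel (x)
  y(i) = x(i)^2;
end

% Vectorized equivalent; with cheap loop bodies like this,
% the vectorized form is typically an order of magnitude faster:
y = x .^ 2;
```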
>> that unlike cellfun, parcellfun does not allow auto-expanding scalar
>> cells as arguments. I'll add that feature. But it can be worked around
>> using an anonymous function; in general there is no need to duplicate
>> inputs.
>
> Could you please show me an example with an anonymous function? I am
> definitely unaware of this trick.
>
Suppose you have N matrices A{1} ... A{N} and you want to calculate
A{i} \ B for a given B and all i.
You can either build up a cell array of copies of B
cellfun (@mldivide, A, {B}(ones (1, N)))
or encapsulate B in an anonymous function
cellfun (@(X) X \ B, A)
and equivalently for parcellfun. Note that the expression {B}(ones (1,
N)), although it creates N copies of B, is not at all inefficient;
Octave uses shallow copying where possible, so that {B}(ones (1, N))
will only occupy the memory for B plus about 8*N bytes or so.
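A quick way to see the shallow copying at work (a sketch; the sizes and
counts are illustrative):

```octave
B = rand (1000);              % a matrix holding ~8 MB of data
C = {B}(ones (1, 1000));      % 1000 cells referring to the same data
% No deep copy happens here: each cell shares B's data, and a real
% copy is made only when some C{i} is later modified (copy-on-write).
```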
cellfun also allows you to do
cellfun (@mldivide, A, {B}),
which is useful for performance reasons (handles to built-in functions are
significantly more efficient than anonymous functions).
parcellfun currently doesn't have this feature, so I think I'll add it.
>>
>> > Unless I miss something, I do not see a way to pass a global
>> > variable to 'parcellfun'; at least it seems to fail at this stage. Also,
>> > 'evalin' does not work either, probably for the same reason.
>> >
>>
>> Surely you can use a global variable in the function being evaluated.
>> At least it should work; if you found a bug, please submit an example.
> As I said above, it is a bug in my test code. My apologies for the
> unchecked results.
>
>> Octave as a whole is memory hungry, so if you're short of memory, Octave
>> may be problematic in general. Just for the record, for intensive
>> computations I use Octave on a machine with 8 CPUs and 16GB RAM, and I
>> don't think I ever exceeded 2GB.
>
> My coding machine is much more humble :) I have just 512 MB of RAM, so we
> have slightly different definitions of memory hungry. Before I found out
> how to pass globals to parcellfun, I had to copy a quite big matrix for
> every cell, which eats both CPU cycles and memory. Now everything
> seems to be fine.
>
Maybe you did something wrong? As I said, copying a 100MB matrix to
1000 cells should eat only about 8kb of memory. The copies share the
data until a physical copy is needed.
If you show the code (or relevant parts), maybe we'll be able to help.
>
> At our cluster, which I have yet to learn how to use, they have quite a mix
> of hardware, so I do not know in advance which machine will execute the code.
>
> Matlab's parfor still has an appealing side: it seems to know about local
> cpus/cores and can also execute code in a cluster environment on remote
> cpus (for extra money, though). And in the worst-case scenario it falls
> back and behaves just like a normal 'for'. But it seems like 'parcellfun'
> could spread over a MOSIX cluster without extra work as well.
>
Note that parcellfun uses fork()ing, so normally it will only be able
to utilize the CPUs (cores) of a single node, unless your cluster is
equipped with special software that allows migration of processes
amongst nodes (I've heard some clusters can do this, but I've never
seen it). This is ideal for our cluster, where we have 4- and 8-CPU
nodes and typically a person reserves CPUs on just a single node. But
for clusters with many single-CPU nodes (and a fast network),
parcellfun is just useless.
For more general parallelism, there's either the parallel package or
a very recent (and under development) openMPI package. But then
parallelization is no longer a drop-in replacement for functions.
>
> Maybe it makes sense for 'cellfun' to call 'parcellfun' if some global
> switch is toggled by the user. Of course, only once it is argument
> compatible (i.e. capable of expanding scalars like cellfun).
>
No, this is out of the question in any near future. The main reason is
that cellfun cannot make assumptions about the complexity of the
function being evaluated; for "cheap" functions, cellfun will
significantly outperform parcellfun because of the overhead of the
parallel setup and communication. parcellfun only pays off for
expensive functions, but the function itself simply cannot tell.
Surely there could be an option for that, but then you may as well have
two functions (especially given that their implementations are very
different).
But in your own code, you can easily achieve the trick yourself by
adding something like the following to the front of the script:
## uncomment the following line to run in parallel, set the number of CPUs
## ncpus = 8; cellfun = @(varargin) parcellfun (ncpus, varargin{:});
Note that this will alter *all* cellfun calls; in more complicated
code, you may instead want to be picky and parallelize just some of them.
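For instance, instead of shadowing cellfun itself, a sketch of the picky
approach might look like this (pcellfun, expensive_fun and big_inputs are
hypothetical names, not part of any package):

```octave
ncpus = 8;
pcellfun = @(varargin) parcellfun (ncpus, varargin{:});

heavy = pcellfun (@expensive_fun, big_inputs);  % run in parallel
sizes = cellfun (@numel, big_inputs);           % cheap, stays serial
```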
best regards
--
RNDr. Jaroslav Hajek
computing expert & GNU Octave developer
Aeronautical Research and Test Institute (VZLU)
Prague, Czech Republic
url: www.highegg.matfyz.cz
_______________________________________________
Octave-dev mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/octave-dev