John Darrington <[EMAIL PROTECTED]> writes:

> It would be useful to have some benchmark tests, so that we can see
> the effect of each of them. Also benchmarking would be good for
> marketing purposes.
Of course. (But it can't really be done unless/until we actually
implement them.)

> On Mon, May 08, 2006 at 10:14:44PM -0700, Ben Pfaff wrote:
> > As a trivial example, imagine that we want the mean of 1e12
> > values. The master program could break the values into 100
> > chunks of 1e10 values each, and tell 100 other computers to each
> > find the mean of a single chunk. Those computers would each
> > report back their single value and the master program would in
> > turn take the mean of those 100 values, yielding the overall
> > mean of all 1e12 values.
>
> Do you have access to a 100-machine cluster to test it?

I have access to a smaller cluster of perhaps 25 machines. (The
combining step itself is simple; there's a sketch of it at the end of
this message.)

> > In step 3, we would use threads plus map-reduce or some other
> > parallel programming model to improve performance on symmetric
> > multiprocessing (SMP) machines.
> >
> > 4. Take #3 and then extend it, by allowing jobs to be farmed out
> >    not just to threads but to processes running on other
> >    computers as well. I won't speculate on how much work this
> >    would be, but it's clearly a big job.
>
> If you implement #3 using MPI, then there's nothing to be done for #4.
>
> However, I dabbled in MPI parallel processing a few years ago, and
> struggled to come up with a real-life problem which was large enough
> to overcome the extra overhead.

MPI might be what we want, but I'm not sure. At this point it's just
speculation. (I've appended a minimal MPI sketch of the mean example
below, to show what it might look like.)

> Some of the math for statistical procedures would need to be
> parallelised inside gsl --- I don't know if gsl supports parallel
> execution. For example, most matrix operations can be parallelised,
> which is worth doing for very large matrices.

Good point. (See the OpenMP sketch at the end of this message for how
mechanical that can be for a matrix operation.)

> I'm acutely aware that in some places the code performs very badly.
> In particular, operations which involve percentiles are currently
> implemented in a very non-optimal manner, and in fact will probably
> exhaust memory if passed very large data sets.

Code can always be optimized.

> Also, there are opportunities to cache things that procedures use.
> E.g., most parametric procedures make use of the data's covariance
> matrix. If we can let that persist between procedures, that will
> avoid a lot of calculations being repeated; just so long as we
> invalidate that cache when appropriate.

Yes, I forgot to put that in my list. It's probably parallel to item
#2. (A sketch of what the cache's invalidation contract might look
like is below.)

> It's not going to be much use having a PSPP which can copy data from
> A to B at the speed of light, if any procedures take a year to
> execute.

I think the goal should really be to avoid copying data where
possible. Presumably, in the multi-machine case, the data should be
stored on a network server or distributed among the machines on local
disks, not copied from a master machine to the many machines doing the
computation.
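Here are the sketches mentioned above. First, the master's combining
step for the chunked mean, in plain C with made-up names. Averaging
the 100 per-chunk means directly is only correct because the chunks
are equal-sized, so this sketch weights by chunk size to stay exact
in general:

    /* Combining step for a distributed mean. Hypothetical: each of
       the n workers reports (count, mean) for its chunk. Weighting
       by count gives the exact overall mean even when the chunks
       are not all the same size. */
    #include <stddef.h>

    struct chunk_result
      {
        double count;           /* number of values in the chunk */
        double mean;            /* mean of the values in the chunk */
      };

    static double
    combine_means (const struct chunk_result *r, size_t n)
    {
      double total = 0.0, weighted = 0.0;
      size_t i;

      for (i = 0; i < n; i++)
        {
          total += r[i].count;
          weighted += r[i].count * r[i].mean;
        }
      return weighted / total;
    }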
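Second, a minimal MPI sketch of the same mean computation, to
illustrate John's point that doing #3 with MPI gives much of #4 for
free: the identical program runs with one process per CPU on an SMP
box or with processes spread across a cluster. How each process gets
its slice of the data is waved away in a comment:

    /* Each process computes a partial (sum, count) over its share
       of the data; MPI_Reduce then combines the partials on rank 0. */
    #include <mpi.h>
    #include <stdio.h>

    int
    main (int argc, char **argv)
    {
      double part[2] = { 0.0, 0.0 };    /* { local sum, local count } */
      double tot[2];
      int rank;

      MPI_Init (&argc, &argv);
      MPI_Comm_rank (MPI_COMM_WORLD, &rank);

      /* ... read this process's chunk, accumulating into part[0]
         (sum of values) and part[1] (number of values) ... */

      MPI_Reduce (part, tot, 2, MPI_DOUBLE, MPI_SUM, 0,
                  MPI_COMM_WORLD);

      if (rank == 0)
        printf ("mean = %g\n", tot[0] / tot[1]);

      MPI_Finalize ();
      return 0;
    }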
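Third, on parallelising matrix operations: as far as I know GSL
itself is serial, so any parallelism would have to be layered on top
of it. For many matrix operations that is mechanical. E.g., in a
matrix-vector product the rows are independent, so a single OpenMP
pragma splits the outer loop across CPUs (plain arrays here for
brevity; the same structure would work over a gsl_matrix):

    /* y = A*x, with the row loop split across CPUs by OpenMP. The
       signed loop index is for the benefit of OpenMP
       implementations that require one. */
    #include <stddef.h>

    static void
    mat_vec (size_t n, const double *a /* n x n, row-major */,
             const double *x, double *y)
    {
      long i;

    #pragma omp parallel for
      for (i = 0; i < (long) n; i++)
        {
          double sum = 0.0;
          size_t j;

          for (j = 0; j < n; j++)
            sum += a[(size_t) i * n + j] * x[j];
          y[i] = sum;
        }
    }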
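Finally, a sketch of what the covariance cache's contract might look
like. All names here are hypothetical; the real invalidation hook
would have to live wherever transformations modify or replace the
active file:

    #include <stdbool.h>
    #include <stddef.h>

    struct cov_cache
      {
        bool valid;
        size_t n_vars;
        double *matrix;         /* n_vars x n_vars, row-major */
      };

    /* Every transformation that changes the data calls this. */
    static void
    cov_cache_invalidate (struct cov_cache *c)
    {
      c->valid = false;
    }

    /* Procedures call this; the matrix is recomputed only on a
       cache miss. */
    static const double *
    cov_cache_get (struct cov_cache *c)
    {
      if (!c->valid)
        {
          /* ... recompute c->matrix by a pass over the active
             file ... */
          c->valid = true;
        }
      return c->matrix;
    }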
