John Darrington <[EMAIL PROTECTED]> writes:

> It would be useful to have some benchmark tests, so that we can see
> the effect of each of them. Also benchmarking would be good for
> marketing purposes.
Of course. (But it can't really be done unless/until we actually
implement them.)

> On Mon, May 08, 2006 at 10:14:44PM -0700, Ben Pfaff wrote:
> > As a trivial example, imagine that we want the mean of 1e12
> > values. The master program could break the values into 100
> > chunks of 1e10 values each, and tell 100 other computers to each
> > find the mean of a single chunk. Those computers would each
> > report back their single value and the master program would in
> > turn take the mean of those 100 values, yielding the overall
> > mean of all 1e12 values.
>
> Do you have access to a 100-machine cluster to test it?

I have access to a smaller cluster of perhaps 25 machines. (The
combining step itself is simple; there's a sketch of it at the end of
this message.)

> > In step 3, we would use threads plus map-reduce or some other
> > parallel programming model to improve performance on symmetric
> > multiprocessing (SMP) machines.
> >
> > 4. Take #3 and then extend it, by allowing jobs to be farmed out
> >    not just to threads but to processes running on other
> >    computers as well. I won't speculate on how much work this
> >    would be, but it's clearly a big job.
>
> If you implement #3 using MPI, then there's nothing to be done for #4.
>
> However, I dabbled in MPI parallel processing a few years ago, and
> struggled to come up with a real-life problem which was large enough
> to overcome the extra overhead.

MPI might be what we want, but I'm not sure. At this point it's just
speculation. (I've appended a minimal MPI sketch of the mean example
below, to show what it might look like.)

> Some of the math for statistical procedures would need to be
> parallelised inside gsl --- I don't know if gsl supports parallel
> execution. For example, most matrix operations can be parallelised,
> which is worth doing for very large matrices.

Good point. (See the OpenMP sketch at the end of this message for how
mechanical that can be for a matrix operation.)

> I'm acutely aware that in some places the code performs very badly.
> In particular, operations which involve percentiles are currently
> implemented in a very non-optimal manner, and in fact will probably
> exhaust memory if passed very large data sets.

Code can always be optimized.

> Also, there are opportunities to cache things that procedures use.
> E.g., most parametric procedures make use of the data's covariance
> matrix. If we can let that persist between procedures, that will
> avoid a lot of calculations being repeated; just so long as we
> invalidate that cache when appropriate.

Yes, I forgot to put that in my list. It's probably parallel to item
#2. (A sketch of what the cache's invalidation contract might look
like is below.)

> It's not going to be much use having a PSPP which can copy data from
> A to B at the speed of light, if any procedures take a year to
> execute.

I think the goal should really be to avoid copying data where
possible. Presumably, in the multi-machine case, the data should be
stored on a network server or distributed among the machines on local
disks, not copied from a master machine to the many machines doing the
computation.
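Here are the sketches mentioned above. First, the master's combining
step for the chunked mean, in plain C with made-up names. Averaging
the 100 per-chunk means directly is only correct because the chunks
are equal-sized, so this sketch weights by chunk size to stay exact
in general:

    /* Combining step for a distributed mean. Hypothetical: each of
       the n workers reports (count, mean) for its chunk. Weighting
       by count gives the exact overall mean even when the chunks
       are not all the same size. */
    #include <stddef.h>

    struct chunk_result
      {
        double count;           /* number of values in the chunk */
        double mean;            /* mean of the values in the chunk */
      };

    static double
    combine_means (const struct chunk_result *r, size_t n)
    {
      double total = 0.0, weighted = 0.0;
      size_t i;

      for (i = 0; i < n; i++)
        {
          total += r[i].count;
          weighted += r[i].count * r[i].mean;
        }
      return weighted / total;
    }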
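Second, a minimal MPI sketch of the same mean computation, to
illustrate John's point that doing #3 with MPI gives much of #4 for
free: the identical program runs with one process per CPU on an SMP
box or with processes spread across a cluster. How each process gets
its slice of the data is waved away in a comment:

    /* Each process computes a partial (sum, count) over its share
       of the data; MPI_Reduce then combines the partials on rank 0. */
    #include <mpi.h>
    #include <stdio.h>

    int
    main (int argc, char **argv)
    {
      double part[2] = { 0.0, 0.0 };    /* { local sum, local count } */
      double tot[2];
      int rank;

      MPI_Init (&argc, &argv);
      MPI_Comm_rank (MPI_COMM_WORLD, &rank);

      /* ... read this process's chunk, accumulating into part[0]
         (sum of values) and part[1] (number of values) ... */

      MPI_Reduce (part, tot, 2, MPI_DOUBLE, MPI_SUM, 0,
                  MPI_COMM_WORLD);

      if (rank == 0)
        printf ("mean = %g\n", tot[0] / tot[1]);

      MPI_Finalize ();
      return 0;
    }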
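Third, on parallelising matrix operations: as far as I know GSL
itself is serial, so any parallelism would have to be layered on top
of it. For many matrix operations that is mechanical. E.g., in a
matrix-vector product the rows are independent, so a single OpenMP
pragma splits the outer loop across CPUs (plain arrays here for
brevity; the same structure would work over a gsl_matrix):

    /* y = A*x, with the row loop split across CPUs by OpenMP. The
       signed loop index is for the benefit of OpenMP
       implementations that require one. */
    #include <stddef.h>

    static void
    mat_vec (size_t n, const double *a /* n x n, row-major */,
             const double *x, double *y)
    {
      long i;

    #pragma omp parallel for
      for (i = 0; i < (long) n; i++)
        {
          double sum = 0.0;
          size_t j;

          for (j = 0; j < n; j++)
            sum += a[(size_t) i * n + j] * x[j];
          y[i] = sum;
        }
    }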
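Finally, a sketch of what the covariance cache's contract might look
like. All names here are hypothetical; the real invalidation hook
would have to live wherever transformations modify or replace the
active file:

    #include <stdbool.h>
    #include <stddef.h>

    struct cov_cache
      {
        bool valid;
        size_t n_vars;
        double *matrix;         /* n_vars x n_vars, row-major */
      };

    /* Every transformation that changes the data calls this. */
    static void
    cov_cache_invalidate (struct cov_cache *c)
    {
      c->valid = false;
    }

    /* Procedures call this; the matrix is recomputed only on a
       cache miss. */
    static const double *
    cov_cache_get (struct cov_cache *c)
    {
      if (!c->valid)
        {
          /* ... recompute c->matrix by a pass over the active
             file ... */
          c->valid = true;
        }
      return c->matrix;
    }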
