On Fri, May 30, 2008 at 01:58:17PM -0700, Ben Pfaff wrote: > Ed <[EMAIL PROTECTED]> writes: > > > (To avoid polluting the standard thread I copied this part across) > > > >> 2008/5/30 Ben Pfaff <[EMAIL PROTECTED]>: > >> But I'm not sure why you think that supporting large data sets > >> has to be highly complex. A lot of it is just Computer Science > >> 101 type stuff; for example, merge sorts for efficiently sorting > >> large data sets (which we already implement). What do you have > >> in mind? > > > > I guess I was thinking of matrix operations in particular. I've been > > reading the linear regression code as Jason suggested, and the best > > way to implement linear algebra on a large dataset isn't immediately > > clear to me. > > You're right. We do not have any provisions for out-of-core > linear algebra. Whether this is an important omission, I will > have to defer to others.
Out-of-core linear algebra is a big problem, and I don't know if it's something we want to deal with. Right now, the regression procedure reads the entire data set, which is not good for large data sets, so you would think PSPP should be able to do out-of-core linear algebra. But there are two ways to make a large data set: Lots of cases, or lots of variables. They should be handled differently. If we have lots of cases, but not lots of variables, we don't need any out-of-core linear algebra - we can use GSL's CBLAS or write our own procedures if they're simple. Most statistical procedures don't need to keep the whole data set around, only sufficient statistics like a covariance matrix. So if we have a data set with n cases and p variables, and n is large but p is not, then we can pass the data once to get the sufficient statistics, and keep those in memory to do the remaining computations. The covariance matrix, for example, will be p by p, which is small enough to hold in memory, and contains enough information to estimate via least squares. If it's an iterative procedure where the covariance matrix must be updated, we could just pass through the data again and update the covariance matrix. If a data set has n cases and p variables, and p is large, then the covariance matrix will be enormous, and we won't be able to hold it memory. I still argue against trying to add out-of-core linear algebra for these reasons: First, developing all those procedures is going to be a lot of work. Any good linear algebra package takes a lot of coding and time. Take a look at GSL's CBLAS or Lapack routines. Second, out-of-core algorithms will be very slow. Third, out-of-core algorithms aren't necessary most of the time. They are only necessary when the number of variables is very large. Fourth, instances when PSPP would need out-of-core linear algebra are those in which the user is likely to be doing something they shouldn't be doing. Why would anyone fit a linear model with 5000 variables? Many of those variables are correlated, causing conditioning problems on the matrices, which make computations harder. With 5000 variables in a linear model, the user probably hasn't bothered to think about which variables are really explanatory and which just happen to be there. There would probably be other problems, too. Despite sounding judgemental here, I do think we should make PSPP able to do the things users want, when feasible. But given that such algorithms are slow, not often necessary, hard to develop and maintain, and likely to be used for inappropriate analyses, why not just stick with typical algorithms? We can use GSL's CBLAS all we want, and if we want to optimize a particular procedure, we can implement an algorithm from Golub and Van Loan. I believe most procedures will perform fine if they can just pass the data once and build sufficient statistics along the way. That will still make PSPP able to handle very large data sets. And if the number of variables turns out to be a limitation for a lot of users, we can change the code later anyway. -Jason _______________________________________________ pspp-dev mailing list [email protected] http://lists.gnu.org/mailman/listinfo/pspp-dev
