(To avoid polluting the standard thread I copied this part across)

> 2008/5/30 Ben Pfaff <[EMAIL PROTECTED]>:
> But I'm not sure why you think that supporting large data sets
> has to be highly complex.  A lot of it is just Computer Science
> 101 type stuff; for example, merge sorts for efficiently sorting
> large data sets (which we already implement).  What do you have
> in mind?
I guess I was thinking of matrix operations in particular. I've been
reading the linear regression code as Jason suggested, and the best way
to implement linear algebra on a large dataset isn't immediately clear
to me. My recollection of the maths in this area is a little hazy. The
obvious cases we might need seem to be:

vector <dot> vector:
  Straightforward: stream in matching chunks of the two vectors and
  accumulate. Working memory is 2 * chunk size + 1 (one buffer per
  vector, plus the running total). There's a rough sketch after my
  signature.

matrix x vector:
  The matrix could be wide or high (we assume it's very non-square,
  since otherwise no algorithm is likely to run in useful time anyway).
  All this basically changes is the order we stream things in. If wide,
  we stream across the matrix, calculating all n result elements
  simultaneously. If high, we stream down the matrix, calculating each
  of the n result elements one after another and saving it to an output
  stream. Memory usage: wide, n totals + 2n chunks; high, 1 total + m
  locations (much cheaper, I guess). A sketch of both access patterns
  is appended below.

matrix x matrix:
  In the case (n x m) x (m x p) where n, p <<< m, can we build the
  entire result matrix as we stream the inputs across m? (I've appended
  a sketch of how I imagine that working.) The other cases I'm not
  thinking about so clearly...

Is there an existing library that would do this for us? This is the
sort of thing that seems complicated to me, anyway.

Ed
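P.S. To make the streaming idea concrete, here's a minimal C sketch of
the dot-product case. The chunk source is a stand-in (a plain array
walked in fixed-size runs); in a real implementation it would
presumably be backed by a casereader or a temporary file, so treat that
interface as hypothetical.

/* Minimal sketch of a streamed dot product, assuming the data arrives
   as a sequence of fixed-size chunks. */
#include <stddef.h>
#include <stdio.h>

#define CHUNK 4   /* tiny for illustration; would be large in practice */

/* Hypothetical chunk source: copies the next run of up to CHUNK values
   of vec into buf, returning the count (0 at end of data). */
static size_t
next_chunk (const double *vec, size_t len, size_t *pos, double buf[CHUNK])
{
  size_t n = 0;
  while (n < CHUNK && *pos < len)
    buf[n++] = vec[(*pos)++];
  return n;
}

/* Streams matching chunks of x and y; working memory is two chunk
   buffers plus one running total, independent of the vector length. */
static double
dot_streamed (const double *x, const double *y, size_t len)
{
  double bx[CHUNK], by[CHUNK];
  size_t px = 0, py = 0;
  double total = 0.0;
  size_t n;

  while ((n = next_chunk (x, len, &px, bx)) > 0)
    {
      next_chunk (y, len, &py, by);   /* same length, so same count */
      for (size_t i = 0; i < n; i++)
        total += bx[i] * by[i];
    }
  return total;
}

int
main (void)
{
  double x[] = { 1, 2, 3, 4, 5, 6 };
  double y[] = { 6, 5, 4, 3, 2, 1 };
  printf ("%g\n", dot_streamed (x, y, 6));   /* prints 56 */
  return 0;
}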
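Here's the matrix x vector case in both shapes. Flat row-major arrays
stand in for whatever the on-disk stream would really be; the loop
order is the point, since it shows which elements have to be resident
at once in each shape.

/* Sketch of matrix-times-vector in the "wide" and "high" shapes. */
#include <stddef.h>
#include <stdio.h>

/* Wide case (m >> n): stream across the long dimension, keeping all n
   running totals in memory; at step j we need only column j of A and
   the single vector element x[j]. */
static void
matvec_wide (const double *a, const double *x,
             size_t n, size_t m, double *y /* the n results */)
{
  for (size_t i = 0; i < n; i++)
    y[i] = 0.0;
  for (size_t j = 0; j < m; j++)       /* one pass across A and x */
    for (size_t i = 0; i < n; i++)
      y[i] += a[i * m + j] * x[j];
}

/* High case (n >> m): x fits in memory (m locations), so we stream
   down the rows, emitting one result element at a time through an
   output callback and then forgetting it. */
static void
matvec_high (const double *a, const double *x,
             size_t n, size_t m, void (*emit) (double))
{
  for (size_t i = 0; i < n; i++)
    {
      const double *row = a + i * m;   /* "read the next row" */
      double total = 0.0;              /* the single running total */
      for (size_t j = 0; j < m; j++)
        total += row[j] * x[j];
      emit (total);
    }
}

static void
print_elem (double v)
{
  printf ("%g\n", v);
}

int
main (void)
{
  double a[] = { 1, 0,
                 0, 1,
                 2, 3 };               /* 3 x 2, "high" shape */
  double x[] = { 10, 20 };
  double y[2];

  matvec_high (a, x, 3, 2, print_elem);   /* prints 10, 20, 80 */

  /* The same storage reread as a 2 x 3 "wide" matrix times a 3-vector. */
  double x3[] = { 1, 2, 3 };
  matvec_wide (a, x3, 2, 3, y);
  printf ("%g %g\n", y[0], y[1]);          /* prints 1 14 */
  return 0;
}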
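And the (n x m) x (m x p) case with n, p <<< m: this seems to work.
The n x p result fits in memory and can be built as a sum of rank-1
(outer-product) updates during a single pass down the long dimension m;
at step k we need only column k of A and row k of B. A sketch, under
the same stand-in assumptions:

/* Sketch of (n x m) x (m x p), n, p <<< m, as streamed rank-1 updates. */
#include <stddef.h>
#include <stdio.h>

static void
matmul_streamed (const double *a,  /* n x m, row-major; column k streamed */
                 const double *b,  /* m x p, row-major; row k streamed */
                 size_t n, size_t m, size_t p,
                 double *c)        /* n x p result, held in memory */
{
  for (size_t i = 0; i < n * p; i++)
    c[i] = 0.0;

  for (size_t k = 0; k < m; k++)       /* one pass down the long axis */
    {
      const double *brow = b + k * p;  /* row k of B */
      for (size_t i = 0; i < n; i++)
        {
          double aik = a[i * m + k];   /* element i of column k of A */
          for (size_t j = 0; j < p; j++)
            c[i * p + j] += aik * brow[j];   /* rank-1 update of C */
        }
    }
}

int
main (void)
{
  /* (2 x 3) x (3 x 2). */
  double a[] = { 1, 2, 3,
                 4, 5, 6 };
  double b[] = { 1, 0,
                 0, 1,
                 1, 1 };
  double c[4];

  matmul_streamed (a, b, 2, 3, 2, c);
  printf ("%g %g\n%g %g\n", c[0], c[1], c[2], c[3]);   /* 4 5 / 10 11 */
  return 0;
}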
