To start with, I was mostly interested in adding stats functions. For example, I have written various ANOVA/ANCOVA/GLM routines to analyse my own experimental results (they'd have to be rewritten, but at least I'd only do it once instead of rolling my own for every new project). I also have some things like multidimensional scaling and various classifiers lying around. Oh, and I have an implementation of affinity clustering from the paper in Nature last year (I wanted to improve the space bound on their algorithm, but it turned out they "underplayed" the relative importance of some of the features, so my efforts to improve the memory usage stalled).
I started by reading src/language/stats/oneway.q to see how the existing ANOVA is done, but what strikes me is that it would be very time-consuming and inefficient to go through 1000+ lines of code (plus more from preprocessing) for each new ANOVA-like function. Most of the code seems to fall into one of:

1/ table generation
2/ argument parsing
3/ statistic generation
4/ infrastructure

Ideally it would all be in 3.

For 1, I wonder if some kind of template system would work: have a template language that you can define table layouts in, with suitable field names or whatever, so that the code can just call make_table(template_file, heres_my_data).

For 2, I don't really like generating per-command parsers with preprocessing, unless the parser is very sophisticated. I once worked on a commercial codebase that had a centralised lexer/parser. As far as I remember, to add a new function you basically defined a new function token in the parser and a new routine for it to call somewhere else. Arguments were handled by stack magic, I think; I'm not advocating that exactly, but something along these lines is definitely possible, and it greatly reduces the per-function overhead. Another possibility is bison (which again decouples things).

For 4, it would take someone much more familiar with the codebase to know how to reduce the amount of marshalling, piping, and command callouts. In seemingly similar situations I've seen before, more powerful (more specific) driver routines solved these problems, i.e. the routines had wrappers that did all the necessary infrastructure, assuming the values had been calculated correctly.

These are just my thoughts at the moment; obviously it'll take me quite a while to become familiar with the codebase, and my estimates at this point are probably uneducated.

Ed

2008/5/27 John Darrington <[EMAIL PROTECTED]>:
> On Mon, May 26, 2008 at 02:56:47AM +0100, Ed wrote:
> > The situation is pretty simple. I thought I'd see if I could
> > contribute to pspp, and the first step is to pull cvs to make sure
> > you're looking at the newest code.
>
> That's good to hear. Is there any particular area that you're
> interested in working on?
>
> J'
>
> --
> PGP Public key ID: 1024D/2DE827B3
> fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
> See http://pgp.mit.edu or any PGP keyserver for public key.
>
> _______________________________________________
> pspp-dev mailing list
> [email protected]
> http://lists.gnu.org/mailman/listinfo/pspp-dev
