[Martin Maechler]

> FrPi> Suppose the file (or tape) holds N records (N is not known
> FrPi> in advance), from which we want a sample of M records at
> FrPi> most.  [...]  If the algorithm is carefully designed, when
> FrPi> the last (N'th) record of the file has been processed this
> FrPi> way, we then have M records randomly selected from the N
> FrPi> records, in such a way that each of the N records had an
> FrPi> equal probability of ending up in the selection of M
> FrPi> records.  I may seek out the details if needed.
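For concreteness, here is a minimal sketch in R of the scheme described above (the classic "reservoir" method): keep the first M records, then let the t-th record enter the sample with probability M/t, replacing a uniformly chosen current member.  This is not the SPSS implementation, and the record-reading details (one record per line on an open text connection) are only assumptions for illustration:

    ## Sketch of reservoir sampling: returns at most m records, each of the
    ## N records read having probability m/N of being kept, with N unknown
    ## until the end of the input.
    reservoir_sample <- function(con, m) {
      kept <- character(0)
      t <- 0
      repeat {
        rec <- readLines(con, n = 1)          # read one record at a time
        if (length(rec) == 0) break           # end of file: N is only known now
        t <- t + 1
        if (t <= m) {
          kept[t] <- rec                      # fill the reservoir first
        } else if (runif(1) < m / t) {
          kept[sample(m, 1)] <- rec           # replace a uniformly chosen slot
        }
      }
      kept
    }

    ## Hypothetical usage, assuming a file with one record per line:
    con <- file("records.txt", open = "r")
    chosen <- reservoir_sample(con, 10)
    close(con)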
> [...]  I'm also intrigued about the details of the algorithm you
> outline above.

I went into my old SPSS books and related references to find it for you, to no avail (though I confess I did not try very hard).  I vaguely remember it was related to the computation of Spearman's correlation: I did find notes about the "severe memory limitation" of that computation, but nothing about the workaround that was implemented.  I did find other sampling devices, but not the very one I remember having read about, many years ago.

On the other hand, Googling shows that this topic has been much studied, and that Vitter's algorithm Z seems to be popular nowadays (even if it is not the simplest), because it is more efficient than the others.  Google found a copy of the paper:

    http://www.cs.duke.edu/~jsv/Papers/Vit85.Reservoir.pdf

Here is an implementation for Postgres:

    http://svr5.postgresql.org/pgsql-patches/2004-05/msg00319.php

yet I do not find it very readable -- but this is only an opinion: I'm rather demanding in the area of legibility, while many or most people are more courageous than me! :-)

--
François Pinard   http://pinard.progiciels-bpi.ca
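For what it is worth, the gain in Vitter's approach comes from drawing, in a single step, the number of records to skip before the next replacement, instead of drawing one random number per record.  Below is only a rough sketch of that idea in R, closer to the simpler "Algorithm X" of the same paper than to Algorithm Z itself; the function name and interface are my own:

    ## Draw how many records to skip, given a reservoir of size m and t
    ## records already processed (t >= m).  Inverse-CDF sampling of the
    ## skip distribution P(skip > s) = prod_{i = t+1}^{t+s+1} (i - m) / i.
    skip_count <- function(m, t) {
      u <- runif(1)
      s <- 0
      tail_prob <- (t + 1 - m) / (t + 1)      # P(skip > 0)
      while (tail_prob >= u) {
        s <- s + 1
        tail_prob <- tail_prob * (t + s + 1 - m) / (t + s + 1)
      }
      s
    }

After skipping that many records, the next record replaces a uniformly chosen member of the reservoir and t advances past it.  Algorithm Z then spends extra care on computing the skip quickly when t is large, which is where its efficiency comes from.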
