shuf implementation - should/could it be "reservoir sampling "

Cook, Malcolm Tue, 27 Nov 2012 09:41:22 -0800

There is a discussion at http://news.ycombinator.com/item?id=4833546 comparing 
shuf with dimsum 
(http://blog.noblemail.ca/2012/11/how-to-get-random-lines-out-of-file-or.html)


dimsum uses reservoir sampling so it should have a constant memory footprint 
proportional to -n (though due to a bug with fix in progress, it currently does 
not)

I note that shuf does not have a constant memory footprint.   This can be 
observed, i.e. , with

yes | shuf -n 5

 and running top to watch the memory grow seeming boundlessly.

1)  what is the underlying algorithm ??
2) should/could it be changed to use "reservoir sampling" to obtain benefits of 
constant memory footprint?

Thanks for your consideration!

Malcolm Cook
Computational Biology - Stowers Institute for Medical Research

shuf implementation - should/could it be "reservoir sampling "

Reply via email to