On 12/28/2013 10:53 AM, Jyothis V wrote: > I would like to know why shuf.c is using reservoir sampling + > write_permuted_output_reservoir rather than just using an > inside-out version Fisher-Yates shuffle.
Reservoir sampling has been added to shuf in the youngest coreutils release 8.22 via this commit: http://git.sv.gnu.org/cgit/coreutils.git/commit/?id=20d7bce0 The NEWS file tells us why this is more efficient than before: http://git.savannah.gnu.org/cgit/coreutils.git/tree/NEWS?id=20d7bce0#n28 shuf outputs subsets of large inputs much more efficiently. Reservoir sampling is used to limit memory usage based on the number of outputs, rather than the number of inputs. This advantage is also mentioned on Wikipedia: http://en.wikipedia.org/wiki/Reservoir_sampling#Relation_to_Fisher-Yates_shuffle [...] Note that although the rest of the cards are shuffled, only the top k are important in the present context. Therefore, the array a need only track the cards in the top k positions while performing the shuffle, reducing the amount of memory needed. Truncating a to length k, the algorithm is modified accordingly: [...] Have a nice day, Berny
