Re: Shuf reservoir sampling

Bernhard Voelker Sat, 28 Dec 2013 04:51:07 -0800

On 12/28/2013 10:53 AM, Jyothis V wrote:
> I would like to know why shuf.c is using reservoir sampling +
> write_permuted_output_reservoir rather than just using an
> inside-out version Fisher-Yates shuffle.


Reservoir sampling has been added to shuf in the youngest coreutils
release 8.22 via this commit:

  http://git.sv.gnu.org/cgit/coreutils.git/commit/?id=20d7bce0

The NEWS file tells us why this is more efficient than before:

  http://git.savannah.gnu.org/cgit/coreutils.git/tree/NEWS?id=20d7bce0#n28

  shuf outputs subsets of large inputs much more efficiently.
  Reservoir sampling is used to limit memory usage based on the number of
  outputs, rather than the number of inputs.

This advantage is also mentioned on Wikipedia:

  
http://en.wikipedia.org/wiki/Reservoir_sampling#Relation_to_Fisher-Yates_shuffle

  [...]
  Note that although the rest of the cards are shuffled, only the top k are
  important in the present context. Therefore, the array a need only track
  the cards in the top k positions while performing the shuffle, reducing
  the amount of memory needed. Truncating a to length k, the algorithm is
  modified accordingly:
  [...]

Have a nice day,
Berny

Re: Shuf reservoir sampling

Reply via email to