Re: Shuf reservoir sampling

Bernhard Voelker Sat, 28 Dec 2013 09:19:40 -0800

Hi Jyothis,

On 12/28/2013 03:36 PM, Jyothis V wrote:
> On Sat, 28 Dec 2013 13:49:53 +0100
> Bernhard Voelker <[email protected]> wrote:
> 
>> On 12/28/2013 10:53 AM, Jyothis V wrote:
>>> I would like to know why shuf.c is using reservoir sampling +
>>> write_permuted_output_reservoir rather than just using an
>>> inside-out version Fisher-Yates shuffle.
>>
>> Reservoir sampling has been added to shuf in the youngest coreutils
>> release 8.22 via this commit:
>>
>>   http://git.sv.gnu.org/cgit/coreutils.git/commit/?id=20d7bce0


CCing Assaf who is the author of that commit.

> Hi, thanks for the reply. I understand why something like reservoir
> sampling is needed. But in shuf.c, shuffling is done in two steps:
> 1) using reservoir sampling, an array of length head_length is obtained.
> At this stage, the array is not completely shuffled because the
> first head_length elements were copied verbatim. 2) This array is
> then randomly permuted while writing. My question is whether these
> two steps could be clubbed together, just as shown in the second
> algorithm given in the wikipedia page you mentioned.

I didn't have a look into the Maths behind yet, nor was I involved
during that last improvement. Further improvement is maybe possible,
and the best way to push this is providing code. Are you willing
to propose such a patch?

Thanks & have a nice day,
Berny

Re: Shuf reservoir sampling

Reply via email to