Re: Shuf reservoir sampling

2014-01-23 Thread Assaf Gordon

Hello,

(reviving an old thread, sorry for the delayed response).

 On 12/28/2013 03:36 PM, Jyothis V wrote:
...

Hi, thanks for the reply. I understand why something like reservoir
sampling is needed. But in shuf.c, shuffling is done in two steps:
1) using reservoir sampling, an array of length head_length is obtained.
At this stage, the array is not completely shuffled because the
first head_length elements were copied verbatim. 2) This array is
then randomly permuted while writing. My question is whether these
two steps could be clubbed together, just as shown in the second
algorithm given in the wikipedia page you mentioned.


I didn't have a look into the Maths behind yet, nor was I involved
during that last improvement. Further improvement is maybe possible,
and the best way to push this is providing code. Are you willing
to propose such a patch?



Regarding the shuffle correctness:
Yes, the data is first read into the array, and only later permuted.
I believe the implementation is correct (ie it does randomly shuffles 
the input), and if this is not the case, it's a bug and should be fixed.


Regarding the implementation:
In shuf.c there's an intricate interplay between reading the input and 
writing the output - notice that the input is closed explicitly half-way 
through main(), before any output is ever written.


The initial patch was written to maintain this behavior, and minimally 
disrupt the existing code flow:

http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=20d7bce0f7e57d9a98f0ee811e31c757e9fedfff

This is not to say a better implementation is not possible, just that 
there are few technical details to note before changing 'shuf'.


There are certainly ways to improve the code.

HTH,
 -Gordon




Re: Shuf reservoir sampling

2013-12-29 Thread Pádraig Brady
On 12/28/2013 05:13 PM, Jyothis V wrote:
 Hi, thanks for the reply. I understand why something like reservoir sampling 
 is needed. But in shuf.c, shuffling is done in two steps: 1) using reservoir 
 sampling, an array of length head_length is obtained. At this stage, the 
 array is not completely shuffled because the first head_length elements were 
 copied verbatim. 2) This array is then randomly permuted while writing. My 
 question is whether these two steps could be clubbed together, just as shown 
 in the second algorithm given in the wikipedia page you mentioned.

Yes there are more improvements possible in this area.

Also we should avoid the more CPU intensive reservoir sampling
in the common case of reading small inputs from stdin:
http://lists.gnu.org/archive/html/coreutils/2013-03/msg00057.html
http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/shuf.c;h=456140#l272

Also reservoir sampling is currently not used with --repeat.
But it does seem possible to use reservoir-sampling with replacement:
https://www.siam.org/proceedings/datamining/2004/dm04_053parkb.pdf

thanks,
Pádraig.



Shuf reservoir sampling

2013-12-28 Thread Jyothis V
Hi,
 I would like to know why shuf.c is using reservoir sampling + 
write_permuted_output_reservoir rather than just using an inside-out version 
Fisher-Yates shuffle.

-- 
Jyothis V jyoth...@gmx.com



Re: Shuf reservoir sampling

2013-12-28 Thread Bernhard Voelker
On 12/28/2013 10:53 AM, Jyothis V wrote:
 I would like to know why shuf.c is using reservoir sampling +
 write_permuted_output_reservoir rather than just using an
 inside-out version Fisher-Yates shuffle.

Reservoir sampling has been added to shuf in the youngest coreutils
release 8.22 via this commit:

  http://git.sv.gnu.org/cgit/coreutils.git/commit/?id=20d7bce0

The NEWS file tells us why this is more efficient than before:

  http://git.savannah.gnu.org/cgit/coreutils.git/tree/NEWS?id=20d7bce0#n28

  shuf outputs subsets of large inputs much more efficiently.
  Reservoir sampling is used to limit memory usage based on the number of
  outputs, rather than the number of inputs.

This advantage is also mentioned on Wikipedia:

  
http://en.wikipedia.org/wiki/Reservoir_sampling#Relation_to_Fisher-Yates_shuffle

  [...]
  Note that although the rest of the cards are shuffled, only the top k are
  important in the present context. Therefore, the array a need only track
  the cards in the top k positions while performing the shuffle, reducing
  the amount of memory needed. Truncating a to length k, the algorithm is
  modified accordingly:
  [...]

Have a nice day,
Berny



Re: Shuf reservoir sampling

2013-12-28 Thread Bernhard Voelker
Hi Jyothis,

On 12/28/2013 03:36 PM, Jyothis V wrote:
 On Sat, 28 Dec 2013 13:49:53 +0100
 Bernhard Voelker m...@bernhard-voelker.de wrote:
 
 On 12/28/2013 10:53 AM, Jyothis V wrote:
 I would like to know why shuf.c is using reservoir sampling +
 write_permuted_output_reservoir rather than just using an
 inside-out version Fisher-Yates shuffle.

 Reservoir sampling has been added to shuf in the youngest coreutils
 release 8.22 via this commit:

   http://git.sv.gnu.org/cgit/coreutils.git/commit/?id=20d7bce0

CCing Assaf who is the author of that commit.

 Hi, thanks for the reply. I understand why something like reservoir
 sampling is needed. But in shuf.c, shuffling is done in two steps:
 1) using reservoir sampling, an array of length head_length is obtained.
 At this stage, the array is not completely shuffled because the
 first head_length elements were copied verbatim. 2) This array is
 then randomly permuted while writing. My question is whether these
 two steps could be clubbed together, just as shown in the second
 algorithm given in the wikipedia page you mentioned.

I didn't have a look into the Maths behind yet, nor was I involved
during that last improvement. Further improvement is maybe possible,
and the best way to push this is providing code. Are you willing
to propose such a patch?

Thanks  have a nice day,
Berny



Re: Shuf reservoir sampling

2013-12-28 Thread Jyothis V
Hi Bernhard, I'd be happy to do that. But first, I just want to make sure that 
I'm not missing anything.


On Sat, 28 Dec 2013 18:18:27 +0100
Bernhard Voelker m...@bernhard-voelker.de wrote:

 Hi Jyothis,
 
 On 12/28/2013 03:36 PM, Jyothis V wrote:
  On Sat, 28 Dec 2013 13:49:53 +0100
  Bernhard Voelker m...@bernhard-voelker.de wrote:
  
  On 12/28/2013 10:53 AM, Jyothis V wrote:
  I would like to know why shuf.c is using reservoir sampling +
  write_permuted_output_reservoir rather than just using an
  inside-out version Fisher-Yates shuffle.
 
  Reservoir sampling has been added to shuf in the youngest coreutils
  release 8.22 via this commit:
 
http://git.sv.gnu.org/cgit/coreutils.git/commit/?id=20d7bce0
 
 CCing Assaf who is the author of that commit.
 
  Hi, thanks for the reply. I understand why something like reservoir
  sampling is needed. But in shuf.c, shuffling is done in two steps:
  1) using reservoir sampling, an array of length head_length is obtained.
  At this stage, the array is not completely shuffled because the
  first head_length elements were copied verbatim. 2) This array is
  then randomly permuted while writing. My question is whether these
  two steps could be clubbed together, just as shown in the second
  algorithm given in the wikipedia page you mentioned.
 
 I didn't have a look into the Maths behind yet, nor was I involved
 during that last improvement. Further improvement is maybe possible,
 and the best way to push this is providing code. Are you willing
 to propose such a patch?
 
 Thanks  have a nice day,
 Berny


-- 
Jyothis V jyoth...@gmx.com