Re: Shuf reservoir sampling
Hello, (reviving an old thread, sorry for the delayed response). On 12/28/2013 03:36 PM, Jyothis V wrote: ... Hi, thanks for the reply. I understand why something like reservoir sampling is needed. But in shuf.c, shuffling is done in two steps: 1) using reservoir sampling, an array of length head_length is obtained. At this stage, the array is not completely shuffled because the first head_length elements were copied verbatim. 2) This array is then randomly permuted while writing. My question is whether these two steps could be clubbed together, just as shown in the second algorithm given in the wikipedia page you mentioned. I didn't have a look into the Maths behind yet, nor was I involved during that last improvement. Further improvement is maybe possible, and the best way to push this is providing code. Are you willing to propose such a patch? Regarding the shuffle correctness: Yes, the data is first read into the array, and only later permuted. I believe the implementation is correct (ie it does randomly shuffles the input), and if this is not the case, it's a bug and should be fixed. Regarding the implementation: In shuf.c there's an intricate interplay between reading the input and writing the output - notice that the input is closed explicitly half-way through main(), before any output is ever written. The initial patch was written to maintain this behavior, and minimally disrupt the existing code flow: http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=20d7bce0f7e57d9a98f0ee811e31c757e9fedfff This is not to say a better implementation is not possible, just that there are few technical details to note before changing 'shuf'. There are certainly ways to improve the code. HTH, -Gordon
Re: Shuf reservoir sampling
On 12/28/2013 05:13 PM, Jyothis V wrote: Hi, thanks for the reply. I understand why something like reservoir sampling is needed. But in shuf.c, shuffling is done in two steps: 1) using reservoir sampling, an array of length head_length is obtained. At this stage, the array is not completely shuffled because the first head_length elements were copied verbatim. 2) This array is then randomly permuted while writing. My question is whether these two steps could be clubbed together, just as shown in the second algorithm given in the wikipedia page you mentioned. Yes there are more improvements possible in this area. Also we should avoid the more CPU intensive reservoir sampling in the common case of reading small inputs from stdin: http://lists.gnu.org/archive/html/coreutils/2013-03/msg00057.html http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/shuf.c;h=456140#l272 Also reservoir sampling is currently not used with --repeat. But it does seem possible to use reservoir-sampling with replacement: https://www.siam.org/proceedings/datamining/2004/dm04_053parkb.pdf thanks, Pádraig.
Shuf reservoir sampling
Hi, I would like to know why shuf.c is using reservoir sampling + write_permuted_output_reservoir rather than just using an inside-out version Fisher-Yates shuffle. -- Jyothis V jyoth...@gmx.com
Re: Shuf reservoir sampling
On 12/28/2013 10:53 AM, Jyothis V wrote: I would like to know why shuf.c is using reservoir sampling + write_permuted_output_reservoir rather than just using an inside-out version Fisher-Yates shuffle. Reservoir sampling has been added to shuf in the youngest coreutils release 8.22 via this commit: http://git.sv.gnu.org/cgit/coreutils.git/commit/?id=20d7bce0 The NEWS file tells us why this is more efficient than before: http://git.savannah.gnu.org/cgit/coreutils.git/tree/NEWS?id=20d7bce0#n28 shuf outputs subsets of large inputs much more efficiently. Reservoir sampling is used to limit memory usage based on the number of outputs, rather than the number of inputs. This advantage is also mentioned on Wikipedia: http://en.wikipedia.org/wiki/Reservoir_sampling#Relation_to_Fisher-Yates_shuffle [...] Note that although the rest of the cards are shuffled, only the top k are important in the present context. Therefore, the array a need only track the cards in the top k positions while performing the shuffle, reducing the amount of memory needed. Truncating a to length k, the algorithm is modified accordingly: [...] Have a nice day, Berny
Re: Shuf reservoir sampling
Hi Jyothis, On 12/28/2013 03:36 PM, Jyothis V wrote: On Sat, 28 Dec 2013 13:49:53 +0100 Bernhard Voelker m...@bernhard-voelker.de wrote: On 12/28/2013 10:53 AM, Jyothis V wrote: I would like to know why shuf.c is using reservoir sampling + write_permuted_output_reservoir rather than just using an inside-out version Fisher-Yates shuffle. Reservoir sampling has been added to shuf in the youngest coreutils release 8.22 via this commit: http://git.sv.gnu.org/cgit/coreutils.git/commit/?id=20d7bce0 CCing Assaf who is the author of that commit. Hi, thanks for the reply. I understand why something like reservoir sampling is needed. But in shuf.c, shuffling is done in two steps: 1) using reservoir sampling, an array of length head_length is obtained. At this stage, the array is not completely shuffled because the first head_length elements were copied verbatim. 2) This array is then randomly permuted while writing. My question is whether these two steps could be clubbed together, just as shown in the second algorithm given in the wikipedia page you mentioned. I didn't have a look into the Maths behind yet, nor was I involved during that last improvement. Further improvement is maybe possible, and the best way to push this is providing code. Are you willing to propose such a patch? Thanks have a nice day, Berny
Re: Shuf reservoir sampling
Hi Bernhard, I'd be happy to do that. But first, I just want to make sure that I'm not missing anything. On Sat, 28 Dec 2013 18:18:27 +0100 Bernhard Voelker m...@bernhard-voelker.de wrote: Hi Jyothis, On 12/28/2013 03:36 PM, Jyothis V wrote: On Sat, 28 Dec 2013 13:49:53 +0100 Bernhard Voelker m...@bernhard-voelker.de wrote: On 12/28/2013 10:53 AM, Jyothis V wrote: I would like to know why shuf.c is using reservoir sampling + write_permuted_output_reservoir rather than just using an inside-out version Fisher-Yates shuffle. Reservoir sampling has been added to shuf in the youngest coreutils release 8.22 via this commit: http://git.sv.gnu.org/cgit/coreutils.git/commit/?id=20d7bce0 CCing Assaf who is the author of that commit. Hi, thanks for the reply. I understand why something like reservoir sampling is needed. But in shuf.c, shuffling is done in two steps: 1) using reservoir sampling, an array of length head_length is obtained. At this stage, the array is not completely shuffled because the first head_length elements were copied verbatim. 2) This array is then randomly permuted while writing. My question is whether these two steps could be clubbed together, just as shown in the second algorithm given in the wikipedia page you mentioned. I didn't have a look into the Maths behind yet, nor was I involved during that last improvement. Further improvement is maybe possible, and the best way to push this is providing code. Are you willing to propose such a patch? Thanks have a nice day, Berny -- Jyothis V jyoth...@gmx.com