[ 
https://issues.apache.org/jira/browse/CRUNCH-178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Wills updated CRUNCH-178:
------------------------------

    Attachment: CRUNCH-178d.patch

I think that we want to distinguish between "seed not given" (and hence 
null-valued) and "seed = 0" in this context. We're making some compromises in 
the Sample.sample method to ensure that we have a consistent view of the 
backing dataset, e.g., if we have:

PCollection<T> input = ...;
PCollection<T> sampled = Sample.sample(input, 0.05);

...then we want/expect that the "sampled" PCollection should have the same 
contents no matter when we run a MapReduce over it. This requires that we 
create a seed at the time that Sample.sample is called. The rub of doing this 
is that the sample we create won't be truly random: since all of the partitions 
use the same seed, they'll all generate the same sequence of random numbers, 
which means that we'll see the same "slice" of each partition of the data. That 
said, I believe that this lack of randomness is necessary here to preserve the 
idea that a PCollection is truly immutable. We could do something fancy here, 
like adding a salt based on the task ID, if it ever became a real issue.
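
To make the shared-seed effect concrete, here is a minimal standalone sketch. It is not the code in the patch: the class SeededSampleSketch, the method names, and the particular salt mixing are all hypothetical, and it uses plain java.util.Random rather than anything from Crunch. It shows that two partitions seeded identically accept the same record positions, and one possible way a task-ID salt could change that while staying deterministic.

```java
import java.util.Arrays;
import java.util.Random;

public class SeededSampleSketch {

  // For a partition of numRecords records, mark which positions a
  // Bernoulli sampler seeded with `seed` would keep.
  static boolean[] acceptMask(long seed, int numRecords, double probability) {
    Random rng = new Random(seed);
    boolean[] keep = new boolean[numRecords];
    for (int i = 0; i < numRecords; i++) {
      keep[i] = rng.nextDouble() < probability;
    }
    return keep;
  }

  // Hypothetical salt: mix the task ID into the shared seed. The result is
  // still a pure function of (seed, taskId), so reruns see the same sample.
  static long saltedSeed(long seed, int taskId) {
    return seed ^ (0x9E3779B97F4A7C15L * (taskId + 1L));
  }

  public static void main(String[] args) {
    long seed = 42L; // assumed to be fixed when the sample is defined

    // Same seed in both partitions: the same positions are kept in each.
    boolean[] p0 = acceptMask(seed, 20, 0.25);
    boolean[] p1 = acceptMask(seed, 20, 0.25);
    System.out.println("unsalted masks identical: " + Arrays.equals(p0, p1));

    // Salted by task ID: each partition gets its own seed, and recomputing
    // the mask for a given task reproduces it exactly.
    boolean[] s1 = acceptMask(saltedSeed(seed, 1), 20, 0.25);
    boolean[] s1Again = acceptMask(saltedSeed(seed, 1), 20, 0.25);
    System.out.println("salted mask repeatable: " + Arrays.equals(s1, s1Again));
  }
}
```

The salted variant would only preserve the "same contents on every run" property if a given split always maps to the same task ID, which is an assumption this sketch does not check.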

In the reservoir sampling case, we don't have this restriction: reservoir 
sampling kicks off a MR job, so the PCollection that is returned will be 
materialized on disk somewhere, and so the view of it will already be 
immutable. Therefore, we are free to be "more" random here, and use a different 
Random instance (with a different seed) for all of the partitions of the data.
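
As a rough illustration of both points (the null-vs-0 seed distinction and a separate Random per partition), here is a sketch of the classic unweighted reservoir algorithm applied to a single partition. It is not the implementation in the patch and not the weighted variant from the referenced paper; the ReservoirSketch class and its sample method are hypothetical names, and the nullable Long seed is an assumption about how "seed not given" might be represented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ReservoirSketch {

  // Keep a uniform sample of up to k items from one partition.
  // A null seed means "not given", so a fresh Random is used; a seed of 0
  // is a real seed and gives a repeatable sample.
  static <T> List<T> sample(Iterable<T> partition, int k, Long seed) {
    Random rng = (seed == null) ? new Random() : new Random(seed);
    List<T> reservoir = new ArrayList<>(k);
    long seen = 0;
    for (T item : partition) {
      seen++;
      if (reservoir.size() < k) {
        // Fill the reservoir with the first k items.
        reservoir.add(item);
      } else {
        // Pick a slot in [0, seen); the item is kept only if the slot
        // falls inside the reservoir, i.e. with probability k / seen.
        long slot = (long) (rng.nextDouble() * seen);
        if (slot < k) {
          reservoir.set((int) slot, item);
        }
      }
    }
    return reservoir;
  }
}
```

Calling sample(records, 5, null) on each partition gives every partition its own random sequence, while sample(records, 5, 0L) is repeatable, which is why the two cases need to stay distinguishable.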

Aside from that, I javadoc'd everything properly in the attached patch, mostly 
via copy and paste, and fixed the '<' characters. Don't apologize for nits; 
it's the only way we're ever going to get this stuff cleaned up.
                
> Add library functions for performing distributed reservoir sampling
> -------------------------------------------------------------------
>
>                 Key: CRUNCH-178
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-178
>             Project: Crunch
>          Issue Type: Improvement
>          Components: MapReduce Patterns
>            Reporter: Josh Wills
>         Attachments: CRUNCH-178b.patch, CRUNCH-178c.patch, CRUNCH-178d.patch, 
> CRUNCH-178.patch
>
>
> For a project I've been working on, I wrote up some Crunch functions for 
> performing reservoir sampling and weighted reservoir sampling that I think 
> would be useful enough to put in lib.* Here's the paper that I used as a 
> reference for the implementations I wrote:
> http://arxiv.org/pdf/1012.0256.pdf

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
