[ https://issues.apache.org/jira/browse/CRUNCH-178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Wills updated CRUNCH-178:
------------------------------
Attachment: CRUNCH-178d.patch
I think that we want to distinguish between "seed not given" (and hence
null-valued) and "seed = 0" in this context. We're making some compromises in
the Sample.sample method to ensure that we have a consistent view of the
backing dataset, e.g., if we have:

    PCollection<T> input = ...;
    PCollection<T> sampled = Sample.sample(input, 0.05);

...then we want/expect that the "sampled" PCollection should have the same
contents no matter when we run a MapReduce over it. This requires that we
create a seed at the time that Sample.sample is called. The catch is that
the sample we create won't be truly random: since all of the partitions
use the same seed, they'll all generate the same sequence of random numbers,
which means that we'll see the same "slice" of each partition of the data. That
said, I believe that this lack of randomness is necessary here to preserve the
idea that a PCollection is truly immutable. We could do something fancy here,
like adding a salt based on the task ID, if it ever became a real issue.
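
To make the trade-off concrete, here is a minimal sketch in plain Java rather than the Crunch API. The class and the methods resolveSeed, keptIndices and saltedSeed are hypothetical names made up for illustration; they are not what the patch contains. It shows a null seed being pinned once at call time, two partitions that share a seed keeping the same positions, and one possible way to salt the seed with a task ID.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustration only: hypothetical helpers, not the code in the attached patch.
class SeededSampleSketch {

  /** A null seed means "not given"; pick one now so later runs all agree. 0 is a valid seed. */
  static long resolveSeed(Long seed) {
    return seed != null ? seed : new Random().nextLong();
  }

  /** Positions in [0, size) that a Bernoulli sampler started from this seed would keep. */
  static List<Integer> keptIndices(long seed, int size, double probability) {
    Random rng = new Random(seed);
    List<Integer> kept = new ArrayList<>();
    for (int i = 0; i < size; i++) {
      if (rng.nextDouble() < probability) {
        kept.add(i);
      }
    }
    return kept;
  }

  /** Assumed salting scheme: mix the task ID into the shared seed (still deterministic). */
  static long saltedSeed(long seed, int taskId) {
    return seed ^ (0x9E3779B97F4A7C15L * (taskId + 1));
  }

  public static void main(String[] args) {
    long seed = resolveSeed(null); // fixed once, when the sample is defined

    // Two partitions using the same seed keep the same "slice" of positions.
    List<Integer> partition0 = keptIndices(seed, 1000, 0.05);
    List<Integer> partition1 = keptIndices(seed, 1000, 0.05);
    System.out.println("same slice: " + partition0.equals(partition1));

    // With a per-task salt, each partition draws from its own sequence,
    // but rerunning the same task still reproduces the same sample.
    System.out.println("task 0 keeps: " + keptIndices(saltedSeed(seed, 0), 1000, 0.05));
    System.out.println("task 1 keeps: " + keptIndices(saltedSeed(seed, 1), 1000, 0.05));
  }
}
```

Because the salted seed depends only on the original seed and the task ID, a scheme along these lines would keep the sampled PCollection repeatable, provided the task IDs themselves are stable from run to run.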

In the reservoir sampling case, we don't have this restriction: reservoir
sampling kicks off a MR job, so the PCollection that is returned will be
materialized on disk somewhere, and so the view of it will already be
immutable. Therefore, we are free to be "more" random here, and use a different
Random instance (with a different seed) for all of the partitions of the data.
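
For readers who haven't seen it, the basic single-stream version of reservoir sampling can be sketched as follows. This is the textbook "Algorithm R" in plain Java, not the distributed or weighted implementation in the patch, and the class and method names are hypothetical. The Random is passed in so that each partition can be handed its own instance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustration only: textbook Algorithm R, not the code in the attached patch.
class ReservoirSketch {

  /** Keeps a uniform random sample of up to k items from a stream of unknown length. */
  static <T> List<T> reservoirSample(Iterable<T> items, int k, Random rng) {
    List<T> reservoir = new ArrayList<>(k);
    long seen = 0;
    for (T item : items) {
      seen++;
      if (reservoir.size() < k) {
        // The first k items always go in.
        reservoir.add(item);
      } else {
        // Pick a slot in [0, seen); the item is kept only if the slot is
        // inside the reservoir, which happens with probability k / seen.
        long slot = (long) (rng.nextDouble() * seen);
        if (slot < k) {
          reservoir.set((int) slot, item);
        }
      }
    }
    return reservoir;
  }

  public static void main(String[] args) {
    List<Integer> partition = new ArrayList<>();
    for (int i = 0; i < 100; i++) {
      partition.add(i);
    }
    // Unseeded Random: a different sequence on every run and for every partition.
    System.out.println(reservoirSample(partition, 10, new Random()));
  }
}
```

Note that this only samples within one partition; combining the per-partition reservoirs into a single uniform sample takes an extra step that is not shown here.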

Aside from that, I've javadoc'd everything properly in the attached patch,
mostly via copy and paste, and fixed the '<' characters. Don't apologize for
nits; it's the only way we're ever going to get this stuff cleaned up.
> Add library functions for performing distributed reservoir sampling
> -------------------------------------------------------------------
>
> Key: CRUNCH-178
> URL: https://issues.apache.org/jira/browse/CRUNCH-178
> Project: Crunch
> Issue Type: Improvement
> Components: MapReduce Patterns
> Reporter: Josh Wills
> Attachments: CRUNCH-178b.patch, CRUNCH-178c.patch, CRUNCH-178d.patch,
> CRUNCH-178.patch
>
>
> For a project I've been working on, I wrote up some Crunch functions for
> performing reservoir sampling and weighted reservoir sampling that I think
> would be useful enough to put in lib.*. Here's the paper that I used as a
> reference for the implementations I wrote:
> http://arxiv.org/pdf/1012.0256.pdf