[ https://issues.apache.org/jira/browse/DATAFU-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880104#comment-13880104 ]
Matthew Hayes commented on DATAFU-11:
-------------------------------------
In Initial.exec, the reservoir is only used when the input bag has more tuples
than the desired sample size. Tuples are usually passed into Initial.exec one
at a time, so in the common case we avoid both using the reservoir and clearing
it. Clearing is expensive because it iterates through the backing array.

For accumulate, we only need to clear the reservoir in cleanup; clearing it
anywhere else would mean it no longer behaves as an accumulator :)

Now that I'm looking at it, I see we are nulling out the reservoir in cleanup
instead of clearing it. I think clearing would be more efficient because it
leaves less garbage to collect. I'll change this.
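
To illustrate the accumulate/cleanup point, here is a minimal sketch in plain
Java. It is not the DataFu code: Reservoir and SampleAccumulator are
hypothetical names, strings stand in for Pig tuples, and only the
accumulate / getValue / cleanup cycle is meant to mirror how an accumulator is
driven. The reservoir keeps its state across accumulate calls and is reset
only in cleanup, by clearing the existing list rather than nulling it out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical stand-in for the UDF's reservoir; not DataFu's actual class. */
class Reservoir<T> {
    private final int capacity;
    private final List<T> items;
    private final Random random;
    private int seen = 0; // items considered since the last clear()

    Reservoir(int capacity, long seed) {
        this.capacity = capacity;
        this.items = new ArrayList<>(capacity);
        this.random = new Random(seed);
    }

    /** Classic reservoir step: fill up first, then replace a slot at random. */
    void consider(T item) {
        seen++;
        if (items.size() < capacity) {
            items.add(item);
        } else {
            int j = random.nextInt(seen); // uniform in [0, seen)
            if (j < capacity) {
                items.set(j, item);
            }
        }
    }

    /** Returns a copy so callers never hold a reference to internal state. */
    List<T> snapshot() {
        return new ArrayList<>(items);
    }

    /** Resets state but keeps the list object, so nothing new is allocated. */
    void clear() {
        items.clear();
        seen = 0;
    }
}

/** Hypothetical accumulator: state persists across accumulate() calls. */
class SampleAccumulator {
    private final Reservoir<String> reservoir;

    SampleAccumulator(int sampleSize, long seed) {
        this.reservoir = new Reservoir<>(sampleSize, seed);
    }

    /** May be called several times for one group, one batch at a time. */
    void accumulate(List<String> batch) {
        for (String row : batch) {
            reservoir.consider(row);
        }
    }

    List<String> getValue() {
        return reservoir.snapshot();
    }

    /** Called once a group is finished; the only place the reservoir resets. */
    void cleanup() {
        reservoir.clear();
    }
}
```

If cleanup did not reset the reservoir, rows from one key could survive into
the sample returned for the next key.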
> ReservoirSample does not behave as expected when grouping by a key other than
> ALL
> ---------------------------------------------------------------------------------
>
> Key: DATAFU-11
> URL: https://issues.apache.org/jira/browse/DATAFU-11
> Project: DataFu
> Issue Type: Bug
> Reporter: Will Vaughan
> Assignee: Matthew Hayes
> Attachments: DATAFU-11.patch
>
>
> Reported by Barbara Mucha ([Issue #92 on
> GitHub|https://github.com/linkedin/datafu/issues/92]):
> ReservoirSample does not behave as expected when grouping by a key other than
> ALL.
> It appears that the sample is taken from the full input instead of from each
> group's input.
> Given input:
> {noformat}
> a1,5
> a1,6
> a1,7
> a2,5
> a2,6
> a2,7
> {noformat}
> with the following program
> {noformat}
> DEFINE ReservoirSample datafu.pig.sampling.ReservoirSample('2');
> data = LOAD 'input.txt' USING PigStorage(',') AS (key: chararray, value: chararray);
> grouped = GROUP data BY key;
> sample2 = FOREACH grouped GENERATE group, ReservoirSample(data);
> {noformat}
> the expected output should be similar to
> {noformat}
> (a1, {(a1,5),(a1,7)})
> (a2, {(a2,5),(a2,7)})
> {noformat}
> However, actual output may show up as
> {noformat}
> (a1, {(a1,5),(a1,7)})
> (a2, {(a1,5),(a1,7)})
> {noformat}