Will Vaughan created DATAFU-11:
----------------------------------

             Summary: ReservoirSample does not behave as expected when grouping by a key other than ALL
                 Key: DATAFU-11
                 URL: https://issues.apache.org/jira/browse/DATAFU-11
             Project: DataFu
          Issue Type: Bug
            Reporter: Will Vaughan


ReservoirSample does not behave as expected when grouping by a key other than ALL.

It appears that the sample is drawn from the full input rather than from each group's input.

Given input:
a1,5
a1,6
a1,7
a2,5
a2,6
a2,7

and the following program:
DEFINE ReservoirSample datafu.pig.sampling.ReservoirSample('2');
data = LOAD 'input.txt' USING PigStorage(',') AS (key: chararray, value: chararray);
grouped = GROUP data BY key;
sample2 = FOREACH grouped GENERATE group, ReservoirSample(data);

The expected output should be similar to:
(a1, {(a1,5),(a1,7)})
(a2, {(a2,5),(a2,7)})

However, the actual output may show up as:
(a1, {(a1,5),(a1,7)})
(a2, {(a1,5),(a1,7)})
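
For reference, the expected per-group semantics can be sketched outside of Pig. The following is a minimal Python illustration, not DataFu's implementation: the function names (reservoir_sample, sample_per_group) and the fixed seed are hypothetical, and the sketch assumes a textbook reservoir sampler (Algorithm R). The point it shows is that each group gets its own fresh reservoir, so a group's sample can only contain tuples from that group.

```python
import random
from collections import defaultdict


def reservoir_sample(items, k, rng):
    """Textbook Algorithm R: uniform random sample of up to k items."""
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # randint is inclusive on both ends, so j is uniform over 0..i.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


def sample_per_group(rows, k, seed=0):
    """Group (key, value) rows by key, then sample each group independently."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append((key, value))
    # A new reservoir is built for every group; nothing is shared across keys.
    return {key: reservoir_sample(members, k, rng)
            for key, members in groups.items()}


rows = [("a1", "5"), ("a1", "6"), ("a1", "7"),
        ("a2", "5"), ("a2", "6"), ("a2", "7")]
samples = sample_per_group(rows, 2)
for key, sample in sorted(samples.items()):
    print(key, sample)
```

Which two tuples are picked for each key depends on the random seed, but every tuple listed under a2 should carry the key a2, which is the property the actual output above violates.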



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
