Will Vaughan created DATAFU-11:
----------------------------------

             Summary: ReservoirSample does not behave as expected when grouping by a key other than ALL
                 Key: DATAFU-11
                 URL: https://issues.apache.org/jira/browse/DATAFU-11
             Project: DataFu
          Issue Type: Bug
            Reporter: Will Vaughan

ReservoirSample does not behave as expected when grouping by a key other than ALL. It appears that the sample is taken from the full input instead of from each group's input.

Given input:

    a1,5
    a1,6
    a1,7
    a2,5
    a2,6
    a2,7

with the following program:

    DEFINE ReservoirSample datafu.pig.sampling.ReservoirSample('2');
    data = LOAD 'input.txt' USING PigStorage(',') AS (key: chararray, value: chararray);
    grouped = GROUP data BY key;
    sample2 = FOREACH grouped GENERATE ReservoirSample(data);

the expected output should be similar to:

    (a1, {(a1,5),(a1,7)})
    (a2, {(a2,5),(a2,7)})

However, the actual output may show up as:

    (a1, {(a1,5),(a1,7)})
    (a2, {(a1,5),(a1,7)})

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
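
To make the expected behaviour concrete, here is a minimal Python sketch of per-group reservoir sampling. It is only a model of the semantics described above, not DataFu's implementation: the function names `reservoir_sample` and `sample_per_group`, the fixed seed, and the use of the textbook "Algorithm R" are all assumptions made for illustration. The property it demonstrates is that every tuple in a group's sample carries that group's key.

```python
import random
from collections import defaultdict


def reservoir_sample(items, k, rng):
    """Uniformly sample up to k items from an iterable (textbook Algorithm R)."""
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def sample_per_group(rows, k, seed=0):
    """Group (key, value) rows by key, then sample each group's bag on its own."""
    rng = random.Random(seed)  # hypothetical fixed seed, for repeatability only
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append((key, value))
    # A fresh reservoir is built for every group, so no state leaks across keys.
    return {key: reservoir_sample(bag, k, rng) for key, bag in groups.items()}


rows = [("a1", "5"), ("a1", "6"), ("a1", "7"),
        ("a2", "5"), ("a2", "6"), ("a2", "7")]
samples = sample_per_group(rows, 2)
for key, bag in samples.items():
    print(key, bag)
```

Which two tuples are picked for each key depends on the random draws, but a sample for a2 should never contain a tuple keyed a1, which is what the actual output above shows.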