SAMPLE should not be pushed above DISTINCT
------------------------------------------
Key: PIG-2137
URL: https://issues.apache.org/jira/browse/PIG-2137
Project: Pig
Issue Type: Bug
Affects Versions: 0.8.1, 0.8.0, 0.9.0, 0.10
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Critical
I have an input file that contains 50,000 distinct integers. Each integer is
repeated twice, for a total of 100,000 lines.
Script 1, using GROUP BY to get distinct entries in the data, works:
{code}
grunt> f = load 'tmp/dupnumbers.txt';
grunt> d = foreach (group f by $0) generate group;
grunt> s = sample d 0.01;
grunt> n = foreach (group s all) generate COUNT(s);
grunt> dump n;
(493)
{code}
Script 2, using DISTINCT for the same purpose, allows sampling to be done
before DISTINCT:
grunt> f = load 'tmp/dupnumbers.txt';
grunt> d = distinct f;
grunt> s = sample d 0.01;
grunt> n = foreach (group s all) generate COUNT(s);
(980)
{code}
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira