[ https://issues.apache.org/jira/browse/PIG-2137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054195#comment-13054195 ]
Dmitriy V. Ryaboy commented on PIG-2137:
----------------------------------------
For 0.8 I was going to backport PIG-2014 before this one. We are running both
in production right now (on top of 0.8.1), and they are fine.
I did have trouble backporting the tests, though, since a bunch of the optimizer
interfaces seem to have changed. I don't think 0.8 is as important, since it
doesn't seem likely we'll release 0.8.2, what with 0.9.0 being almost out the door.
> SAMPLE should not be pushed above DISTINCT
> ------------------------------------------
>
> Key: PIG-2137
> URL: https://issues.apache.org/jira/browse/PIG-2137
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.8.0, 0.8.1, 0.9.0, 0.10
> Reporter: Dmitriy V. Ryaboy
> Assignee: Dmitriy V. Ryaboy
> Priority: Critical
> Attachments: PIG-2137.1.patch, PIG-2137.2.patch, PIG-2137.patch
>
>
> I have an input file that contains 50,000 distinct integers. Each integer is
> repeated twice, for a total of 100,000 lines.
> Script 1, using GROUP BY to get distinct entries in the data, works:
> {code}
> grunt> f = load 'tmp/dupnumbers.txt';
> grunt> d = foreach (group f by $0) generate group;
> grunt> s = sample d 0.01;
> grunt> n = foreach (group s all) generate COUNT(s);
> grunt> dump n;
> (493)
> {code}
> Script 2, using DISTINCT for the same purpose, gives a wrong count because
> sampling is done before DISTINCT:
> {code}
> grunt> f = load 'tmp/dupnumbers.txt';
> grunt> d = distinct f;
> grunt> s = sample d 0.01;
> grunt> n = foreach (group s all) generate COUNT(s);
> grunt> dump n;
> (980)
> {code}
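
The two counts in the description are consistent with SAMPLE being applied before DISTINCT. With a 1% sample, deduplicating first should leave about 50,000 x 0.01 = 500 rows, while sampling the 100,000 raw lines first and deduplicating afterwards should leave about 50,000 x (1 - 0.99^2) = 995 rows, since a value survives if either of its two copies is picked. The following is a minimal Python simulation of that arithmetic, not Pig code: the helper names `sample` and `distinct` are hypothetical, and it assumes SAMPLE behaves like an independent coin flip per row.

```python
import random


def sample(rows, p, rng):
    """Keep each row independently with probability p (assumed SAMPLE semantics)."""
    return [r for r in rows if rng.random() < p]


def distinct(rows):
    """Drop duplicate rows, keeping first occurrences."""
    return list(dict.fromkeys(rows))


rng = random.Random(42)  # fixed seed so the run is repeatable

# 50,000 distinct integers, each repeated twice: 100,000 rows in total.
rows = [i for i in range(50_000) for _ in range(2)]

# Intended order: DISTINCT first, then SAMPLE.
distinct_then_sample = len(sample(distinct(rows), 0.01, rng))

# Order after the faulty push-up: SAMPLE first, then DISTINCT.
sample_then_distinct = len(distinct(sample(rows, 0.01, rng)))

# Expected values under the coin-flip assumption.
expected_correct = 50_000 * 0.01
expected_pushed = 50_000 * (1 - (1 - 0.01) ** 2)

print("distinct then sample:", distinct_then_sample, "expected about", expected_correct)
print("sample then distinct:", sample_then_distinct, "expected about", round(expected_pushed))
```

The reported 493 and 980 sit close to these two expectations, which is what suggests the reordering rather than a problem in DISTINCT itself.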