Paul O'Leary
Wed, 24 Sep 2008 16:17:48 -0700
Hi All,
I seem to be seeing a problem with the DISTINCT operator. I have a
script that looks like this:

    raw_tran_hdr = load 'tran_hdr/tran_header' using PigStorage( '|' ) as (
        ... many fields ... );
    tran_hdr_dist = DISTINCT raw_tran_hdr;
    b = GROUP tran_hdr_dist ALL;
    c = FOREACH b GENERATE COUNT(tran_hdr_dist.$0);

The data set 'tran_hdr/tran_header' has about 7M rows, of which I know
for certain 14 are exact duplicates. When I execute the Pig script
above I get the total row count; that is, the number returned still
includes the duplicate rows instead of dropping them.
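
A minimal sketch of a side-by-side check, assuming the same input path
and delimiter as the script above. The alias names (raw, dist, raw_cnt,
dist_cnt and so on) are made up for illustration, and the schema is left
out on purpose since the field list is not shown here. If DISTINCT is
behaving, the second count should be smaller than the first:

```pig
-- Hypothetical aliases; path and delimiter taken from the script above.
raw      = LOAD 'tran_hdr/tran_header' USING PigStorage('|');

-- Row count before de-duplication.
raw_grp  = GROUP raw ALL;
raw_cnt  = FOREACH raw_grp GENERATE COUNT(raw.$0);

-- Row count after de-duplication.
dist     = DISTINCT raw;
dist_grp = GROUP dist ALL;
dist_cnt = FOREACH dist_grp GENERATE COUNT(dist.$0);

DUMP raw_cnt;
DUMP dist_cnt;
```

If both DUMPs print the same number, the duplicates are surviving the
top-level DISTINCT, which is the behaviour described above.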
There is a thread in the user group about previous DISTINCT problems
that sound just like this, but JIRA says they're all resolved. The code
I'm using is up-to-date with the trunk (@ revision 698759), so I'm
assuming I've picked up any fixes.
When (in a different script) I move the DISTINCT into a nested FOREACH
it fixes (or at least works around) the problem; e.g.:

    (after COGROUP)
    Z = FOREACH X
    {
        thd = DISTINCT raw_tran_hdr;
        GENERATE
            FLATTEN( thd.(... many fields .... ) ),
            FLATTEN( sale_line_calc.(... many fields ...) );
    }

I will continue to try to dig into the problem but any guidance anyone
can provide would be appreciated. Maybe I'm misunderstanding
something.
As mentioned, I am successfully working around the issue right now, but
as a data junkie (like I know you all are) I get nervous when answers
look incorrect.
BTW, I don't think this is just a counting issue with DISTINCT (as the
previous issues seem to suggest); when I tried to use tran_hdr_dist in
a COGROUP (without counting) I also got wrong results.
Thanks,
PaulO.