pig-user  

DISTINCT Problem

Paul O'Leary
Wed, 24 Sep 2008 16:17:48 -0700

Hi All,

I seem to be seeing a problem with the DISTINCT operator. I have a
script that looks like this:

raw_tran_hdr = LOAD 'tran_hdr/tran_header' USING PigStorage('|') AS (
... many fields ... );
tran_hdr_dist = DISTINCT raw_tran_hdr;
b = GROUP tran_hdr_dist ALL;
c = FOREACH b GENERATE COUNT(tran_hdr_dist.$0);

The data set 'tran_hdr/tran_header' has about 7M rows, of which I know
for certain 14 are exact duplicates. When I execute the Pig script
above I get the total row count back; that is, the duplicate rows are
not dropped from the result.
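
To confirm the expected numbers independently of Pig, the counts can be
cross-checked outside the cluster. The following is a minimal Python
sketch, assuming a local copy of the pipe-delimited input and assuming
that "exact duplicate" means the whole line matches character for
character. The function name count_rows is hypothetical and not part of
Pig.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_rows(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (total, distinct, surplus) for an iterable of raw text lines.

    Assumption: two rows are duplicates only if the full lines are equal
    once the line terminator is stripped. Every distinct line is kept in
    memory, so this suits a local sample or a machine with enough RAM.
    """
    counts = Counter(line.rstrip("\r\n") for line in lines)
    total = sum(counts.values())   # every row, duplicates included
    distinct = len(counts)         # rows left after removing duplicates
    return total, distinct, total - distinct


# Tiny illustrative sample (made-up rows, not the real data set).
sample = ["1|a|x\n", "2|b|y\n", "1|a|x\n"]
print(count_rows(sample))  # (3, 2, 1)
```

If DISTINCT is working, the COUNT in the script above would be expected
to match the distinct value here rather than the total.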

There is a thread in the user group about previous DISTINCT problems
that sound just like this one, but JIRA says they are all resolved. The
code I'm using is up to date with trunk (revision 698759), so I'm
assuming I've picked up any fixes.

When (in a different script) I move the DISTINCT into a nested FOREACH,
it fixes (or at least works around) the problem; e.g.:

(after COGROUP)

Z = FOREACH X {
    thd = DISTINCT raw_tran_hdr;
    GENERATE
        FLATTEN( thd.(... many fields .... ) ),
        FLATTEN( sale_line_calc.(... many fields ...) );
}
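
For readers comparing the two forms: a DISTINCT inside a nested FOREACH
removes duplicates within each group's bag rather than across the whole
relation. The Python sketch below illustrates that semantics only; it
is not how Pig implements it, and the name distinct_per_group and the
choice of the first field as the grouping key are assumptions made for
illustration.

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, ...]


def distinct_per_group(rows: Iterable[Row], key_index: int = 0) -> Dict[str, List[Row]]:
    """Group rows by one field, then drop exact duplicates inside each group."""
    groups: Dict[str, List[Row]] = {}
    for row in rows:
        bag = groups.setdefault(row[key_index], [])
        if row not in bag:  # duplicates are removed within this bag only
            bag.append(row)
    return groups


# Made-up rows: the two identical ("k1", "a") rows land in the same group.
rows = [("k1", "a"), ("k1", "a"), ("k2", "b")]
print(distinct_per_group(rows))
```

When the grouping key is one of the row's own fields, exact duplicates
always fall into the same group, so the per-group form should remove the
same rows that a top-level DISTINCT is expected to remove.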

I will continue to try to dig into the problem but any guidance anyone
can provide would be appreciated.  Maybe I'm misunderstanding
something.

As mentioned, I am successfully working around the issue for now, but,
being a data junkie like I know you all are, answers that look
incorrect make me nervous.

BTW, I don't think this is just a counting issue with DISTINCT (as the
previous issues seem to suggest); when I tried to use tran_hdr_dist in
a COGROUP (without counting) I also got wrong results.

Thanks,

PaulO.