Olga Natkovich
Wed, 24 Sep 2008 16:59:35 -0700
This could be a bug. Can you try it with a pig.jar built from the type
branch and see if you get the expected results?
Note that type branch is still on Hadoop 17 but will move to Hadoop 18
later today.
Olga
> -----Original Message-----
> From: Paul O'Leary [EMAIL PROTECTED]
> Sent: Wednesday, September 24, 2008 3:57 PM
> To: pig-user@incubator.apache.org
> Subject: DISTINCT Problem
>
> Hi All,
>
>
>
> I seem to be seeing a problem with the DISTINCT operator. I
> have a script that looks like this:
>
>
>
> raw_tran_hdr = load 'tran_hdr/tran_header' using PigStorage('|')
>     as ( ... many fields ... );
> tran_hdr_dist = DISTINCT raw_tran_hdr;
> b = GROUP tran_hdr_dist ALL;
> c = FOREACH b GENERATE COUNT(tran_hdr_dist.$0);
>
>
>
> The data set 'tran_hdr/tran_header' has about 7M rows, of
> which I know for certain 14 are exact duplicates. When I
> execute the Pig script above I get the total row count; that
> is, the number returned still includes the duplicate rows.
>
>
>
> There is a thread in the user group about previous DISTINCT
> problems that sound just like this but JIRA says they're all
> resolved. The code I'm using is up-to-date with the trunk (@
> revision 698759) so I'm assuming I've picked up any fixes.
>
>
>
> When (in a different script) I move the DISTINCT into a
> nested FOREACH it fixes (or at least works around) the problem; e.g.:
>
>
>
> (after COGROUP)
>
>
>
> Z = FOREACH X
> {
>     thd = DISTINCT raw_tran_hdr;
>     GENERATE
>         FLATTEN( thd.(... many fields .... ) ),
>         FLATTEN( sale_line_calc.(... many fields ...) );
> }
>
>
>
> I will continue to try to dig into the problem but any
> guidance anyone can provide would be appreciated. Maybe I'm
> misunderstanding something.
>
> As mentioned, I am successfully working around the issue
> right now but, as a data junkie (like I know you all are),
> answers that look incorrect make me nervous.
>
>
>
> BTW, I don't think this is just a counting issue with
> DISTINCT (as the previous issues seem to suggest); when I
> tried to use tran_hdr_dist to do a COGROUP (without counting)
> I got wrong results.
>
>
>
> Thanks,
>
> PaulO.
>
>
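One way to narrow this down, as a rough sketch rather than a confirmed diagnosis: count the rows with and without DISTINCT in the same script and compare the two numbers. The aliases raw_grp, raw_cnt, dist_grp and dist_cnt are made up for illustration, and the load leaves out the AS clause because the field list is elided in the message above, so fields are referenced by position only.

```pig
-- Load without a schema; the real field list is not shown in the thread.
raw_tran_hdr = LOAD 'tran_hdr/tran_header' USING PigStorage('|');

-- Baseline: total row count with no de-duplication.
raw_grp = GROUP raw_tran_hdr ALL;
raw_cnt = FOREACH raw_grp GENERATE COUNT(raw_tran_hdr.$0);

-- Same count after a top-level DISTINCT, as in the original script.
tran_hdr_dist = DISTINCT raw_tran_hdr;
dist_grp = GROUP tran_hdr_dist ALL;
dist_cnt = FOREACH dist_grp GENERATE COUNT(tran_hdr_dist.$0);

DUMP raw_cnt;
DUMP dist_cnt;
```

If DISTINCT is removing the duplicate rows, the second count should come out lower than the first. Equal counts on data known to contain exact duplicates would be consistent with the behaviour described in the message.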