RE: DISTINCT Problem

Olga Natkovich Thu, 25 Sep 2008 11:18:53 -0700

Is this repeatable?


> -----Original Message-----
> From: Paul O'Leary [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, September 25, 2008 10:50 AM
> To: [email protected]
> Subject: FW: DISTINCT Problem
> 
> Trying to compile types branch to verify/check this problem.
> 
> I get the following compile error and have checked all the obvious
> stuff:
> 
> $ ant compile
> Buildfile: build.xml
> 
> init:
> 
> cc-compile:
>    [javacc] Java Compiler Compiler Version 4.0 (Parser Generator)
>    [javacc] (type "javacc" with no arguments for help)
>    [javacc] Reading from file C:\dev\pig\test\org\apache\pig\test\utils\dotGraph\parser\Dot.jj . . .
>    [javacc] Exception in thread "main" java.lang.Error: Invalid escape character  at line 1 column 97.
>    [javacc]     at org.javacc.parser.JavaCharStream.readChar(Unknown Source)
>    [javacc]     at org.javacc.parser.JavaCCParserTokenManager.getNextToken(Unknown Source)
>    [javacc]     at org.javacc.parser.JavaCCParser.jj_ntk(Unknown Source)
>    [javacc]     at org.javacc.parser.JavaCCParser.javacc_options(Unknown Source)
>    [javacc]     at org.javacc.parser.JavaCCParser.javacc_input(Unknown Source)
>    [javacc]     at org.javacc.parser.Main.mainProgram(Unknown Source)
>    [javacc]     at org.javacc.parser.Main.main(Unknown Source)
> 
> BUILD FAILED
> C:\dev\pig\build.xml:151: C:\Program Files\Java\jdk1.5.0_06\jre\bin\java.exe failed with return code 1
> 
> Total time: 5 seconds
> 
> I am compiling on Windows but I get the same error under cygwin.
> 
> Any ideas?  Thanks for the help.
> PaulO.
> 
> -----Original Message-----
> From: Olga Natkovich [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 24, 2008 4:58 PM
> To: [EMAIL PROTECTED]
> Subject: RE: DISTINCT Problem
> 
> This could be a bug. Can you try it with a pig.jar built from the
> types branch and see if you get the expected results?
> 
> Note that the types branch is still on Hadoop 17 but will move to
> Hadoop 18 later today.
> 
> Olga
> 
> > -----Original Message-----
> > From: Paul O'Leary [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, September 24, 2008 3:57 PM
> > To: [EMAIL PROTECTED]
> > Subject: DISTINCT Problem
> > 
> > Hi All,
> > 
> >  
> > 
> > I seem to be seeing a problem with the DISTINCT operator.  I have a 
> > script that looks like this:
> > 
> >  
> > 
> > raw_tran_hdr = load 'tran_hdr/tran_header' using PigStorage( '|' ) as
> >     ( ... many fields ... );
> > tran_hdr_dist = DISTINCT raw_tran_hdr;
> > b = GROUP tran_hdr_dist ALL;
> > c = FOREACH b GENERATE COUNT(tran_hdr_dist.$0);
> > 
> >  
> > 
> > The data set 'tran_hdr/tran_header' has about 7M rows, of which I
> > know for certain 14 are exact duplicates.  When I execute the Pig
> > script above I get the total row count; that is, the number returned
> > doesn't correctly drop out the duplicate rows.
> > 
> >  
> > 
> > There is a thread in the user group about previous DISTINCT problems
> > that sound just like this, but JIRA says they're all resolved.  The
> > code I'm using is up-to-date with the trunk (@ revision 698759) so
> > I'm assuming I've picked up any fixes.
> > 
> >  
> > 
> > When (in a different script) I move the DISTINCT into a nested
> > FOREACH it fixes (or at least works around) the problem; e.g.:
> > 
> >  
> > 
> > (after COGROUP)
> > 
> >  
> > 
> > Z = FOREACH X
> > {
> >     thd = DISTINCT raw_tran_hdr;
> >     GENERATE
> >         FLATTEN( thd.(... many fields .... ) ),
> >         FLATTEN( sale_line_calc.(... many fields ...) );
> > }
> > 
> >  
> > 
> > I will continue to try to dig into the problem, but any guidance
> > anyone can provide would be appreciated.  Maybe I'm misunderstanding
> > something.
> > 
> > As mentioned, I am successfully working around the issue right now
> > but - as data junkies, like I know you all are - answers that look
> > incorrect make me nervous.
> > 
> >  
> > 
> > BTW, I don't think this is just a counting issue with DISTINCT (as
> > the previous issues seem to suggest); when I tried to use
> > tran_hdr_dist to do a COGROUP (without counting) I got wrong results.
> > 
> >  
> > 
> > Thanks,
> > 
> > PaulO.
> > 
> > 
> 
>
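
The result expected from the original script (DISTINCT, then GROUP ALL, then COUNT) can be cross-checked outside Pig. The following is only a minimal sketch in Python, not anything from the thread: the helper names (parse_line, distinct_count, duplicate_report) and the three sample rows are hypothetical, and it assumes '|'-delimited lines as in the PigStorage( '|' ) call, with whole rows compared for equality.

```python
from collections import Counter


def parse_line(line, delimiter="|"):
    """Split one delimited line into a tuple of fields (hashable, so it can go in a set)."""
    return tuple(line.rstrip("\n").split(delimiter))


def distinct_count(rows):
    """Number of rows left once exact duplicates are dropped."""
    return len(set(rows))


def duplicate_report(rows):
    """Map each row that occurs more than once to its number of occurrences."""
    return {row: n for row, n in Counter(rows).items() if n > 1}


# Hypothetical sample data: the first and third lines are exact duplicates.
sample_lines = [
    "1001|2008-09-01|19.99\n",
    "1002|2008-09-01|5.00\n",
    "1001|2008-09-01|19.99\n",
]

rows = [parse_line(line) for line in sample_lines]
total = len(rows)
distinct = distinct_count(rows)

# If DISTINCT behaves as expected, the count after it should equal
# `distinct`, not `total`.
print("total rows:", total)
print("distinct rows:", distinct)
print("duplicated rows:", duplicate_report(rows))
```

For a real file, the rows would be built by applying parse_line to each line of the input instead of the sample list. This keeps every distinct row in memory, so it is meant for a sample or a modest extract rather than the full data set.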
