Is this repeatable?
> -----Original Message-----
> From: Paul O'Leary [mailto:[EMAIL PROTECTED]
> Sent: Thursday, September 25, 2008 10:50 AM
> To: [email protected]
> Subject: FW: DISTINCT Problem
>
> Trying to compile types branch to verify/check this problem.
>
> I get the following compile error and have checked all the obvious
> stuff:
>
> $ ant compile
> Buildfile: build.xml
>
> init:
>
> cc-compile:
>     [javacc] Java Compiler Compiler Version 4.0 (Parser Generator)
>     [javacc] (type "javacc" with no arguments for help)
>     [javacc] Reading from file C:\dev\pig\test\org\apache\pig\test\utils\dotGraph\parser\Dot.jj . . .
>     [javacc] Exception in thread "main" java.lang.Error: Invalid escape character at line 1 column 97.
>     [javacc]     at org.javacc.parser.JavaCharStream.readChar(Unknown Source)
>     [javacc]     at org.javacc.parser.JavaCCParserTokenManager.getNextToken(Unknown Source)
>     [javacc]     at org.javacc.parser.JavaCCParser.jj_ntk(Unknown Source)
>     [javacc]     at org.javacc.parser.JavaCCParser.javacc_options(Unknown Source)
>     [javacc]     at org.javacc.parser.JavaCCParser.javacc_input(Unknown Source)
>     [javacc]     at org.javacc.parser.Main.mainProgram(Unknown Source)
>     [javacc]     at org.javacc.parser.Main.main(Unknown Source)
>
> BUILD FAILED
> C:\dev\pig\build.xml:151: C:\Program Files\Java\jdk1.5.0_06\jre\bin\java.exe failed with return code 1
>
> Total time: 5 seconds
>
> I am compiling on Windows but I get the same error under cygwin.
>
> Any ideas? Thanks for the help.
> PaulO.
>
> -----Original Message-----
> From: Olga Natkovich [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 24, 2008 4:58 PM
> To: [EMAIL PROTECTED]
> Subject: RE: DISTINCT Problem
>
> This could be a bug. Can you try it with pig.jar build from
> type branch and see if you get the expected results?
>
> Note that type branch is still on Hadoop 17 but will move to
> Hadoop 18 later today.
>
> Olga
>
> > -----Original Message-----
> > From: Paul O'Leary [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, September 24, 2008 3:57 PM
> > To: [EMAIL PROTECTED]
> > Subject: DISTINCT Problem
> >
> > Hi All,
> >
> > I seem to be seeing a problem with the DISTINCT operator. I have a
> > script that looks like this:
> >
> > raw_tran_hdr = load 'tran_hdr/tran_header' using PigStorage( '|' ) as ( ... many fields ... );
> >
> > tran_hdr_dist = DISTINCT raw_tran_hdr;
> >
> > b = GROUP tran_hdr_dist ALL;
> >
> > c = FOREACH b GENERATE COUNT(tran_hdr_dist.$0);
> >
> > The data set 'tran_hdr/tran_header' has about 7M rows of which I know
> > for certain 14 are exact duplicates. When I execute the Pig script
> > above I get the total row count; that is, the number returned doesn't
> > correctly drop out the duplicate rows.
> >
> > There is a thread in the user group about previous DISTINCT problems
> > that sound just like this but JIRA says they're all resolved. The
> > code I'm using is up-to-date with the trunk (@ revision 698759) so I'm
> > assuming I've picked up any fixes.
> >
> > When (in a different script) I move the DISTINCT into a nested FOREACH
> > it fixes (or at least works-around) the problem; e.g.:
> >
> > (after COGROUP)
> >
> > Z = FOREACH X
> > {
> >     thd = DISTINCT raw_tran_hdr;
> >     GENERATE
> >         FLATTEN( thd.(... many fields .... ) ),
> >         FLATTEN( sale_line_calc.(... many fields ...) );
> > }
> >
> > I will continue to try to dig into the problem but any guidance anyone
> > can provide would be appreciated. Maybe I'm misunderstanding
> > something.
> >
> > As mentioned, I am successfully working around the issue right now but
> > - as a data junkie like I know you all are - answers that look
> > incorrect make me nervous.
> >
> > BTW, I don't think this is just a counting issue with DISTINCT (as the
> > previous issues seem to allude to); when I tried to use tran_hdr_dist
> > to do a COGROUP (without counting) I got wrong results.
> >
> > Thanks,
> >
> > PaulO.
