I'm trying to compile the types branch to check this problem. I get the following compile error and have checked all the obvious stuff:

$ ant compile
Buildfile: build.xml

init:

cc-compile:
   [javacc] Java Compiler Compiler Version 4.0 (Parser Generator)
   [javacc] (type "javacc" with no arguments for help)
   [javacc] Reading from file C:\dev\pig\test\org\apache\pig\test\utils\dotGraph\parser\Dot.jj . . .
   [javacc] Exception in thread "main" java.lang.Error: Invalid escape character at line 1 column 97.
   [javacc]     at org.javacc.parser.JavaCharStream.readChar(Unknown Source)
   [javacc]     at org.javacc.parser.JavaCCParserTokenManager.getNextToken(Unknown Source)
   [javacc]     at org.javacc.parser.JavaCCParser.jj_ntk(Unknown Source)
   [javacc]     at org.javacc.parser.JavaCCParser.javacc_options(Unknown Source)
   [javacc]     at org.javacc.parser.JavaCCParser.javacc_input(Unknown Source)
   [javacc]     at org.javacc.parser.Main.mainProgram(Unknown Source)
   [javacc]     at org.javacc.parser.Main.main(Unknown Source)

BUILD FAILED
C:\dev\pig\build.xml:151: C:\Program Files\Java\jdk1.5.0_06\jre\bin\java.exe failed with return code 1

Total time: 5 seconds

I am compiling on Windows, but I get the same error under cygwin.

Any ideas? Thanks for the help.

PaulO.

-----Original Message-----
From: Olga Natkovich [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 24, 2008 4:58 PM
To: [EMAIL PROTECTED]
Subject: RE: DISTINCT Problem

This could be a bug. Can you try it with pig.jar built from type branch and see if you get the expected results? Note that type branch is still on Hadoop 17 but will move to Hadoop 18 later today.

Olga

> -----Original Message-----
> From: Paul O'Leary [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 24, 2008 3:57 PM
> To: [EMAIL PROTECTED]
> Subject: DISTINCT Problem
>
> Hi All,
>
> I seem to be seeing a problem with the DISTINCT operator. I
> have a script that looks like this:
>
> raw_tran_hdr = load 'tran_hdr/tran_header' using PigStorage( '|' ) as ( ... many fields ... );
> tran_hdr_dist = DISTINCT raw_tran_hdr;
> b = GROUP tran_hdr_dist ALL;
> c = FOREACH b GENERATE COUNT(tran_hdr_dist.$0);
>
> The data set 'tran_hdr/tran_header' has about 7M rows, of
> which I know for certain 14 are exact duplicates. When I
> execute the Pig script above I get the total row count; that
> is, the number returned doesn't correctly drop out the duplicate rows.
>
> There is a thread in the user group about previous DISTINCT
> problems that sound just like this, but JIRA says they're all
> resolved. The code I'm using is up-to-date with the trunk (@
> revision 698759), so I'm assuming I've picked up any fixes.
>
> When (in a different script) I move the DISTINCT into a
> nested FOREACH, it fixes (or at least works around) the problem; e.g.:
>
> (after COGROUP)
>
> Z = FOREACH X
> {
>     thd = DISTINCT raw_tran_hdr;
>     GENERATE
>         FLATTEN( thd.(... many fields .... ) ),
>         FLATTEN( sale_line_calc.(... many fields ...) );
> }
>
> I will continue to try to dig into the problem, but any
> guidance anyone can provide would be appreciated. Maybe I'm
> misunderstanding something.
>
> As mentioned, I am successfully working around the issue
> right now but - as a data junkie, like I know you all are -
> answers that look incorrect make me nervous.
>
> BTW, I don't think this is just a counting issue with
> DISTINCT (as the previous issues seem to allude to); when I
> tried to use tran_hdr_dist to do a COGROUP (without counting)
> I got wrong results.
>
> Thanks,
> PaulO.
