That's a parser bug: it's treating every semicolon as a statement terminator, without paying attention to quoting.
Please open a ticket on the Pig Jira. In the meantime, there's a workaround
(\u003B is the Unicode escape for a semicolon, so the regex still matches ';'
without a literal one appearing in the script):

delimited = FOREACH lines GENERATE FLATTEN (REGEXEXTRACTALL(line, '^(\\d+)\\u003B(\\w+)$')) AS (digit:int,word:chararray);

On Mon, Aug 30, 2010 at 11:38 AM, Christopher Hackman <christopher.hack...@gmail.com> wrote:

> I'm attempting to parse some log files using the RegexExtractAll function in
> the piggybank. Everything was going along swimmingly until I tried to
> include an expression which contains a semi-colon.
>
> Here's the short, reproducible version of what I'm trying to do...
>
> Given an input file:
>
> /test1.txt (in the hdfs)
> 1;a
> 2;b
> 3;c
> 4;d
> 5;e
>
> And the following Pig script:
>
> REGISTER /tmp/piggybank.jar ;
> DEFINE REGEXEXTRACTALL org.apache.pig.piggybank.evaluation.string.RegexExtractAll();
> lines = LOAD '/test1.txt' AS (line:chararray);
> delimited = FOREACH lines GENERATE FLATTEN (
>     REGEXEXTRACTALL(line, '^(\\d+);(\\w+)$')
> ) AS (
>     digit:int,
>     word:chararray
> );
> DUMP delimited;
>
> I receive the following error:
>
> ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing.
> Lexical error at line 5, column 40. Encountered: <EOF> after :
> "\'^(\\\\d+);"
>
> If I change the source file to use commas (or pipes, or dashes, etc...) and
> change the regex accordingly, it works as expected. It looks to me like Pig
> is not parsing the regex string correctly, and is assuming that the
> semi-colon (even though it's part of a quoted string) is an EOL character.
> I've tried escaping the semi-colon, putting another at the end of the
> REGEXEXTRACTALL line, etc... nothing seems to prevent Pig from dying.
>
> Can anyone tell me if I'm missing something obvious?
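
If you want to sanity-check the workaround pattern outside Pig, here is a minimal sketch. It assumes the UDF ends up handing the pattern to java.util.regex, which reads \u003B as a literal semicolon. The class name SemicolonEscapeCheck and the sample lines are made up for illustration; none of this is piggybank code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of Pig or the piggybank.
class SemicolonEscapeCheck {
    // Same pattern text as the workaround. After Java string unescaping the
    // regex engine receives the Unicode escape for U+003B and treats it as ';'.
    static final Pattern LINE = Pattern.compile("^(\\d+)\\u003B(\\w+)$");

    // Returns {digit, word} for a matching line, or null when it does not match.
    static String[] extract(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        // Two lines in the test1.txt format, plus one comma-separated line
        // that should be rejected by the pattern.
        for (String line : new String[] { "1;a", "2;b", "3,c" }) {
            String[] parts = extract(line);
            System.out.println(line + " -> "
                + (parts == null ? "no match" : parts[0] + ", " + parts[1]));
        }
    }
}
```

This only checks the regex side. It says nothing about how the Pig parser tokenizes the script, which is where the actual bug is.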