That's a parser bug: it's treating every semicolon as a statement
terminator, without paying attention to quoting.

Please open a ticket on the Pig Jira.

In the meantime, there's a workaround:

delimited = FOREACH lines GENERATE FLATTEN (REGEXEXTRACTALL(line,
'^(\\d+)\\u003B(\\w+)$')) AS (digit:int,word:chararray);
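
The workaround above relies on the regex engine, not Pig, turning \u003B
back into a semicolon. As far as I know RegexExtractAll hands the pattern
to java.util.regex, so here is a minimal standalone sketch you can use to
check that the escaped pattern matches your sample lines. The class and
method names are made up for illustration; they are not part of Pig or the
piggybank.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SemicolonEscapeCheck {
    // The pattern as the regex engine should see it once the quoted string
    // has been unescaped: the semicolon is spelled as a hex escape, so no
    // literal semicolon character appears in the pattern text.
    static final Pattern ESCAPED = Pattern.compile("^(\\d+)\\u003B(\\w+)$");

    // Returns the two captured groups, or null if the line does not match.
    static String[] extract(String line) {
        Matcher m = ESCAPED.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        // Sample lines from the test file in the original message.
        for (String line : new String[] { "1;a", "2;b", "3;c" }) {
            String[] parts = extract(line);
            System.out.println(parts[0] + " " + parts[1]);
        }
    }
}
```

If this matches outside Pig but the script still fails, the problem is in
Pig's parsing of the statement rather than in the regex itself.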

On Mon, Aug 30, 2010 at 11:38 AM, Christopher Hackman <
christopher.hack...@gmail.com> wrote:

> I'm attempting to parse some log files using the RegexExtractAll function in
> the piggybank. Everything was going along swimmingly until I tried to
> include an expression which contains a semi-colon.
>
> Here's the short, reproducible version of what I'm trying to do...
>
> Given an input file:
>
> /test1.txt (in the hdfs)
> 1;a
> 2;b
> 3;c
> 4;d
> 5;e
>
>
> And the following Pig script:
>
> REGISTER /tmp/piggybank.jar ;
> DEFINE REGEXEXTRACTALL
> org.apache.pig.piggybank.evaluation.string.RegexExtractAll();
> lines = LOAD '/test1.txt' AS (line:chararray);
> delimited = FOREACH lines GENERATE FLATTEN (
>        REGEXEXTRACTALL(line, '^(\\d+);(\\w+)$')
> ) AS (
>        digit:int,
>        word:chararray
> );
> DUMP delimited;
>
>
> I receive the following error:
>
> ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing.
> Lexical error at line 5, column 40.  Encountered: <EOF> after :
> "\'^(\\\\d+);"
>
>
> If I change the source file to use commas (or pipes, or dashes, etc...) and
> change the regex accordingly, it works as expected. It looks to me like Pig
> is not parsing the regex string correctly, and is assuming that the
> semi-colon (even though it's part of a quoted string) is an EOL character.
> I've tried escaping the semi-colon, putting another at the end of the
> REGEXEXTRACTALL line, etc... nothing seems to prevent Pig from dying.
>
> Can anyone tell me if I'm missing something obvious?
>
