I'm attempting to parse some log files using the RegexExtractAll function in
Piggybank. Everything was going swimmingly until I tried to
use an expression that contains a semicolon.

Here's the short, reproducible version of what I'm trying to do...

Given an input file:

/test1.txt (in HDFS):
1;a
2;b
3;c
4;d
5;e


And the following Pig script:

REGISTER /tmp/piggybank.jar ;
DEFINE REGEXEXTRACTALL org.apache.pig.piggybank.evaluation.string.RegexExtractAll();
lines = LOAD '/test1.txt' AS (line:chararray);
delimited = FOREACH lines GENERATE FLATTEN (
        REGEXEXTRACTALL(line, '^(\\d+);(\\w+)$')
) AS (
        digit:int,
        word:chararray
);
DUMP delimited;


I receive the following error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing.
Lexical error at line 5, column 40.  Encountered: <EOF> after :
"\'^(\\\\d+);"


If I change the source file to use commas (or pipes, or dashes, etc.) and
change the regex accordingly, it works as expected. It looks to me like Pig
is not parsing the regex string correctly and is treating the semicolon
(even though it's inside a quoted string) as the end of the statement.
I've tried escaping the semicolon, putting another one at the end of the
REGEXEXTRACTALL line, etc. Nothing seems to prevent Pig from dying.
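
For what it's worth, the pattern itself appears to be fine when tried outside
Pig, which is why I suspect the script parser rather than the regex. Below is
a minimal sketch of that sanity check in Python. It assumes Python's re module
treats this simple pattern the same way the Java regex engine would, and
extract_all is just a hypothetical helper name, not anything from Piggybank.

```python
import re

# Same pattern as in the Pig script, written as a Python raw string
# (the '\\d' in the Pig string literal corresponds to \d here).
PATTERN = re.compile(r'^(\d+);(\w+)$')


def extract_all(line):
    """Return the capture groups as a tuple, or None if the line does not match."""
    match = PATTERN.match(line)
    return match.groups() if match else None


# The five sample lines from /test1.txt.
sample = ["1;a", "2;b", "3;c", "4;d", "5;e"]
rows = [extract_all(line) for line in sample]

for row in rows:
    print(row)
```

Every sample line matches and yields a (digit, word) pair, so the semicolon
only seems to be a problem once the expression sits inside the Pig script.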

Can anyone tell me if I'm missing something obvious?
