pig-user  

Re: Escape characters in Pig Queries

Mridul Muralidharan
Thu, 10 Apr 2008 02:39:11 -0700

Mridul Muralidharan wrote:

Hi Michael,

Not sure about the character escaping, but I do have my UDFs in jars independent of the pig jars, and that works fine for me. You might want to check for path issues?

Also check whether the UDF has an empty (no-argument) constructor, or no explicit constructor at all.
IIRC Pig uses the no-argument constructor to create the UDF.
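
To make the two checks above concrete, here is a minimal sketch in plain Java. It assumes, as recalled above, that the UDF is created reflectively through its no-argument constructor. The class names are hypothetical stand-ins, not Pig or telespree code. The helper separates a class that cannot be found (a path or jar problem) from a class that has no public no-argument constructor.

```java
import java.lang.reflect.Constructor;

class UdfInstantiationCheck {
    // Hypothetical UDF stand-in: no explicit constructor, so Java supplies a public no-arg one.
    public static class WithDefaultCtor { }

    // Hypothetical UDF stand-in: only a parameterized constructor, so no no-arg constructor exists.
    public static class ArgsOnlyCtor {
        public ArgsOnlyCtor(String splitRegex) { }
    }

    // Tries to load the class by name and create it with zero arguments.
    static String diagnose(String className) {
        try {
            Class<?> cls = Class.forName(className);
            Constructor<?> ctor = cls.getConstructor(); // public no-arg constructor only
            ctor.newInstance();
            return "ok";
        } catch (ClassNotFoundException e) {
            return "class not found";
        } catch (NoSuchMethodException e) {
            return "no public no-arg constructor";
        } catch (ReflectiveOperationException e) {
            return "constructor failed";
        }
    }

    public static void main(String[] args) {
        System.out.println(diagnose(WithDefaultCtor.class.getName()));
        System.out.println(diagnose(ArgsOnlyCtor.class.getName()));
        System.out.println(diagnose("no.such.Udf"));
    }
}
```

Running the same kind of check against the real UDF class, with only the registered jar on the classpath, would show which of the two causes applies.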

Mridul


Regards,
Mridul

Michael Harris wrote:
I guess my last message was obvious/stupid since I am not getting any
responses, but hopefully I won't be 0/2.

I love using Pig and I think it's a fantastic tool for creating complex
map-reduce programs quickly. That said, I am having two problems in
addition to the one below. Hopefully I am just missing something easy
and someone can shoot me a quick response.

I have written my own eval func that extracts events from our event log.
It then splits the event by some arbitrary regex and then finds the last
match from that event that does not match another regex. The queries are
as follows.

eventlog = LOAD '/user/hadoop/index8mbGZnotes/{1205478000254_1205857683529.gz,1205857686408_1206295646386.gz,1206295646442_1206757710701.gz,1206757712403_1207113039900.gz,1207113039930_1207205997234.gz}' USING PigStorage('    ');
filterDate = FILTER eventlog BY $1 >= '1204358400000' AND $1 <= '1209625200000';
filterCh = FILTER filterDate BY $15 eq 'Sony' OR $15 eq 'Dell' OR $15 eq 'HP';
filter1 = FILTER filterCh BY ($5 == 11 AND $6 == 15 AND $7 == 406);
filtered = FOREACH filter1 GENERATE LastPageExtractor($8,'.*(ui/cancel.*)|(.*ui/error.*)','[0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{2}/[0-9]{2}/[0-9]{4}'), $15;
grouped = GROUP filtered BY ($0, $1);
resultUnordered = FOREACH grouped GENERATE FLATTEN(group), FLATTEN(COUNT(filtered)) PARALLEL 14;

The func is LastPageExtractor(inputValue, excludeRegex, splitRegex)
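
For readers following along, the behaviour described above (split the event by one regex, then take the last piece that does not match the exclude regex) can be sketched in plain Java. This is an illustration under assumptions, not the actual LastPageExtractor source: the class name, the trimming of pieces, and returning null when nothing qualifies are all guesses.

```java
import java.util.regex.Pattern;

class LastPageSketch {
    // Splits inputValue by splitRegex and returns the last non-empty piece
    // that does NOT fully match excludeRegex, or null if there is none.
    static String lastPage(String inputValue, String excludeRegex, String splitRegex) {
        if (inputValue == null) {
            return null;
        }
        Pattern exclude = Pattern.compile(excludeRegex);
        String[] pieces = inputValue.split(splitRegex);
        // Walk backwards so the first acceptable piece found is the last one in the event.
        for (int i = pieces.length - 1; i >= 0; i--) {
            String piece = pieces[i].trim();
            if (!piece.isEmpty() && !exclude.matcher(piece).matches()) {
                return piece;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String event = "ui/home 12:00:01 03/14/2008 ui/plans 12:00:05 03/14/2008 ui/cancel?x=1";
        String exclude = ".*(ui/cancel.*)|(.*ui/error.*)";
        String split = "[0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{2}/[0-9]{2}/[0-9]{4}";
        System.out.println(lastPage(event, exclude, split));
    }
}
```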

This all works fine, but I would like to change my split regex to
\\|+[0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{2}/[0-9]{2}/[0-9]{4} , however when
I do that I get this :

Exception in thread "Thread-6"
org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at
line 1, column 93.  Encountered: "|" (124), after : "\'\\"

Is there some special escape sequence I should know about? I searched
escape in PigLatin Wiki and found nothing.
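
One possible workaround, not confirmed anywhere in this thread, is to avoid the backslash entirely by writing the literal pipe as a character class, [|]+, instead of \|+. The sketch below only checks, in plain Java, that the two forms split a sample event the same way. Whether the Pig Latin lexer accepts the character-class form inside a quoted string is an assumption that would still need testing; the class name and the sample event are made up for illustration.

```java
import java.util.regex.Pattern;

class SplitRegexSketch {
    // Timestamp part of the split regex, copied from the query above.
    static final String TS = "[0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{2}/[0-9]{2}/[0-9]{4}";

    // Backslash form: in Java source the backslash is doubled, so the regex is \|+ followed by TS.
    static final Pattern ESCAPED = Pattern.compile("\\|+" + TS);

    // Character-class form: no backslash needed to match a literal pipe.
    static final Pattern CHAR_CLASS = Pattern.compile("[|]+" + TS);

    static String[] split(Pattern pattern, String event) {
        return pattern.split(event);
    }

    public static void main(String[] args) {
        String event = "ui/start||12:00:01 03/14/2008ui/cancel"; // made-up sample event
        System.out.println(String.join(" , ", split(ESCAPED, event)));
        System.out.println(String.join(" , ", split(CHAR_CLASS, event)));
    }
}
```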

The second problem I have is that I am not able to register jars/funcs
without packaging them into the pig.jar in the
org.apache.pig.impl.builtin package. I have tried everything I can think
of and everything in the documentation. When I register the jar with
PigServer.registerJar and use the fully qualified function name,
all the task trackers fail with:

java.lang.RuntimeException: could not instantiate
'telespree.analytics.pig.LastPageExtractor' with arguments '[]'

I do:

server.registerJar("c:\\telespree.jar");

and

filtered = FOREACH filter1 GENERATE telespree.analytics.pig.LastPageExtractor($8,'.*(ui/cancel.*)|(.*ui/error.*)','[0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{2}/[0-9]{2}/[0-9]{4}'), $15;

I even tried to put these functions in the default package in pig.jar
since I saw in the code you do lookups with

packageImportList.add("");
packageImportList.add("org.apache.pig.builtin.");
packageImportList.add("com.yahoo.pig.yst.sds.ULT.");
packageImportList.add("org.apache.pig.impl.builtin.");

So I figured using the "" import would find my function. However, I get
the same error:

java.lang.RuntimeException: could not instantiate 'LastPageExtractor' with arguments '[]'

However if I package them in org.apache.pig.impl.builtin it all works
fine.

Any help on these 3 areas would be much appreciated!

-Michael




-----Original Message-----
From: Michael Harris [EMAIL PROTECTED]
Sent: Wednesday, April 02, 2008 10:47 AM
To: pig-user@incubator.apache.org
Subject: MapReduceLauncher static fields

Hello,

I have written a pig application that does a fixed set of queries
on-demand through a web interface. I am trying to get the progress of
the queries from the PigServer, but I have noticed that the source of
the progress data is all static fields in the MapReduceLauncher. Clearly
my webapp must be able to handle multiple concurrent pig queries (and be
thread-safe) and I would like to report the progress of each individual
query (job set) to the end user. Do these static fields mean that I
would get the progress of multiple concurrent queries initiated by
different PigServer instances mixed together, or would I get the overall
progress of the MapReduceLauncher for all queries currently being executed?
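
The concern can be illustrated with a small plain-Java sketch. The classes below are hypothetical and are not Pig's MapReduceLauncher. They only show the general Java behaviour that a static field holds one value per JVM (strictly, per class loader), so every instance reads whatever was written last, while an instance field keeps a separate value per query.

```java
class StaticProgressSketch {
    // Hypothetical launcher whose progress lives in a static field:
    // every instance shares the same value.
    static class SharedLauncher {
        static volatile double progress;

        void report(double p) { progress = p; }
        double read() { return progress; }
    }

    // Hypothetical launcher whose progress lives in an instance field:
    // each query keeps its own value.
    static class PerQueryLauncher {
        private volatile double progress;

        void report(double p) { progress = p; }
        double read() { return progress; }
    }

    public static void main(String[] args) {
        SharedLauncher s1 = new SharedLauncher();
        SharedLauncher s2 = new SharedLauncher();
        s1.report(0.25);
        s2.report(0.75);
        System.out.println("shared, query 1 reads: " + s1.read());

        PerQueryLauncher p1 = new PerQueryLauncher();
        PerQueryLauncher p2 = new PerQueryLauncher();
        p1.report(0.25);
        p2.report(0.75);
        System.out.println("per-query, query 1 reads: " + p1.read());
    }
}
```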

Thanks,
Michael