pig-user  

Re: arbitrary LOADing and parsing

Alan Gates
Thu, 25 Sep 2008 13:56:29 -0700

http://incubator.apache.org/pig/version_control.html

Alan.

Earl Cahill wrote:
Alan,


Thanks, will take a look tonight.  I guess I can just check out the source?  
Right now I am doing everything with the jars from the posted tutorial 
(http://wiki.apache.org/pig-data/attachments/PigTutorial/attachments/pigtutorial.tar.gz).
Is there a better place to get the code?

Thanks,
Earl


----- Original Message ----
From: Alan Gates <[EMAIL PROTECTED]>
To: pig-user@incubator.apache.org; Earl Cahill <[EMAIL PROTECTED]>
Sent: Thursday, September 25, 2008 9:50:47 AM
Subject: Re: arbitrary LOADing and parsing

I think what you want here is a load function rather than an eval function. Check out org.apache.pig.LoadFunc. Then your pig latin would look like:

raw = LOAD 'access_log' USING com.loghelper.CommonLogLoader() AS (remoteAddr, remoteLogname, user, time, method, uri, proto, bytes);

Take a look at org.apache.pig.builtin.PigStorage. You should be able to reuse all of it except for Tuple getNext(). In that function, once you get the line from in.readLine, you'll need to do the parsing yourself.
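
As a rough sketch of the "do the parsing yourself" step, something like the following could be called on each line inside getNext(). It is plain Java with no Pig classes, and the class name CommonLogLineParser, the regular expression, and the choice to return null for unparseable lines are all assumptions made for illustration, not anything from Pig or from Earl's code. It assumes the usual Common Log Format layout (host ident user [time] "method uri proto" status bytes) and skips the status code, since the AS clause above names only eight fields.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig: splits one Common Log Format line
// into the eight fields named in the LOAD ... AS clause above.
class CommonLogLineParser {
    // Assumed layout: host ident user [time] "method uri proto" status bytes
    // The status code is matched but not captured.
    private static final Pattern COMMON_LOG = Pattern.compile(
        "(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\" \\d{3} (\\S+)");

    // Returns {remoteAddr, remoteLogname, user, time, method, uri, proto, bytes},
    // or null when the line does not match the assumed layout.
    static String[] parse(String line) {
        Matcher m = COMMON_LOG.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[8];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1); // regex groups are 1-based
        }
        return fields;
    }
}
```

The loader would then copy the eight strings into the Tuple it returns. How malformed lines should be handled (skipped, or emitted with empty fields) is left open here.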

Alan.

Earl Cahill wrote:
I would like to parse a standard access log and get named variables back.  
Thinking I need to read in all the lines, then send them through my parsing 
function.  Perhaps the two steps can be combined, but something like

preraw = LOAD 'access_log' USING PigStorage() AS (line);
raw = FOREACH preraw GENERATE com.loghelper.CommonLogParser(line);

So I have CommonLogParser parsing the line well, but I don't know what to put 
into output so that I can do this

raw = FOREACH preraw GENERATE com.loghelper.CommonLogParser(line) AS 
remoteAddr, remoteLogname, user, time, method, uri, proto, bytes;

Extending EvalFunc<DataBag> I tried doing this

Tuple tuple = new Tuple();

String remoteAddr = commonLogMatcher.group(1);
output.add(new Tuple(remoteAddr));
...
output.add(tuple);

to no avail, and several other schemes failed as well (including extending 
EvalFunc<DataMap>).

Or perhaps there are already parsers that will parse a standard access log?

Ideas?

Thanks,
Earl