pig-user  

Re: arbitrary LOADing and parsing

Earl Cahill
Thu, 25 Sep 2008 17:31:21 -0700

Well, I figured out a couple of things, but the big one was that Datum isn't in 
the pig.jar that comes with the tutorial.  Once I got the source checked out and 
built, I just had to

ArrayList<Datum> list = new ArrayList<Datum>();
list.add(new DataAtom(remoteAddr));
...
return new Tuple(list);

Wow, knowing that would have made my night last night much shorter.  Now, I am 
rather nicely parsing a generic access_log :)
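
For anyone following along, here is a minimal sketch of the line-parsing step 
on its own, in plain Java with no Pig classes.  The class name 
CommonLogLineParser, the regex, and the map return type are illustrative 
assumptions, not the actual loader code from this thread; the field names 
follow the AS clause quoted below.  It assumes the usual Common Log Format 
layout (host, ident, user, bracketed time, quoted request, status, bytes).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: parses one Common Log Format line into named fields.
final class CommonLogLineParser {

    // host ident user [time] "method uri proto" status bytes
    // The status code is matched but not captured, because the field list
    // used in this thread does not include it.
    private static final Pattern COMMON_LOG = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\" \\d{3} (\\S+)$");

    // Returns the fields in declaration order, or null if the line does not match.
    static Map<String, String> parse(String line) {
        Matcher m = COMMON_LOG.matcher(line);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("remoteAddr", m.group(1));
        fields.put("remoteLogname", m.group(2));
        fields.put("user", m.group(3));
        fields.put("time", m.group(4));
        fields.put("method", m.group(5));
        fields.put("uri", m.group(6));
        fields.put("proto", m.group(7));
        fields.put("bytes", m.group(8));
        return fields;
    }
}
```

Inside getNext, each of these values would then be wrapped in a DataAtom and 
added to the list, as in the snippet above.  Lines that do not match return 
null here, so the caller has to decide whether to skip them or fail.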

Thanks,
Earl



----- Original Message ----
From: Earl Cahill <[EMAIL PROTECTED]>
To: pig-user@incubator.apache.org; Alan Gates <[EMAIL PROTECTED]>
Sent: Thursday, September 25, 2008 5:57:05 PM
Subject: Re: arbitrary LOADing and parsing

So I am in getNext, I have the lines parsed and my variables all set.  If I 
have variables called, say

remoteAddr, remoteLogname, user, time, method, uri, proto, bytes

what do I do next?  What do I return?  'fraid the Tuples and the like are still 
rather new to me.

Thanks,
Earl


----- Original Message ----
From: Alan Gates <[EMAIL PROTECTED]>
To: pig-user@incubator.apache.org; Earl Cahill <[EMAIL PROTECTED]>
Sent: Thursday, September 25, 2008 2:53:28 PM
Subject: Re: arbitrary LOADing and parsing

http://incubator.apache.org/pig/version_control.html

Alan.

Earl Cahill wrote:
> Alan,
>
>
> Thanks, will take a look tonight.  I guess I can just check out the source?  
> Right now I am doing everything with the jars from the posted tutorial tarball 
> (http://wiki.apache.org/pig-data/attachments/PigTutorial/attachments/pigtutorial.tar.gz).
> Is there a better place to get the code?
>
> Thanks,
> Earl
>
>
> ----- Original Message ----
> From: Alan Gates <[EMAIL PROTECTED]>
> To: pig-user@incubator.apache.org; Earl Cahill <[EMAIL PROTECTED]>
> Sent: Thursday, September 25, 2008 9:50:47 AM
> Subject: Re: arbitrary LOADing and parsing
>
> I think what you want here is a load function rather than an eval 
> function.  Check out org.apache.pig.LoadFunc.  Then your pig latin would 
> look like:
>
> raw = LOAD 'access_log' USING com.loghelper.CommonLogLoader AS 
> (remoteAddr, remoteLogname, user, time, method, uri, proto, bytes);
>
> Take a look at org.apache.pig.builtin.PigStorage.  You should be able to 
> reuse all of it except for Tuple getNext().  For that function, once 
> you get the line from in.readLine, you'll need to do the parsing yourself.
>
> Alan.
>
> Earl Cahill wrote:
>  
>> I would like to parse a standard access log and get named variables back.  
>> Thinking I need to read in all the lines, then send them through my parsing 
>> function.  Perhaps the two steps can be combined, but something like
>>
>> preraw = LOAD 'access_log' USING PigStorage() AS (line);
>> raw = FOREACH preraw GENERATE com.loghelper.CommonLogParser(line);
>>
>> So I have CommonLogParser parsing the line well, but I don't know what to 
>> put into output so that I can do this
>>
>> raw = FOREACH preraw GENERATE com.loghelper.CommonLogParser(line) AS 
>> remoteAddr, remoteLogname, user, time, method, uri, proto, bytes;
>>
>> Extending EvalFunc<DataBag> I tried doing this
>>
>> Tuple tuple = new Tuple();
>>
>> String remoteAddr = commonLogMatcher.group(1);
>> output.add(new Tuple(remoteAddr));
>> ...
>> output.add(tuple);
>>
>> to no avail, and several other schemes failed as well (including extending 
>> EvalFunc<DataMap>).
>>
>> Or perhaps there are already parsers that will parse a standard access log?
>>
>> Ideas?
>>
>> Thanks,
>> Earl