pig-user  

Re: some few examples

Alan Gates
Mon, 29 Sep 2008 09:49:11 -0700

Earl,

I'm sure others will find these useful, as parsing Apache logs is a common task. Except for MyLength, they seem specific to log parsing, so you might want to put them in an ApacheLogParser package (or a similarly suitable name). Any loader functions should go in org.apache.pig.piggybank.storage.ApacheLogParser, and the evaluation functions under org.apache.pig.piggybank.evaluation.util.ApacheLogParser.

Alan.

Earl Cahill wrote:
For now, I will put up an examples page on a site of mine.  As for the 
code, I could maybe put up a google code pig examples project or 
something?

I am mostly interested in parsing Apache logs, and I understand there are 
likely other pig uses, but here is some code I have either written or would 
like to write:

CommonLogParser - parses the standard Apache access_log
CombinedLogParser - parses a log based on the combined LogFormat
DayExtractor - given the standard Apache time format (%t), extracts the day 
(MM-dd-yyyy)
HostExtractor - given a url, extracts the host
IsSearchBotHit - given a user agent, determines if the hit came from a search bot
IsPageView - given a userAgent and a uri, determines if the hit is a page view 
(i.e., an html hit, rather than a js, image or whatever hit)
MyLength - returns the length of the given field
SearchEngineExtractor - given the userAgent, when appropriate, returns a name 
for the search engine like "Google", "Google Uzbekistan" or "Godado"
SearchTermsExtractor - given the userAgent, when appropriate, returns the search 
terms
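
To make the first few entries concrete, here is a minimal sketch in plain Java of the kind of logic a common-log parser, a day extractor and a host extractor might contain. It is not the code described in this message: the class name ApacheLogSketch and its method names are hypothetical, it assumes the stock Common Log Format (host ident authuser [time] "request" status bytes), and it leaves out the Pig UDF wrapper entirely.

```java
import java.net.URI;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; names are illustrative, not the classes from this thread.
final class ApacheLogSketch {
    // Assumed layout: host ident authuser [time] "request" status bytes
    private static final Pattern COMMON_LOG = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)$");

    // Apache %t without the brackets, e.g. 10/Oct/2000:13:55:36 -0700
    private static final DateTimeFormatter APACHE_TIME =
        DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.US);

    private static final DateTimeFormatter DAY =
        DateTimeFormatter.ofPattern("MM-dd-yyyy", Locale.US);

    /** Returns the seven fields of a common-log line, or null if it does not match. */
    static String[] parseCommonLog(String line) {
        Matcher m = COMMON_LOG.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[7];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    /** Turns an Apache %t value (with or without brackets) into MM-dd-yyyy. */
    static String extractDay(String apacheTime) {
        String bare = apacheTime.replaceAll("^\\[|\\]$", "");
        // The day is taken in the log's own UTC offset, not converted to UTC.
        return OffsetDateTime.parse(bare, APACHE_TIME).format(DAY);
    }

    /** Returns the host part of a url, or null if the url is null or malformed. */
    static String extractHost(String url) {
        if (url == null) {
            return null;
        }
        try {
            return URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }
}
```

For example, ApacheLogSketch.extractDay("[10/Oct/2000:13:55:36 -0700]") gives "10-10-2000". A real piggybank version would wrap each method in the appropriate Pig function class and would need to decide how to report unparseable lines.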

Most of the classes are rather short, but I think folks would rather not have 
to rewrite them.  Except for the SearchTermsExtractor, I am either done or 
pretty close on all of these.  A couple of them may require some maintenance, 
such as the search engine classes.

With the classes, I think I could do pretty well everything on my list.

I would like the classes to be production worthy and would be happy to 
contribute them.  Thoughts?

Thanks,
Earl



----- Original Message ----
From: Alan Gates <[EMAIL PROTECTED]>
To: pig-user@incubator.apache.org; Earl Cahill <[EMAIL PROTECTED]>
Sent: Thursday, September 25, 2008 9:29:14 AM
Subject: Re: some few examples

I don't think we have anything like this yet, but I think having a PigUserExamples page, with links to pages of specific examples, like yours, would be great. The PigUserExamples page could be linked off the main page.

As for where to put your code: if it's something that could actually be used for pig scripts, it can go in contrib under piggybank (the user contributed UDFs). If it's really for tutorial purposes and not production worthy, I'm not sure. We could add a tutorial section to contrib. I think the existing tutorial is a unit aimed at helping people get started, so we don't want to add to it.

Alan.

Earl Cahill wrote:
howdy,

Just starting to dive into pig, and have had a hard time finding examples.  I 
would like to put up some examples (on the wiki?) of what I hope to be simple 
scripts that could help find the following on a per host / per day basis

hits
hits per canonicalized userAgent
average microseconds to serve per uri
hits per canonicalized search engine
hits per canonicalized search engine terms
bytes
hits per referer
hits per canonicalized referer host
etc
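
As a rough illustration of what "per host / per day" means for two of these (hits and bytes), here is a minimal sketch in plain Java. In a Pig script the same thing would be a GROUP on (host, day) followed by COUNT and SUM; the Java below only mirrors that grouping logic. The class HitsPerHostDaySketch, the Hit type and its fields are hypothetical, and the input is assumed to be already parsed.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch of per host / per day aggregation over parsed hits.
final class HitsPerHostDaySketch {
    /** One already-parsed log line; the field names are assumptions. */
    static final class Hit {
        final String host;
        final String day;   // e.g. "09-25-2008"
        final long bytes;

        Hit(String host, String day, long bytes) {
            this.host = host;
            this.day = day;
            this.bytes = bytes;
        }

        /** Grouping key: host and day joined by a tab. */
        String key() {
            return host + "\t" + day;
        }
    }

    /** Number of hits for each (host, day) pair, sorted by key. */
    static Map<String, Long> countHits(List<Hit> hits) {
        return hits.stream().collect(
            Collectors.groupingBy(Hit::key, TreeMap::new, Collectors.counting()));
    }

    /** Total bytes served for each (host, day) pair, sorted by key. */
    static Map<String, Long> sumBytes(List<Hit> hits) {
        return hits.stream().collect(
            Collectors.groupingBy(Hit::key, TreeMap::new,
                Collectors.summingLong((Hit h) -> h.bytes)));
    }
}
```

The other measures in the list follow the same pattern with a different grouping key (userAgent, referer host, and so on) or a different aggregate (an average instead of a count).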

Has such a library already been started?

Some of the scripts will have to rely on some java helper code, which I would 
be happy to contribute, but where can I put it?  Perhaps in tutorial.jar?  
helper.jar?  Anyone have thoughts about such things on the wiki?

Thanks,
Earl