pig-user  

Re: some few examples

Earl Cahill
Thu, 25 Sep 2008 11:01:43 -0700

For now, I will put the examples up on a page on one of my own sites.  As for 
the code, I could maybe set up a Google Code project for Pig examples or 
something?

I am mostly interested in parsing Apache logs, and I understand there are 
likely other Pig uses, but here is some code I have either written or would 
like to write:

CommonLogParser - parses the standard Apache access_log
CombinedLogParser - parses a log based on the combined LogFormat
DayExtractor - given the standard Apache time format (%t), extracts the day 
(MM-dd-yyyy)
HostExtractor - given a url, extracts the host
IsSearchBotHit - given a user agent, determines if the hit came from a search bot
IsPageView - given a userAgent and a uri, determines if the hit is a page view 
(i.e., an html hit, rather than a js, image or whatever hit)
MyLength - return the length of the given field
SearchEngineExtractor - given the userAgent, when appropriate, return a name 
for the search engine like "Google", "Google Uzbekistan" or "Godado"
SearchTermsExtractor - given the userAgent, when appropriate, return the search 
terms
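
To illustrate the kind of short helper described in the list above, here is a 
minimal sketch of the day extraction and host extraction logic in plain Java.  
It is not the actual code from this thread: the class name LogFieldHelpers and 
the method names are hypothetical, returning null on bad input is an assumed 
policy, and the Pig UDF wrapper that would call these methods is left out.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

// Hypothetical helper class; names are illustrative, not from the thread.
final class LogFieldHelpers {

    // Apache %t values look like [25/Sep/2008:11:01:43 -0700].
    private static final DateTimeFormatter APACHE_TIME =
            DateTimeFormatter.ofPattern("dd/MMM/uuuu:HH:mm:ss Z", Locale.ENGLISH);

    // Output format named in the list: MM-dd-yyyy.
    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("MM-dd-uuuu", Locale.ENGLISH);

    private LogFieldHelpers() {
    }

    // Returns the day as MM-dd-yyyy, or null if the input cannot be parsed.
    static String extractDay(String apacheTime) {
        if (apacheTime == null) {
            return null;
        }
        String s = apacheTime.trim();
        // Strip the surrounding brackets that %t adds, if present.
        if (s.startsWith("[") && s.endsWith("]")) {
            s = s.substring(1, s.length() - 1);
        }
        try {
            return OffsetDateTime.parse(s, APACHE_TIME).format(DAY);
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    // Returns the lower-cased host of a url, or null if there is none.
    static String extractHost(String url) {
        if (url == null) {
            return null;
        }
        try {
            String host = new URI(url.trim()).getHost();
            return host == null ? null : host.toLowerCase(Locale.ENGLISH);
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

With this sketch, extractDay("[25/Sep/2008:11:01:43 -0700]") gives 
"09-25-2008" and extractHost("http://www.Example.com/path?q=1") gives 
"www.example.com".  The day is taken in the log line's own UTC offset rather 
than converted to another time zone, which is one possible choice among several.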

Most of the classes are rather short, but I think folks would rather not have 
to rewrite them.  Except for the SearchTermsExtractor, I am either done or 
pretty close on all of these.  A couple of them, such as the search engine 
classes, may require some maintenance.

With these classes, I think I could do pretty much everything on my list.

I would like the classes to be production worthy and would be happy to 
contribute them.  Thoughts?

Thanks,
Earl



----- Original Message ----
From: Alan Gates <[EMAIL PROTECTED]>
To: pig-user@incubator.apache.org; Earl Cahill <[EMAIL PROTECTED]>
Sent: Thursday, September 25, 2008 9:29:14 AM
Subject: Re: some few examples

I don't think we have anything like this yet, but I think having a 
PigUserExamples page, with links to pages of specific examples, like 
yours, would be great.  The PigUserExamples page could be linked off the 
main page.

As far as where to put your code, if it's something that could actually 
be used for pig scripts, it can go in contrib under piggybank (the user 
contributed UDFs).  If it's really for tutorial purposes and not 
production worthy, I'm not sure where it should go.  We could add a tutorial 
section to 
contrib.  I think the existing tutorial is a unit aimed at helping 
people get started, so we don't want to add to it.

Alan.

Earl Cahill wrote:
> howdy,
>
> Just starting to dive into pig, and have had a hard time finding examples.  I 
> would like to put up some examples (on the wiki?) of what I hope to be simple 
> scripts that could help find the following on a per host / per day basis
>
> hits
> hits per canonized userAgent
> average microseconds to serve per uri
> hits per canonized search engine
> hits per canonized search engine terms
> bytes
> hits per referer
> hits per canonized referer host
> etc
>
> Has such a library already been started?
>
> Some of the scripts will have to rely on some java helper code, which I would 
> be happy to contribute, but where can I put it?  Perhaps in tutorial.jar?  
> helper.jar?  Anyone have thoughts about such things on the wiki?
>
> Thanks,
> Earl