Alan Gates
Mon, 29 Sep 2008 09:49:11 -0700
Earl,

I'm sure others will find these useful, as parsing apache logs is a common task. Except for MyLength, they seem specific to log parsing. So you might want to put them in an ApacheLogParser package (or some similar suitable name). Any loader functions should go in org.apache.pig.piggybank.storage.ApacheLogParser, and the evaluation functions under org.apache.pig.piggybank.evaluation.util.ApacheLogParser.
Alan.

Earl Cahill wrote:
For now, I will put the examples up on an examples page on a site of mine. As for the code, I could maybe put up a google code pig examples project or something?

I am mostly interested in parsing apache logs, and I understand there are likely other pig uses, but here is some code I have either written or would like to write:

CommonLogParser - parses the standard apache access_log
CombinedLogParser - parses a log based on the combined LogFormat
DayExtractor - given the standard apache time format (%t), extracts the day (MM-dd-yyyy)
HostExtactor - given a url, extracts the host
IsSearchBotHit - given a user agent, determines if the hit came from a search bot
IsPageView - given a userAgent and a uri, determines if the hit is a page view (i.e., an html hit, rather than a js, image or whatever hit)
MyLength - returns the length of the given field
SearchEngineExtractor - given the userAgent, when appropriate, returns a name for the search engine like "Google", "Google Uzbekistan" or "Godado"
SearchTermsExtractor - given the userAgent, when appropriate, returns the search terms

Most of the classes are rather short, but I think folks would rather not have to rewrite them. Except for the SearchTermsExtractor, I am either done or pretty close on all of these. A couple of them, like the search engine classes, may require some maintenance. With these classes, I think I could do pretty well everything on my list.

I would like the classes to be production worthy and would be happy to contribute them.

Thoughts?

Thanks,
Earl

----- Original Message ----
From: Alan Gates <[EMAIL PROTECTED]>
To: pig-user@incubator.apache.org; Earl Cahill <[EMAIL PROTECTED]>
Sent: Thursday, September 25, 2008 9:29:14 AM
Subject: Re: some few examples

I don't think we have anything like this yet, but I think having a PigUserExamples page, with links to pages of specific examples, like yours, would be great.
The PigUserExamples page could be linked off the main page.

As far as where to put your code, if it's something that could actually be used for pig scripts, it can go in contrib under piggybank (the user contributed UDFs). If it's really for tutorial purposes and not production worthy I'm not sure. We could add a tutorial section to contrib. I think the existing tutorial is a unit aimed at helping people get started, so we don't want to add to it.

Alan.

Earl Cahill wrote:

howdy,

Just starting to dive into pig, and have had a hard time finding examples. I would like to put up some examples (on the wiki?) of what I hope to be simple scripts that could help find the following on a per host / per day basis:

  hits
  hits per canonized userAgent
  average microseconds to serve per uri
  hits per canonized search engine
  hits per canonized search engine terms
  bytes
  hits per referer
  hits per canonized referer host
  etc

Has such a library already been started? Some of the scripts will have to rely on some java helper code, which I would be happy to contribute, but where can I put it? Perhaps in tutorial.jar? helper.jar?

Anyone have thoughts about such things on the wiki?

Thanks,
Earl
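
For readers wondering what the day and host extraction helpers discussed in this thread might look like, here is a minimal sketch in plain Java. It is not Earl's code and it does not use the Pig UDF API; the class name LogFieldSketch and the method names extractDay and extractHost are hypothetical, chosen only for illustration. It assumes the %t field looks like [29/Sep/2008:09:49:11 -0700], reports the day in the log line's own UTC offset, and returns null for input it cannot parse. In a real piggybank function this logic would be wrapped in whatever evaluation function interface the Pig release in use provides.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

// Hypothetical helper class; names are illustrative, not from piggybank.
final class LogFieldSketch {
    // Apache %t format without the surrounding brackets, e.g. 29/Sep/2008:09:49:11 -0700
    private static final DateTimeFormatter APACHE_TIME =
            DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    // Output format named in the thread: MM-dd-yyyy
    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("MM-dd-yyyy", Locale.ENGLISH);

    // Returns the day (MM-dd-yyyy) for an Apache %t value, or null if it cannot be parsed.
    static String extractDay(String apacheTime) {
        if (apacheTime == null) {
            return null;
        }
        String s = apacheTime.trim();
        // Strip the optional [ ] that Apache puts around %t.
        if (s.startsWith("[") && s.endsWith("]")) {
            s = s.substring(1, s.length() - 1);
        }
        try {
            return OffsetDateTime.parse(s, APACHE_TIME).format(DAY);
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    // Returns the lower-cased host of an absolute URL, or null if there is none.
    static String extractHost(String url) {
        if (url == null) {
            return null;
        }
        try {
            String host = new URI(url.trim()).getHost();
            return host == null ? null : host.toLowerCase(Locale.ENGLISH);
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

Under these assumptions, extractDay("[29/Sep/2008:09:49:11 -0700]") gives 09-29-2008 and extractHost("http://www.Example.com/index.html?q=pig") gives www.example.com. Returning null rather than throwing is a design choice for this sketch, so that one malformed log line does not abort a whole job.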