Thanks for the plug, Ricardo!
A word of warning regarding that blog post -- it's written to explain
things, not to show how one would run them in production. So it's a
bit verbose and does silly things like calling out to awk. Don't take
it as a style guide :-).
Someone recently commented that it's way too long for the job it does,
so I shrunk it -- here's an equivalent but terser version:
REGISTER /home/dvryaboy/src/pig/trunk/piggybank.jar;

DEFINE LogLoader storage.apachelog.CombinedLogLoader();
DEFINE iplookup `ipwrapper.sh $GEO`
    SHIP('ipwrapper.sh')
    CACHE('/home/dvryaboy/tmp/$GEO#$GEO');

logs = LOAD '$LOGS' USING LogLoader AS
    (remoteAddr, remoteLogname, user, time, method,
     uri, proto, status, bytes, referer, userAgent);

notbots = FILTER logs BY bytes != '-' AND uri matches '/apache.*'
    AND (NOT filtering.IsBotUA(userAgent));

with_country = STREAM notbots THROUGH iplookup
    AS (country_code, country, state, city, ip, time, uri, bytes, userAgent);

geo_uri_group_counts =
    ORDER (FOREACH (GROUP with_country BY country_code)
           GENERATE group, COUNT($1) AS cnt,
                    SUM($1.bytes) AS total_bytes)
    BY cnt DESC;
STORE geo_uri_group_counts INTO 'by_country.tsv';

us_only = FILTER with_country BY country_code == 'US';
by_state_cnt =
    ORDER (FOREACH (GROUP us_only BY state)
           GENERATE group, COUNT(us_only.state) AS cnt,
                    SUM(us_only.bytes) AS total_bytes)
    BY cnt;
STORE by_state_cnt INTO 'by_state.tsv';

-- and more of the same for project_count
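For the curious: the thread doesn't include ipwrapper.sh itself, but a
streaming wrapper in that spirit is just a filter over stdin/stdout. Here's
a hypothetical sketch -- the case statement is only a placeholder, not the
real GeoIP lookup against the $GEO database:

```shell
#!/bin/sh
# Hypothetical sketch of a streaming wrapper like ipwrapper.sh (the real
# script isn't shown in this thread). Pig writes one tab-separated tuple
# per line to our stdin; we write each tuple back out with a country code
# prepended.
TAB=$(printf '\t')

lookup_cc() {
    # Placeholder mapping for illustration -- a real version would query
    # the GeoIP database that Pig ships to the node via CACHE.
    case "$1" in
        10.*|192.168.*) echo "US" ;;
        *)              echo "--" ;;
    esac
}

annotate() {
    while IFS="$TAB" read -r ip rest; do
        printf '%s\t%s\t%s\n' "$(lookup_cc "$ip")" "$ip" "$rest"
    done
}

annotate  # Pig feeds tuples on stdin and reads tuples from our stdout
```

The point is that streaming keeps the contract dead simple: lines in,
lines out, tabs between fields.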
-Dmitriy
On Mon, Jan 11, 2010 at 11:35 AM, Ricardo Varela <[email protected]> wrote:
> hey Chris,
>
> If you find it hard to define UDFs, maybe you can start by using
> scripts written in PHP if you feel more comfortable with it, or even
> shell scripts. You can do that with the Pig streaming interface
> (http://wiki.apache.org/pig/PigStreamingFunctionalSpec). It won't
> perform as well as proper UDFs, but it's useful for experimenting (I
> often prototype with STREAM first and then create UDFs if needed).
>
> I found the examples in the doc and in the following article from
> Dmitriy Ryaboy very useful to start with:
>
> http://www.cloudera.com/blog/2009/06/17/analyzing-apache-logs-with-pig/
>
> Good luck in your tests and hope you like Pig and Hadoop!
>
> Saludos!
>
> ---
> ricardo
>
> On Mon, Jan 11, 2010 at 6:29 PM, Chris Hartjes <[email protected]>
> wrote:
>> My apologies if this is the wrong mailing list to ask this question. I've
>> started playing around with Pig and Hadoop, with the intention of using it
>> to do some analysis of a collection of MySQL slow query log files. I am not
>> a Java programmer (been using PHP for a very long time, dabbled in other
>> languages as required), so I am slightly intimidated by Pig's
>> documentation on writing your own UDFs.
>>
>> If anyone has done anything like this, I would appreciate some tips and some
>> pointers on how to approach it. Sure, I could hunker down and learn to use
>> some CLI tools for analyzing the slow query log, but then I couldn't use Pig
>> and Hadoop. ;)
>>
>> --
>> Chris Hartjes
>>
>
>
>
> --
> Ricardo Varela - http://phobeo.com - http://twitter.com/phobeo
> "Though this be madness, yet there's method in 't"
>