Thanks for the plug, Ricardo!
A word of warning regarding that blog post -- it's written to explain
things, not to show how one would run them in production. So it's a
bit verbose and does silly things like calling out to awk. Don't take
it as a style guide :-).

Someone recently commented that it's way too long for the job it does,
so I shrunk it -- here's an equivalent but terser version:

register /home/dvryaboy/src/pig/trunk/piggybank.jar;
DEFINE LogLoader storage.apachelog.CombinedLogLoader();
DEFINE iplookup `ipwrapper.sh $GEO` ship('ipwrapper.sh')
cache('/home/dvryaboy/tmp/$GEO#$GEO');

logs = LOAD '$LOGS' USING LogLoader AS
   (remoteAddr, remoteLogname, user, time, method,
    uri, proto, status, bytes, referer, userAgent);

notbots = FILTER logs BY bytes != '-' AND uri matches '/apache.*'
    AND (NOT filtering.IsBotUA(userAgent));

with_country = STREAM notbots THROUGH iplookup
    AS (country_code, country, state, city, ip, time, uri, bytes, userAgent);

geo_uri_group_counts =
   ORDER (FOREACH (GROUP with_country BY country_code)
              GENERATE group, COUNT($1) AS cnt, SUM($1.bytes) AS total_bytes)
   BY cnt DESC;

STORE geo_uri_group_counts INTO 'by_country.tsv';

by_state_cnt =
   ORDER (FOREACH (GROUP (FILTER with_country BY country_code == 'US') BY state)
              GENERATE group, COUNT($1) AS cnt, SUM($1.bytes) AS total_bytes)
   BY cnt;

STORE by_state_cnt INTO 'by_state.tsv';

-- and more of the same for project_count
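
In case the streaming bit is opaque: the script behind that DEFINE is
just a filter that reads tab-separated tuples on stdin and writes them
back out with extra fields prepended. A minimal sketch of the shape --
the function name and the fixed 'XX' country code are mine, the real
ipwrapper.sh does an actual GeoIP lookup against the $GEO file:

```shell
# stand-in for ipwrapper.sh: one tab-separated tuple in, one out
ipwrapper() {
  geo_db="${1:-GeoIP.dat}"   # the DEFINE passes this in as $GEO
  while IFS= read -r line; do
    # a real wrapper would look the first field (the IP) up in "$geo_db";
    # here we just prepend a placeholder country code
    printf 'XX\t%s\n' "$line"
  done
}

# example: one log tuple in, one geo-tagged tuple out on stdout
printf '1.2.3.4\t/apache/foo\t1234\n' | ipwrapper
```

Anything you write in any language works the same way, as long as it
streams line-by-line and doesn't buffer the whole input.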

-Dmitriy

On Mon, Jan 11, 2010 at 11:35 AM, Ricardo Varela <[email protected]> wrote:
> hey Chris,
>
> If you find it hard to define UDFs, maybe you can start by using
> scripts written in PHP if you feel more comfortable with it, or even
> shell scripts. You can do that with the Pig streaming interface
> (http://wiki.apache.org/pig/PigStreamingFunctionalSpec) It won't have
> as much perf as the proper UDFs, but it is useful for trying (I often
> prototype with STREAM first and then create some UDFs if needed)
>
> I found the examples in the doc and in the following article from
> Dmitriy Ryaboy very useful to start with:
>
> http://www.cloudera.com/blog/2009/06/17/analyzing-apache-logs-with-pig/
>
> Good luck in your tests and hope you like Pig and Hadoop!
>
> Saludos!
>
> ---
> ricardo
>
> On Mon, Jan 11, 2010 at 6:29 PM, Chris Hartjes <[email protected]> 
> wrote:
>> My apologies if this is the wrong mailing list to ask this question.  I've
>> started playing around with Pig and Hadoop, with the intention of using it
>> to do some analysis of a collection of MySQL slow query log files.  I am not
>> a Java programmer (been using PHP for a very long time, dabbled in other
>> languages as required) so I am slightly intimidated by the documentation in
>> Pig for writing your own UDF's.
>>
>> If anyone has done anything like this, I would appreciate some tips and some
>> pointers on how to approach it.  Sure, I could hunker down and learn to use
>> some CLI tools for analyzing the slow query log, but then I couldn't use Pig
>> and Hadoop. ;)
>>
>> --
>> Chris Hartjes
>>
>
>
>
> --
> Ricardo Varela  -  http://phobeo.com  -  http://twitter.com/phobeo
> "Though this be madness, yet there's method in 't"
>
