Please do!

T#

On Thu, Jan 14, 2010 at 12:02 AM, Alan Gates <[email protected]> wrote:
> Theo,
>
> This looks really interesting.  Can I put a link to it on our page for tools
> used with Pig, http://wiki.apache.org/pig/PigTools ?
>
> Alan.
>
> On Jan 13, 2010, at 10:38 AM, Theo Hultberg wrote:
>
>> Hi,
>>
>> I've written a Ruby DSL for writing Pig scripts, which I hope might
>> interest some of you. It makes it possible to do a lot of things you
>> can't do in Pig Latin, like loops, reuse code through functions, and
>> introspection on relation schemas. Basically you write some Ruby code
>> that looks a lot like Pig Latin, and you get the equivalent Pig Latin
>> as output. Loops are unrolled, functions are inlined, and so on.
>>
>> There's a lot of documentation and examples on GitHub:
>> http://github.com/iconara/piglet, and here are a few examples too:
>>
>> If you run this Ruby code through Piglet
>>
>>  a = load 'input', :schema => [:x, :y]
>>  b = a.group :x
>>  store b, 'output'
>>
>> you will get the following Pig Latin code:
>>
>>  relation_2 = LOAD 'input' AS (x, y);
>>  relation_1 = GROUP relation_2 BY x;
>>  STORE relation_1 INTO 'output';
>>
>> More or less the same, don't you think? (Unfortunately, Piglet can't
>> determine the names of the variables, so the relation names are not
>> fantastic. I might get that working in a future version.)
>>
>> I wrote Piglet when some Pig scripts I was working on started to get
>> very repetitive. I had a relation with a few fields that were keys and
>> a few that were numbers and I wanted to get the sums for each value of
>> each of the key fields. This meant having to repeat the same GROUP and
>> FOREACH operations once for each key, even though the only thing that
>> changed was the name of the field that I grouped by. Having to repeat
>> the same code again and again for every key was frustrating, and I
>> dreamed up a way of doing the same thing in Ruby. With Piglet I can
>> now do something like this:
>>
>>  input = load('input',
>>    :schema => %w(country browser site pages_visited visit_duration))
>>
>>  %w(country browser site).each do |dimension|
>>   grouped = input.group(dimension).foreach do |r|
>>     [
>>       r[0],
>>       r[1].pages_visited.avg,
>>       r[1].visit_duration.sum
>>     ]
>>   end
>>
>>   store(grouped, "output-#{dimension}")
>>  end
>>
>> which will be translated to this Pig Latin code:
>>
>>  relation_1 = LOAD 'input' AS (country, browser, site, pages_visited,
>>      visit_duration);
>>  relation_3 = GROUP relation_1 BY country;
>>  relation_2 = FOREACH relation_3 GENERATE $0, AVG($1.pages_visited),
>>      SUM($1.visit_duration);
>>  STORE relation_2 INTO 'output-country';
>>  relation_5 = GROUP relation_1 BY browser;
>>  relation_4 = FOREACH relation_5 GENERATE $0, AVG($1.pages_visited),
>>      SUM($1.visit_duration);
>>  STORE relation_4 INTO 'output-browser';
>>  relation_7 = GROUP relation_1 BY site;
>>  relation_6 = FOREACH relation_7 GENERATE $0, AVG($1.pages_visited),
>>      SUM($1.visit_duration);
>>  STORE relation_6 INTO 'output-site';
>>
>> where you can see how the loop has been unrolled and all the
>> operations repeated for each key.
>>
>> I hope that Piglet will help some of you write DRYer code. It doesn't
>> solve all problems, and there are things which are not supported at
>> all yet, but with your help I think it can be a very good companion to
>> Pig.
>>
>> If you want to know more, read the documentation on GitHub:
>> http://github.com/iconara/piglet, or contact me through GitHub, by
>> e-mail ([email protected]), or on Twitter (@iconara).
>>
>> yours,
>> Theo
>
>
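
For readers curious how the loop unrolling described in the quoted message can come about, here is a minimal sketch in plain Ruby. It is not Piglet's implementation and does not use Piglet's API: the ToyScript class, its methods, and unrolled_example are hypothetical names made up for this illustration. The idea it shows is that each DSL call records one Pig Latin statement as a string, so an ordinary Ruby loop that makes the calls three times leaves three copies of the statements in the output. The relation numbering here is simply sequential and will differ from what Piglet prints.

```ruby
# Toy illustration only: NOT Piglet's code. Every name below is hypothetical.
class ToyScript
  def initialize
    @lines = []    # generated Pig Latin statements, in call order
    @counter = 0   # used to invent relation names (relation_1, relation_2, ...)
  end

  # Records a LOAD statement and returns the generated relation name.
  def load(path, fields)
    emit("LOAD '#{path}' AS (#{fields.join(', ')})")
  end

  # Records a GROUP statement and returns the generated relation name.
  def group(relation, field)
    emit("GROUP #{relation} BY #{field}")
  end

  # Records a FOREACH ... GENERATE statement over the given expressions.
  def foreach(relation, *expressions)
    emit("FOREACH #{relation} GENERATE #{expressions.join(', ')}")
  end

  # Records a STORE statement; it produces no new relation.
  def store(relation, path)
    @lines << "STORE #{relation} INTO '#{path}';"
  end

  def to_pig_latin
    @lines.join("\n")
  end

  private

  # Appends "relation_N = <statement>;" and hands back the new name.
  def emit(statement)
    @counter += 1
    name = "relation_#{@counter}"
    @lines << "#{name} = #{statement};"
    name
  end
end

# The Ruby loop runs at generation time, so its body is recorded once per
# dimension: the loop is "unrolled" in the resulting Pig Latin.
def unrolled_example
  script = ToyScript.new
  input = script.load('input',
                      %w[country browser site pages_visited visit_duration])
  %w[country browser site].each do |dimension|
    grouped = script.group(input, dimension)
    summed = script.foreach(grouped, '$0', 'AVG($1.pages_visited)',
                            'SUM($1.visit_duration)')
    script.store(summed, "output-#{dimension}")
  end
  script.to_pig_latin
end

puts unrolled_example
```

Running this prints one LOAD statement followed by a GROUP, FOREACH and STORE for each of the three dimensions, the same shape as the Pig Latin in the quoted message.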
