Theo, this is awesome.  We at Twitter are hoping to contribute to and extend
the great work you've done.

Kevin

On Wed, Jan 13, 2010 at 10:01 PM, Theo Hultberg <[email protected]> wrote:

> Please do!
>
> T#
>
> On Thu, Jan 14, 2010 at 12:02 AM, Alan Gates <[email protected]> wrote:
> > Theo,
> >
> > This looks really interesting.  Can I put a link to it on our page for
> > tools to use with Pig, http://wiki.apache.org/pig/PigTools ?
> >
> > Alan.
> >
> > On Jan 13, 2010, at 10:38 AM, Theo Hultberg wrote:
> >
> >> Hi,
> >>
> >> I've written a Ruby DSL for writing Pig scripts, which I hope might
> >> interest some of you. It makes it possible to do a lot of things you
> >> can't do in Pig Latin, like loops, code reuse through functions, and
> >> introspection on relation schemas. Basically you write some Ruby code
> >> that looks a lot like Pig Latin, and you get the equivalent Pig Latin
> >> as output. Loops are unrolled, functions are inlined, and so on.
> >>
> >> There's a lot of documentation and examples on GitHub:
> >> http://github.com/iconara/piglet, and here are a few examples too:
> >>
> >> If you run this Ruby code through Piglet
> >>
> >>  a = load 'input', :schema => [:x, :y]
> >>  b = a.group :x
> >>  store b, 'output'
> >>
> >> you will get the following Pig Latin code:
> >>
> >>  relation_2 = LOAD 'input' AS (x, y);
> >>  relation_1 = GROUP relation_2 BY x;
> >>  STORE relation_1 INTO 'output';
> >>
> >> More or less the same, don't you think? (Piglet can't determine the
> >> names of the variables, unfortunately, so the relation names are not
> >> fantastic. I might get that working in a future version.)
> >>
> >> I wrote Piglet when some Pig scripts I was working on started to get
> >> very repetitive. I had a relation with a few fields that were keys and
> >> a few that were numbers and I wanted to get the sums for each value of
> >> each of the key fields. This meant having to repeat the same GROUP and
> >> FOREACH operations once for each key, even though the only thing that
> >> changed was the name of the field that I grouped by. Having to repeat
> >> the same code again and again for every key was frustrating, and I
> >> dreamed up a way of doing the same thing in Ruby. With Piglet I can
> >> now do something like this:
> >>
> >>  input = load('input', :schema => %w(country browser site
> >> pages_visited visit_duration))
> >>
> >>  %w(country browser site).each do |dimension|
> >>   grouped = input.group(dimension).foreach do |r|
> >>     [
> >>       r[0],
> >>       r[1].pages_visited.avg,
> >>       r[1].visit_duration.sum
> >>     ]
> >>   end
> >>
> >>   store(grouped, "output-#{dimension}")
> >>  end
> >>
> >> which will be translated to this Pig Latin code:
> >>
> >>  relation_1 = LOAD 'input' AS (country, browser, site, pages_visited,
> >> visit_duration);
> >>  relation_3 = GROUP relation_1 BY country;
> >>  relation_2 = FOREACH relation_3 GENERATE $0, AVG($1.pages_visited),
> >> SUM($1.visit_duration);
> >>  STORE relation_2 INTO 'output-country';
> >>  relation_5 = GROUP relation_1 BY browser;
> >>  relation_4 = FOREACH relation_5 GENERATE $0, AVG($1.pages_visited),
> >> SUM($1.visit_duration);
> >>  STORE relation_4 INTO 'output-browser';
> >>  relation_7 = GROUP relation_1 BY site;
> >>  relation_6 = FOREACH relation_7 GENERATE $0, AVG($1.pages_visited),
> >> SUM($1.visit_duration);
> >>  STORE relation_6 INTO 'output-site';
> >>
> >> where you can see how the loop has been unrolled and all the
> >> operations repeated for each key.
> >>
> >> I hope that Piglet will help some of you write DRYer code. It doesn't
> >> solve all problems, and there are things which are not supported at
> >> all yet, but with your help I think it can be a very good companion to
> >> Pig.
> >>
> >> If you want to know more, read the documentation on GitHub:
> >> http://github.com/iconara/piglet, or contact me through GitHub, by
> >> e-mail ([email protected]), or via Twitter (@iconara).
> >>
> >> yours,
> >> Theo
> >
> >
>
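
For readers wondering how a Ruby loop can end up "unrolled" into repeated Pig Latin statements, here is a minimal sketch of the general idea. It is a toy emitter and not Piglet's actual implementation or API: the ToyScript class and its load, group, and store methods are hypothetical names made up for illustration, and the alias numbering differs from what Piglet produces. Each method call appends one Pig Latin statement and returns a generated alias, so an ordinary Ruby loop emits one GROUP and one STORE per iteration.

```ruby
# Toy illustration only. ToyScript and its methods are hypothetical
# and are not part of Piglet.
class ToyScript
  attr_reader :lines

  def initialize
    @lines = []
    @count = 0
  end

  # Emits a LOAD statement and returns the generated alias.
  def load(path, schema)
    emit("LOAD '#{path}' AS (#{schema.join(', ')})")
  end

  # Emits a GROUP statement and returns the generated alias.
  def group(relation, field)
    emit("GROUP #{relation} BY #{field}")
  end

  # Emits a STORE statement; STORE does not define a new alias.
  def store(relation, path)
    @lines << "STORE #{relation} INTO '#{path}';"
    nil
  end

  private

  # Ruby code cannot see the name of the variable a result is assigned
  # to, so every statement gets a generated alias instead.
  def emit(expression)
    @count += 1
    name = "relation_#{@count}"
    @lines << "#{name} = #{expression};"
    name
  end
end

script = ToyScript.new
input = script.load('input', %w(country browser site))

# The loop body runs once per dimension, so its statements are
# emitted once per dimension: the loop is "unrolled" in the output.
%w(country browser).each do |dimension|
  grouped = script.group(input, dimension)
  script.store(grouped, "output-#{dimension}")
end

puts script.lines
```

Because the Pig Latin is produced as a side effect of running plain Ruby, anything Ruby can do (loops, methods, conditionals) is available for free when generating the script.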
