Hi,
I've written a Ruby DSL for writing Pig scripts, which I hope might
interest some of you. It makes it possible to do a lot of things you
can't do in Pig Latin, like looping, reusing code through functions,
and introspecting relation schemas. Basically, you write some Ruby
code that looks a lot like Pig Latin, and you get the equivalent Pig
Latin as output. Loops are unrolled, functions are inlined, and so on.
There's a lot of documentation and examples on GitHub
(http://github.com/iconara/piglet), and here are a couple of examples
too.
If you run this Ruby code through Piglet
a = load 'input', :schema => [:x, :y]
b = a.group :x
store b, 'output'
you will get the following Pig Latin code:
relation_2 = LOAD 'input' AS (x, y);
relation_1 = GROUP relation_2 BY x;
STORE relation_1 INTO 'output';
More or less the same, don't you think? (Unfortunately, Piglet can't
determine the names of your variables, so the generated relation names
are not fantastic. I might get that working in a future version.)
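For anyone wondering why the variable names get lost: a Ruby object has no straightforward way of knowing which local variable it was assigned to, so a DSL has to invent relation names itself. Here is a toy sketch of that idea. ToyRelation and its counter are hypothetical names made up for this illustration. This is not Piglet's actual implementation, and the numbering order here differs from what Piglet generates.

```ruby
# Toy illustration only: ToyRelation is a hypothetical class, not part of Piglet.
class ToyRelation
  @count = 0 # class-level counter used to generate unique names

  class << self
    attr_accessor :count
  end

  attr_reader :name

  def initialize
    self.class.count += 1
    @name = "relation_#{self.class.count}"
  end
end

a = ToyRelation.new # the object never learns that it is stored in `a`
b = ToyRelation.new

puts a.name # prints relation_1
puts b.name # prints relation_2
```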
I wrote Piglet when some Pig scripts I was working on started to get
very repetitive. I had a relation with a few fields that were keys and
a few that were numbers, and I wanted to get the sums for each value
of each of the key fields. This meant repeating the same GROUP and
FOREACH operations once for each key, even though the only thing that
changed was the name of the field I grouped by. That was frustrating,
so I dreamed up a way of doing the same thing in Ruby. With Piglet I
can now do something like this:
input = load('input', :schema => %w(country browser site
                                    pages_visited visit_duration))
%w(country browser site).each do |dimension|
  grouped = input.group(dimension).foreach do |r|
    [
      r[0],
      r[1].pages_visited.avg,
      r[1].visit_duration.sum
    ]
  end
  store(grouped, "output-#{dimension}")
end
which will be translated to this Pig Latin code:
relation_1 = LOAD 'input' AS (country, browser, site, pages_visited,
visit_duration);
relation_3 = GROUP relation_1 BY country;
relation_2 = FOREACH relation_3 GENERATE $0, AVG($1.pages_visited),
SUM($1.visit_duration);
STORE relation_2 INTO 'output-country';
relation_5 = GROUP relation_1 BY browser;
relation_4 = FOREACH relation_5 GENERATE $0, AVG($1.pages_visited),
SUM($1.visit_duration);
STORE relation_4 INTO 'output-browser';
relation_7 = GROUP relation_1 BY site;
relation_6 = FOREACH relation_7 GENERATE $0, AVG($1.pages_visited),
SUM($1.visit_duration);
STORE relation_6 INTO 'output-site';
where you can see how the loop has been unrolled and all the
operations repeated for each key.
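To make the unrolling less mysterious, here is a minimal, self-contained sketch of the general technique: the Ruby loop runs while the script is being generated, and each DSL call simply appends one Pig Latin statement to a list. ToyScript and all of its methods are hypothetical names invented for this example. It is a sketch of the idea under those assumptions, not how Piglet is actually implemented, and its relation numbering differs from Piglet's.

```ruby
# Toy illustration only: ToyScript is a hypothetical class, not part of Piglet.
class ToyScript
  attr_reader :statements

  def initialize
    @statements = []
    @counter = 0
  end

  # Each call records one statement and returns the generated relation name.
  def load(path, fields)
    emit("LOAD '#{path}' AS (#{fields.join(', ')})")
  end

  def group(relation, field)
    emit("GROUP #{relation} BY #{field}")
  end

  def store(relation, path)
    @statements << "STORE #{relation} INTO '#{path}';"
    nil
  end

  def to_pig_latin
    @statements.join("\n")
  end

  private

  # Generates a fresh relation name and records the assignment statement.
  def emit(expression)
    @counter += 1
    name = "relation_#{@counter}"
    @statements << "#{name} = #{expression};"
    name
  end
end

script = ToyScript.new
input = script.load('input', %w(country browser site))

# The loop is ordinary Ruby, so the generated script contains one
# GROUP/STORE pair per dimension and no loop construct at all.
%w(country browser site).each do |dimension|
  grouped = script.group(input, dimension)
  script.store(grouped, "output-#{dimension}")
end

puts script.to_pig_latin
```

The same mechanism would explain why functions end up inlined: a Ruby method that makes DSL calls just appends its statements each time it is invoked.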
I hope that Piglet will help some of you write DRYer code. It doesn't
solve all problems, and there are things that are not supported at
all yet, but with your help I think it can be a very good companion
to Pig.
If you want to know more, read the documentation on GitHub
(http://github.com/iconara/piglet), or contact me through GitHub,
by e-mail ([email protected]), or via Twitter (@iconara).
yours,
Theo