Please do! T#
On Thu, Jan 14, 2010 at 12:02 AM, Alan Gates <[email protected]> wrote:

> Theo,
>
> This looks really interesting.  Can I put a link to it on our page for tools
> for use with Pig, http://wiki.apache.org/pig/PigTools ?
>
> Alan.
>
> On Jan 13, 2010, at 10:38 AM, Theo Hultberg wrote:
>
>> Hi,
>>
>> I've written a Ruby DSL for writing Pig scripts, which I hope might
>> interest some of you. It makes it possible to do a lot of things you
>> can't do in Pig Latin, like loops, code reuse through functions, and
>> introspection on relation schemas. Basically you write some Ruby code
>> that looks a lot like Pig Latin, and you get the equivalent Pig Latin
>> as output. Loops are unrolled, functions are inlined, and so on.
>>
>> There's a lot of documentation and examples on GitHub:
>> http://github.com/iconara/piglet, and here are a few examples too:
>>
>> If you run this Ruby code through Piglet
>>
>>   a = load 'input', :schema => [:x, :y]
>>   b = a.group :x
>>   store b, 'output'
>>
>> you will get the following Pig Latin code:
>>
>>   relation_2 = LOAD 'input' AS (x, y);
>>   relation_1 = GROUP relation_2 BY x;
>>   STORE relation_1 INTO 'output';
>>
>> More or less the same, don't you think? (Piglet can't determine the
>> names of the variables, unfortunately, so the relation names are not
>> fantastic; I might get that working in a future version.)
>>
>> I wrote Piglet when some Pig scripts I was working on started to get
>> very repetitive. I had a relation with a few fields that were keys and
>> a few that were numbers, and I wanted to get the sums for each value of
>> each of the key fields. This meant having to repeat the same GROUP and
>> FOREACH operations once for each key, even though the only thing that
>> changed was the name of the field that I grouped by. Having to repeat
>> the same code again and again for every key was frustrating, and I
>> dreamed up a way of doing the same thing in Ruby. With Piglet I can
>> now do something like this:
>>
>>   input = load('input', :schema => %w(country browser site
>>     pages_visited visit_duration))
>>
>>   %w(country browser site).each do |dimension|
>>     grouped = input.group(dimension).foreach do |r|
>>       [
>>         r[0],
>>         r[1].pages_visited.avg,
>>         r[1].visit_duration.sum
>>       ]
>>     end
>>
>>     store(grouped, "output-#{dimension}")
>>   end
>>
>> which will be translated to this Pig Latin code:
>>
>>   relation_1 = LOAD 'input' AS (country, browser, site, pages_visited,
>>     visit_duration);
>>   relation_3 = GROUP relation_1 BY country;
>>   relation_2 = FOREACH relation_3 GENERATE $0, AVG($1.pages_visited),
>>     SUM($1.visit_duration);
>>   STORE relation_2 INTO 'output-country';
>>   relation_5 = GROUP relation_1 BY browser;
>>   relation_4 = FOREACH relation_5 GENERATE $0, AVG($1.pages_visited),
>>     SUM($1.visit_duration);
>>   STORE relation_4 INTO 'output-browser';
>>   relation_7 = GROUP relation_1 BY site;
>>   relation_6 = FOREACH relation_7 GENERATE $0, AVG($1.pages_visited),
>>     SUM($1.visit_duration);
>>   STORE relation_6 INTO 'output-site';
>>
>> where you can see how the loop has been unrolled and all the
>> operations repeated for each key.
>>
>> I hope that Piglet will help some of you write DRYer code. It doesn't
>> solve all problems, and there are things which are not supported at
>> all yet, but with your help I think it can be a very good companion to
>> Pig.
>>
>> If you want to know more, read the documentation on GitHub:
>> http://github.com/iconara/piglet, or send me a mail either through
>> GitHub, to my e-mail ([email protected]), or via Twitter (@iconara).
>>
>> yours,
>> Theo
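
For anyone curious how "loops are unrolled" can fall out of a Ruby DSL almost for free, here is a minimal sketch. It is not Piglet's actual implementation or API: the MiniPig class, its method names, and the sequential relation numbering are all made up for illustration. The idea is simply that each DSL call appends one Pig Latin statement to a list, so an ordinary Ruby loop that makes the calls several times leaves several statements behind.

```ruby
# Hypothetical mini-DSL for illustration only; NOT Piglet's real code or API.
class MiniPig
  # A relation is just a handle carrying its generated Pig Latin alias.
  Relation = Struct.new(:name)

  attr_reader :statements

  def initialize
    @statements = []
    @counter = 0
  end

  # Records a LOAD statement and returns a handle to the new relation.
  def load(path, fields)
    emit("LOAD '#{path}' AS (#{fields.join(', ')})")
  end

  # Records a GROUP statement over an existing relation.
  def group(relation, field)
    emit("GROUP #{relation.name} BY #{field}")
  end

  # Records a STORE statement; it produces no new relation.
  def store(relation, path)
    @statements << "STORE #{relation.name} INTO '#{path}';"
    nil
  end

  # Joins everything recorded so far into one Pig Latin script.
  def to_pig_latin
    @statements.join("\n")
  end

  private

  # Assigns the next alias (relation_1, relation_2, ...) and records the statement.
  def emit(expression)
    @counter += 1
    relation = Relation.new("relation_#{@counter}")
    @statements << "#{relation.name} = #{expression};"
    relation
  end
end

script = MiniPig.new
visits = script.load('input', %w(country browser site))

# A plain Ruby loop: every iteration records its own GROUP and STORE,
# so the loop shows up "unrolled" in the generated script.
%w(country browser site).each do |dimension|
  grouped = script.group(visits, dimension)
  script.store(grouped, "output-#{dimension}")
end

puts script.to_pig_latin
```

The sketch numbers relations in the order they are created, which differs from the numbering in Piglet's output quoted above; it leaves out FOREACH, schemas, and everything else a real tool would need.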
