Theo, this is awesome. We at Twitter are hoping to contribute to and extend the great work you've done.
Kevin

On Wed, Jan 13, 2010 at 10:01 PM, Theo Hultberg <[email protected]> wrote:

> Please do!
>
> T#
>
> On Thu, Jan 14, 2010 at 12:02 AM, Alan Gates <[email protected]> wrote:
>
> > Theo,
> >
> > This looks really interesting. Can I put a link to it on our page for tools
> > use with Pig, http://wiki.apache.org/pig/PigTools ?
> >
> > Alan.
> >
> > On Jan 13, 2010, at 10:38 AM, Theo Hultberg wrote:
> >
> >> Hi,
> >>
> >> I've written a Ruby DSL for writing Pig scripts, which I hope might
> >> interest some of you. It makes it possible to do a lot of things you
> >> can't do in Pig Latin, like loops, reuse code through functions, and
> >> introspection on relation schemas. Basically you write some Ruby code
> >> that looks a lot like Pig Latin, and you get the equivalent Pig Latin
> >> as output. Loops are unrolled, functions are inlined, and so on.
> >>
> >> There's a lot of documentation and examples on GitHub:
> >> http://github.com/iconara/piglet, and here are a few examples too:
> >>
> >> If you run this Ruby code through Piglet
> >>
> >>     a = load 'input', :schema => [:x, :y]
> >>     b = a.group :x
> >>     store b, 'output'
> >>
> >> you will get the following Pig Latin code:
> >>
> >>     relation_2 = LOAD 'input' AS (x, y);
> >>     relation_1 = GROUP relation_2 BY x;
> >>     STORE relation_1 INTO 'output';
> >>
> >> More or less the same, don't you think? (Piglet can't determine the
> >> names of the variables, unfortunately, thus the relation names are not
> >> fantastic, I might get that working in a future version).
> >>
> >> I wrote Piglet when some Pig scripts I was working on started to get
> >> very repetitive. I had a relation with a few fields that were keys and
> >> a few that were numbers and I wanted to get the sums for each value of
> >> each of the key fields. This meant having to repeat the same GROUP and
> >> FOREACH operations once for each key, even though the only thing that
> >> changed was the name of the field that I grouped by. Having to repeat
> >> the same code again and again for every key was frustrating, and I
> >> dreamed up a way of doing the same thing in Ruby. With Piglet I can
> >> now do something like this:
> >>
> >>     input = load('input', :schema => %w(country browser site pages_visited visit_duration))
> >>
> >>     %w(country browser site).each do |dimension|
> >>       grouped = input.group(dimension).foreach do |r|
> >>         [
> >>           r[0],
> >>           r[1].pages_visited.avg,
> >>           r[1].visit_duration.sum
> >>         ]
> >>       end
> >>
> >>       store(grouped, "output-#{dimension}")
> >>     end
> >>
> >> which will be translated to this Pig Latin code:
> >>
> >>     relation_1 = LOAD 'input' AS (country, browser, site, pages_visited, visit_duration);
> >>     relation_3 = GROUP relation_1 BY country;
> >>     relation_2 = FOREACH relation_3 GENERATE $0, AVG($1.pages_visited), SUM($1.visit_duration);
> >>     STORE relation_2 INTO 'output-country';
> >>     relation_5 = GROUP relation_1 BY browser;
> >>     relation_4 = FOREACH relation_5 GENERATE $0, AVG($1.pages_visited), SUM($1.visit_duration);
> >>     STORE relation_4 INTO 'output-browser';
> >>     relation_7 = GROUP relation_1 BY site;
> >>     relation_6 = FOREACH relation_7 GENERATE $0, AVG($1.pages_visited), SUM($1.visit_duration);
> >>     STORE relation_6 INTO 'output-site';
> >>
> >> where you can see how the loop has been unrolled and all the
> >> operations repeated for each key.
> >>
> >> I hope that Piglet will help some of you write DRYer code. It doesn't
> >> solve all problems, and there are things which are not supported at
> >> all yet, but with your help I think it can be a very good companion to
> >> Pig.
> >>
> >> If you want to know more read the documentation on GitHub:
> >> http://github.com/iconara/piglet, or send me a mail either through
> >> GitHub, to my e-mail ([email protected]), or via Twitter (@iconara).
> >>
> >> yours,
> >> Theo
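
For anyone curious what "unrolling" a loop into Pig Latin amounts to, here is a rough plain-Ruby sketch. It does not use Piglet and says nothing about how Piglet is implemented; it only builds the statements as strings, one GROUP/FOREACH/STORE pass per key field. The helper name `unrolled_pig_latin` and the relation names (`input_rel`, `grouped_*`, `summed_*`) are made up for illustration, and the aggregated fields are hard-coded to the ones from the example in the thread.

```ruby
# Hypothetical helper, not part of Piglet: returns Pig Latin text with one
# GROUP/FOREACH/STORE pass per key field, mimicking an unrolled loop.
def unrolled_pig_latin(input_path, fields, dimensions)
  # A single LOAD statement is shared by every pass.
  lines = ["input_rel = LOAD '#{input_path}' AS (#{fields.join(', ')});"]

  dimensions.each do |dimension|
    # Only the grouping field changes between passes.
    lines << "grouped_#{dimension} = GROUP input_rel BY #{dimension};"
    lines << "summed_#{dimension} = FOREACH grouped_#{dimension} GENERATE " \
             "$0, AVG($1.pages_visited), SUM($1.visit_duration);"
    lines << "STORE summed_#{dimension} INTO 'output-#{dimension}';"
  end

  lines.join("\n")
end

fields = %w(country browser site pages_visited visit_duration)
puts unrolled_pig_latin('input', fields, %w(country browser site))
```

Naming each relation after its key field sidesteps the generated `relation_N` names, but only because this sketch knows the field names up front; a general DSL cannot assume that.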
