Hi,

I've written a Ruby DSL for writing Pig scripts, which I hope might interest some of you. It makes it possible to do a lot of things you can't do in Pig Latin, like writing loops, reusing code through functions, and introspecting relation schemas. Basically you write some Ruby code that looks a lot like Pig Latin, and you get the equivalent Pig Latin as output. Loops are unrolled, functions are inlined, and so on.
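To give a feel for why loops come out unrolled: the loop is ordinary Ruby, so it simply runs, and every DSL call made inside it records one more Pig Latin statement. The toy sketch below shows only that idea. It is not Piglet's actual implementation, and ToyScript and ToyRelation are made-up names that exist nowhere in Piglet; the relation numbering is also simplified compared to what Piglet really emits.

```ruby
# Toy sketch only: NOT Piglet's implementation. ToyScript and ToyRelation
# are hypothetical names invented for this illustration.

# A handle to a named relation; operations on it record new statements.
class ToyRelation
  attr_reader :name

  def initialize(script, name)
    @script = script
    @name = name
  end

  # Records a GROUP statement and returns the relation it defines.
  def group(field)
    @script.define { |new_name| "#{new_name} = GROUP #{@name} BY #{field};" }
  end
end

# Collects Pig Latin statements in the order the Ruby code executes.
class ToyScript
  attr_reader :statements

  def initialize
    @statements = []
    @counter = 0
  end

  # Allocates the next relation name, records the statement the block
  # builds for that name, and returns a handle to the new relation.
  def define
    @counter += 1
    name = "relation_#{@counter}"
    @statements << yield(name)
    ToyRelation.new(self, name)
  end

  def load(path, fields)
    define { |name| "#{name} = LOAD '#{path}' AS (#{fields.join(', ')});" }
  end

  def store(relation, path)
    @statements << "STORE #{relation.name} INTO '#{path}';"
  end

  def to_pig_latin
    @statements.join("\n")
  end
end

script = ToyScript.new
input = script.load('input', %w(country browser))

# A plain Ruby loop: each iteration records its own GROUP and STORE,
# so the generated script contains the "unrolled" statements.
%w(country browser).each do |dimension|
  script.store(input.group(dimension), "output-#{dimension}")
end

puts script.to_pig_latin
```

Running the sketch prints one LOAD statement followed by a GROUP and a STORE for each of the two dimensions, five statements in all, even though GROUP and STORE each appear only once in the Ruby source.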
There's a lot of documentation and examples on GitHub: http://github.com/iconara/piglet, and here are a few examples too.

If you run this Ruby code through Piglet

    a = load 'input', :schema => [:x, :y]
    b = a.group :x
    store b, 'output'

you will get the following Pig Latin code:

    relation_2 = LOAD 'input' AS (x, y);
    relation_1 = GROUP relation_2 BY x;
    STORE relation_1 INTO 'output';

More or less the same, don't you think? (Unfortunately Piglet can't determine the names of the Ruby variables, so the generated relation names are not fantastic. I might get that working in a future version.)

I wrote Piglet when some Pig scripts I was working on started to get very repetitive. I had a relation with a few fields that were keys and a few that were numbers, and I wanted to get the sums for each value of each of the key fields. This meant repeating the same GROUP and FOREACH operations once for each key, even though the only thing that changed was the name of the field I grouped by. Repeating the same code again and again for every key was frustrating, and I dreamed up a way of doing the same thing in Ruby.
With Piglet I can now do something like this:

    input = load('input', :schema => %w(country browser site pages_visited visit_duration))
    %w(country browser site).each do |dimension|
      grouped = input.group(dimension).foreach do |r|
        [
          r[0],
          r[1].pages_visited.avg,
          r[1].visit_duration.sum
        ]
      end
      store(grouped, "output-#{dimension}")
    end

which will be translated to this Pig Latin code:

    relation_1 = LOAD 'input' AS (country, browser, site, pages_visited, visit_duration);
    relation_3 = GROUP relation_1 BY country;
    relation_2 = FOREACH relation_3 GENERATE $0, AVG($1.pages_visited), SUM($1.visit_duration);
    STORE relation_2 INTO 'output-country';
    relation_5 = GROUP relation_1 BY browser;
    relation_4 = FOREACH relation_5 GENERATE $0, AVG($1.pages_visited), SUM($1.visit_duration);
    STORE relation_4 INTO 'output-browser';
    relation_7 = GROUP relation_1 BY site;
    relation_6 = FOREACH relation_7 GENERATE $0, AVG($1.pages_visited), SUM($1.visit_duration);
    STORE relation_6 INTO 'output-site';

Here you can see how the loop has been unrolled and all the operations repeated for each key.

I hope that Piglet will help some of you write DRYer code. It doesn't solve all problems, and there are things that are not supported at all yet, but with your help I think it can be a very good companion to Pig.

If you want to know more, read the documentation on GitHub: http://github.com/iconara/piglet, or send me a mail through GitHub, to my e-mail ([email protected]), or via Twitter (@iconara).

yours,
Theo
