[
https://issues.apache.org/jira/browse/PIG-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jonathan Coveney updated PIG-2317:
----------------------------------
Attachment: pigjruby.rb
PigUdf.rb
jruby_scripting_5.patch
OK! I have made a lot of progress on this. Algebraic Ruby UDFs are WORKING;
you just have to make sure to set -Xmx1024m (or more). Pigs are not lean
creatures :)
My to-dos:
- Allow for in-line UDFs
- Expose an accumulator interface (should be easy peasy)
- Think of a better way to handle DataBags in Ruby, given that currently tuples
and databags are the same structure... I think I know what to do, but need to
ruminate.
- Make it compatible with 0.9.1 at least
- Write tests (is there a way to force a UDF which implements Algebraic and
Accumulator to be run as an Accumulator UDF, or as a normal EvalFunc? This is
important for testing)
- Anything you guys think would be useful
One thing that I did to get this to work was to create an abstract
"AccEvalFunc" and "AlgEvalFunc," where if you define the pieces of that
interface, you get the lower levels for free. So in the case of the Ruby
algebraic UDF, by simply defining "initial," "intermed," and "final," you get
accumulator and exec for free and don't have to putz around defining them. I
think this should at least go in the piggybank, but it needs some eyeballs and
suggestions.
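To make the "lower levels for free" idea concrete, here is a rough sketch in plain Ruby. It is only an illustration and not the code in the patch: the class names AlgSketch and CountSketch and the methods exec, accumulate and get_value are hypothetical stand-ins, and only the names initial, intermed and final come from the description above.

```ruby
# Hypothetical stand-in for an abstract algebraic base class: subclasses
# define initial, intermed and final; exec and accumulate are derived.
class AlgSketch
  # Plain exec: run all three phases over a whole bag (an Array here).
  def exec(bag)
    partials = bag.map { |item| initial(item) }
    final([intermed(partials)])
  end

  # Accumulator-style interface: fold one batch at a time into @partial.
  def accumulate(batch)
    partials = batch.map { |item| initial(item) }
    partials.unshift(@partial) unless @partial.nil?
    @partial = intermed(partials)
  end

  def get_value
    final([@partial])
  end
end

# A count defined only in terms of the three algebraic pieces.
class CountSketch < AlgSketch
  def initial(_item)
    1
  end

  def intermed(partials)
    partials.inject(:+)
  end

  def final(partials)
    intermed(partials)
  end
end

counter = CountSketch.new
puts counter.exec(%w[a b c d])   # 4
counter.accumulate(%w[a b])
counter.accumulate(%w[c d e])
puts counter.get_value           # 5
```

The subclass never mentions exec or accumulate, which is the point of the abstract classes: only the three phases have to be written.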
To allay confusion, I'm going to inline "pigjruby.rb" and explain how it works.
{code}
require 'PigUdf'

Helloworld = PigUdf.evalfunc("word:chararray") do
  "Hello, world"
end
{code}
First off, it's important to require 'PigUdf' so Ruby can work its magic. To
make simple UDF definitions easy, you can declare functions in this way. You
cannot have a schema which refers to a function, however, and the UDF name
MUST begin with a capital letter.
{code}
Complex = PigUdf.evalfunc("word:chararray,num:long") do |word|
  [word.to_s, word.length]
end
{code}
Conveniently, you can ask for as many parameters as you like. Varargs aren't
supported yet, and they're not super high on the list, but if you want them,
let me know.
{code}
Divbythree = PigUdf.filterfunc do |num|
  num % 3 == 0
end
{code}
Much like an evalfunc, you can easily make a filterfunc.
{code}
class Myudfs < PigUdf
  outputSchema "val:long"
  def cumsum num
    @x ||= 0
    @x += num
  end

  outputSchemaFunction :squareSchema
  def square num
    return num**2
  end

  def squareSchema input
    return input
  end

  filterFunc
  def divbytwo input
    input % 2 == 0
  end
end
{code}
For more complicated UDFs, declaring a class that extends PigUdf is the way to
go. In order to declare a UDF, you have to declare its schema. There are two
options: outputSchema "schema" will set that schema for the next function you
define and register it as a UDF, or you can do outputSchema :funcname,
"schema" to register funcname as a UDF with schema as its schema. If your
function has a schema dependent on the input, then you can use
outputSchemaFunction :funcname, and the next defined function will be
registered as a UDF with funcname as its schema function. Or, as above, you can
do outputSchemaFunction :functoregister, :funcname. If you do filterFunc, the
next function is a filterFunc.
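For anyone wondering how a declaration can apply to "the next function you define," here is one way it could be done in plain Ruby, using the standard method_added hook. This is a sketch under assumptions and not necessarily how PigUdf.rb does it: SchemaSketch, MySketch, the registered table and the helper method are hypothetical names, and outputSchemaFunction is left out to keep it short.

```ruby
# Hypothetical sketch of "the next function you define gets this schema".
# SchemaSketch and its registry are illustrative, not part of PigUdf.rb.
class SchemaSketch
  class << self
    # Per-class table of method name => how it was registered.
    def registered
      @registered ||= {}
    end

    # One argument: remember the schema for the next method defined.
    # Two arguments: register the named method right away.
    def outputSchema(name_or_schema, schema = nil)
      if schema
        registered[name_or_schema] = { :schema => schema }
      else
        @pending = { :schema => name_or_schema }
      end
    end

    # Mark the next method defined as a filter function.
    def filterFunc
      @pending = { :filter => true }
    end

    # Ruby calls this hook every time an instance method is defined.
    def method_added(name)
      return if @pending.nil?
      registered[name] = @pending
      @pending = nil
    end
  end
end

class MySketch < SchemaSketch
  outputSchema "val:long"
  def cumsum(num)
    @x ||= 0
    @x += num
  end

  # No declaration precedes this method, so it is not registered.
  def helper(num)
    num
  end

  filterFunc
  def divbytwo(input)
    input % 2 == 0
  end
end

p MySketch.registered.keys   # [:cumsum, :divbytwo]
```

Because the pending declaration is cleared once it has been used, a plain helper method defined afterwards stays private to the class and is not exposed as a UDF.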
{code}
class COUNT < AlgPigUdf
  outputSchema "val:long"

  class Initial
    def exec item
      1
    end
  end

  class Intermed
    def exec items
      items.flatten.inject(:+)
    end
  end

  class Final < Intermed
  end
end
{code}
This is the algebraic interface. The class name will be the UDF name in Pig.
Initial's exec function will be passed one item at a time. You do not have to
deal with the "tuple that contains a bag that contains a tuple that contains
an item" madness.
Intermed's exec function will be passed a "databag" that contains "tuples,"
which means it will be a list of lists of items (which is why I flatten items
before summing). You return the item, and it will be put in a tuple for you.
Final's exec function will do the exact same thing as Intermed's, except the
output won't be wrapped in a tuple. This is why, in the examples, it just
extends Intermed.
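The "list of lists" shape that Intermed sees can be tried out in plain Ruby without Pig. This is just an illustration with made-up values, using Arrays as stand-ins for tuples and databags:

```ruby
# What Initial returned for three input rows: each result sits in its own
# "tuple" (an inner Array), collected in a "databag" (the outer Array).
items = [[1], [1], [1]]

flat = items.flatten     # [1, 1, 1]
total = flat.inject(:+)
puts total               # 3

# Edge case worth knowing: inject(:+) on an empty list returns nil,
# while passing a seed value returns the seed instead.
p [].inject(:+)          # nil
p [].inject(0, :+)       # 0
```

Note that flatten is recursive, so if the items themselves were arrays they would be flattened too; flatten(1) removes only one level of nesting.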
{code}
class SUM < AlgPigUdf
  outputSchema "val:long"

  class Initial
    def exec item
      item
    end
  end

  class Intermed
    def exec items
      items.flatten.inject(:+)
    end
  end

  class Final < Intermed
  end
end
{code}
This example works exactly like the one above, except it sums the values
rather than counting the rows.
{code}
class WORDCOUNT < AlgPigUdf
  outputSchema "val:long"

  class Initial
    def exec item
      item ? item.split.length : 0
    end
  end

  class Intermed
    def exec items
      items.flatten.inject(:+)
    end
  end

  class Final < Intermed
  end
end
{code}
Of course, what would any set of examples be without a word count?
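The per-row logic in WORDCOUNT's Initial can be checked in plain Ruby. The sample lines below are made up, and words_in is a hypothetical lambda that just copies the body of Initial's exec:

```ruby
# Same expression as Initial's exec: count whitespace-separated words,
# treating a nil (null) row as zero words.
words_in = lambda { |item| item ? item.split.length : 0 }

lines = ["hello world", nil, "  pigs are not lean  ", ""]
counts = lines.map { |line| words_in.call(line) }
p counts                  # [2, 0, 4, 0]

# Summing the per-row counts is what Intermed and Final do.
puts counts.inject(:+)    # 6
```

With no arguments, split ignores leading and trailing whitespace and collapses runs of spaces, so padded or empty lines do not inflate the count.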
Please let me know if you run into bugs, or have any suggestions on the code
itself, the interface, etc.
> Ruby/Jruby UDFs
> ---------------
>
> Key: PIG-2317
> URL: https://issues.apache.org/jira/browse/PIG-2317
> Project: Pig
> Issue Type: New Feature
> Reporter: Jacob Perkins
> Assignee: Jacob Perkins
> Priority: Minor
> Fix For: 0.9.2
>
> Attachments: PigUdf.rb, PigUdf.rb, jruby_scripting.patch,
> jruby_scripting_2_real.patch, jruby_scripting_3.patch,
> jruby_scripting_4.patch, jruby_scripting_5.patch, pigjruby.rb, pigjruby.rb
>
>
> It should be possible to write UDFs in Ruby. These UDFs will be registered in
> the same way as python and javascript UDFs.