[ 
https://issues.apache.org/jira/browse/PIG-2317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Coveney updated PIG-2317:
----------------------------------

    Attachment: pigjruby.rb
                PigUdf.rb
                jruby_scripting_5.patch

Ok! I have made a lot of progress on this. Algebraic Ruby UDFs are WORKING; 
you just have to make sure to set -Xmx1024m (or more). Pigs are not lean 
creatures :)

My to-dos:
- Allow for in-line UDFs
- Expose an accumulator interface (should be easy peasy)
- Think of a better way to handle DataBags in Ruby, given that currently tuples 
and databags are the same structure... I think I know what to do, but need to 
ruminate.
- Make it compatible with 0.9.1 at least
- Write tests (is there a way to force a UDF which implements Algebraic and 
Accumulator to be run as an Accumulator UDF, or a normal EvalFunc? This is 
important for testing)
- Anything you guys think would be useful

One thing that I did to get this to work was to generate an abstract 
"AccEvalFunc" and "AlgEvalFunc," where if you define the pieces of that 
interface, you get the lower levels for free. So in the case of the ruby 
algebraic UDF, by simply defining "initial" "intermed" and "final," you get 
accumulator and exec for free and don't have to putz around defining them. I 
think this should at least go in the piggybank, but it needs some eyeballs and 
suggestions.
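
To make the "lower levels for free" idea concrete, here is a minimal plain-Ruby sketch. It is not the AlgEvalFunc from the patch: the class names AlgSketch and CountSketch and the method names are made up for illustration, and bags are modelled as plain arrays.

```ruby
# Hypothetical stand-in for the abstract class; not the code in the patch.
# Subclasses define initial(item), intermed(partials) and final(partials).
class AlgSketch
  # Plain exec path: run all three stages locally over the whole bag.
  def exec(bag)
    partials = bag.map { |item| initial(item) }
    final([intermed(partials)])
  end

  # Accumulator path: fold each batch into a running partial result.
  def accumulate(batch)
    partial = intermed(batch.map { |item| initial(item) })
    @acc = @acc.nil? ? partial : intermed([@acc, partial])
  end

  # Result of everything accumulated so far.
  def value
    final(@acc.nil? ? [] : [@acc])
  end

  # Reset state between groups.
  def cleanup
    @acc = nil
  end
end

# Only the three algebraic pieces are written by hand.
class CountSketch < AlgSketch
  def initial(_item)
    1
  end

  def intermed(partials)
    partials.inject(0, :+)
  end

  alias final intermed
end
```

With this shape, CountSketch.new.exec(bag) and the accumulate/value pair both reuse the same three pieces, which is the property described above.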

To allay confusion, I'm going to inline "pigjruby.rb" and explain how it works.

{code}
require 'PigUdf'

Helloworld=PigUdf.evalfunc("word:chararray") do
  "Hello, world"
end
{code}
First off, it's important to require 'PigUdf' so Ruby can work its magic. In 
order to facilitate easy UDF definitions, you can declare functions in this 
way. You cannot have a schema which refers to a function, however, and the UDF 
name MUST begin with a capital letter.

{code}
Complex=PigUdf.evalfunc("word:chararray,num:long") do |word|
  [word.to_s,word.length]
end
{code}
Conveniently, you can ask for as many parameters as you like. Varargs aren't 
supported yet, and it's not super high on the list but if you want it, let me 
know.
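
For readers wondering how a block-based declaration like this can work, here is a rough sketch in plain Ruby. It is only an assumption about the general technique, not the contents of PigUdf.rb: the module UdfSketch, the Func struct and its methods are hypothetical names.

```ruby
# Hypothetical stand-in for the real module; names are illustrative only.
module UdfSketch
  # Holds the schema string together with the block that does the work.
  Func = Struct.new(:schema, :block) do
    # Number of parameters the block asks for (no varargs in this sketch).
    def arity
      block.arity
    end

    # Invoke the block, insisting on exactly the declared parameter count.
    def call(*args)
      raise ArgumentError, "expected #{arity} arguments" unless args.length == arity
      block.call(*args)
    end
  end

  def self.evalfunc(schema, &block)
    Func.new(schema, block)
  end
end

# Assigning to a capitalized name creates a Ruby constant.
Helloworld = UdfSketch.evalfunc("word:chararray") { "Hello, world" }
Complex = UdfSketch.evalfunc("word:chararray,num:long") { |word| [word.to_s, word.length] }
```

The block's arity is what tells the wrapper how many parameters were asked for, which is one way "as many parameters as you like" could be supported without varargs.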

{code}
Divbythree=PigUdf.filterfunc do |num|
  num%3==0
end
{code}
Much like an evalfunc, you can easily make a filterfunc.

{code}
class Myudfs < PigUdf
  outputSchema "val:long"
  def cumsum num
    @x||=0
    @x+=num
  end
  
  outputSchemaFunction :squareSchema
  def square num
    return num**2
  end

  def squareSchema input
    return input
  end
  
  filterFunc
  def divbytwo input
    input%2==0
  end
end
{code}
For more complicated UDFs, declaring a class that extends PigUdf is the way to 
go. In order to declare a UDF, you have to declare its schema. There are two 
options: outputSchema "schema" will set that schema for the next function you 
define and register it as a UDF, or you can do outputSchema :funcname, 
"schema" to register funcname as a UDF with schema as its schema. If your 
function has a schema dependent on the input, then you can use 
outputSchemaFunction :funcname, and the next defined function will be 
registered as a UDF with funcname as its schema function. Or, as above, you can 
do outputSchemaFunction :functoregister, :funcname. If you do filterFunc, the 
next function is a filterFunc.
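
The "applies to the next function you define" behaviour can be built on Ruby's method_added hook. The following is a sketch of that general technique only; PigUdf.rb may do it differently, and SchemaSketch, MyudfsSketch and the udfs registry are hypothetical names.

```ruby
# Hypothetical sketch of "declare, then define"; not the code in PigUdf.rb.
class SchemaSketch
  class << self
    # Registered UDFs for this class: method name => schema string.
    def udfs
      @udfs ||= {}
    end

    # One argument: remember the schema for the next method defined.
    # Two arguments: register the named method right away.
    def outputSchema(first, second = nil)
      if second
        udfs[first] = second
      else
        @pending_schema = first
      end
    end

    # Ruby calls this hook each time an instance method is defined.
    def method_added(name)
      return unless @pending_schema
      udfs[name] = @pending_schema
      @pending_schema = nil
    end
  end
end

class MyudfsSketch < SchemaSketch
  outputSchema "val:long"
  def cumsum(num)
    @x ||= 0
    @x += num
  end

  # No schema is pending here, so this method is not registered.
  def helper; end

  outputSchema :double, "val:long"
  def double(num)
    num * 2
  end
end
```

After the class body runs, MyudfsSketch.udfs holds entries for cumsum and double but not helper.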

{code}
class COUNT < AlgPigUdf
  outputSchema "val:long"

  class Initial
    def exec item
      1
    end
  end
  
  class Intermed
    def exec items
      items.flatten.inject(:+)
    end
  end
  
  class Final < Intermed
  end
end
{code}

This is the algebraic interface. The class name will be the UDF name in Pig.

Initial's exec function will be passed one item at a time. You do not have to 
deal with the tuple that contains a bag that contains a tuple that contains an 
item madness.

Intermed's exec function will be passed a "databag" that contains "tuples," 
which means it will be a list of lists of items (which is why I flatten items 
before summing). You return the item, and it will be put in a tuple for you.

Final's exec function will do the exact same thing as Intermed, except that the 
output won't be wrapped in a Tuple. This is why, in the examples, it just 
extends Intermed.
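
The shapes described above can be tried out in plain Ruby with no Pig involved. This is only a simulation: the split into two partial results is an assumed scenario, and the lambdas stand in for the exec methods.

```ruby
# Plain-Ruby simulation of the data shapes; arrays stand in for bags and tuples.
rows = ["a", "b", "c", "d", "e"]

# Initial sees one item at a time.
initial = ->(_item) { 1 }

# Intermed and Final see a list of single-item lists (a "bag" of "tuples").
combine = ->(items) { items.flatten.inject(:+) }

# Pretend the input was handled in two separate chunks.
partial_a = combine.call(rows[0, 3].map { |r| [initial.call(r)] })
partial_b = combine.call(rows[3, 2].map { |r| [initial.call(r)] })

# The partial results arrive wrapped the same way, so the same code applies.
total = combine.call([[partial_a], [partial_b]])

# An empty bag has nothing to fold, so inject(:+) returns nil rather than 0.
empty = combine.call([])
```

One thing this makes visible: inject(:+) on an empty list returns nil, so a UDF that can see an empty bag may want inject(0, :+) instead.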

{code}
class SUM < AlgPigUdf
  outputSchema "val:long"
  
  class Initial
    def exec item
      item
    end
  end
  
  class Intermed
    def exec items
      items.flatten.inject(:+)
    end
  end
  
  class Final < Intermed
  end
end
{code}
This example works exactly like the above, except it sums the values rather 
than counting the rows.
{code}
class WORDCOUNT < AlgPigUdf
  outputSchema "val:long"

  class Initial
    def exec item
      item ? item.split.length : 0
    end
  end

  class Intermed
    def exec items
      items.flatten.inject(:+)
    end
  end

  class Final < Intermed
  end
end
{code}
Of course, what would any set of examples be without a word count example?

Please let me know if you run into bugs, or have any suggestions on the code 
itself, the interface, etc.
                
> Ruby/Jruby UDFs
> ---------------
>
>                 Key: PIG-2317
>                 URL: https://issues.apache.org/jira/browse/PIG-2317
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Jacob Perkins
>            Assignee: Jacob Perkins
>            Priority: Minor
>             Fix For: 0.9.2
>
>         Attachments: PigUdf.rb, PigUdf.rb, jruby_scripting.patch, 
> jruby_scripting_2_real.patch, jruby_scripting_3.patch, 
> jruby_scripting_4.patch, jruby_scripting_5.patch, pigjruby.rb, pigjruby.rb
>
>
> It should be possible to write UDFs in Ruby. These UDFs will be registered in 
> the same way as python and javascript UDFs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
