[ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PIG-928:
------------------------------

    Attachment: pyg.tgz

Hi,
I'm attaching something I implemented last year. I cleaned it up and updated 
the dependency to Pig 0.6.0 for the occasion.
There's probably some overlap with previous posts, sorry about the late 
submission.
Here is my approach.
I wanted to make a few things easier:
 - writing programs that require multiple calls to pig
 - UDFs
 - parameter passing to Pig
So I integrated Pig with Jython so that the whole program (main program, UDFs, 
Pig scripts) could be in one python script.
example: python/tc.py in the attachment

The script defines Python functions that are available as UDFs to pig 
automatically. The decorator @outputSchema is an easy way to specify what the 
output schema of the UDF is.
example (see script): @outputSchema("relationships:{t:(target:chararray, 
candidate:chararray)}")
Also notice that the UDFs use the standard Python constructs: tuple, dictionary 
and list. They are converted to Pig constructs on the fly, which makes 
defining UDFs in Python very easy. Notice that a UDF takes a list of 
arguments, not a tuple: the input tuple is automatically mapped to the 
arguments.
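
To illustrate the shape of such a UDF, here is a minimal sketch. It is not 
taken from tc.py: the outputSchema stand-in below only records the schema 
string so that the snippet runs on its own (the real decorator comes from the 
Pyg runtime and may behave differently), and the candidates function and its 
arguments are hypothetical.

```python
# Stand-in for the decorator provided by the runtime (assumption: the real
# one may do more). This version only attaches the schema to the function.
def outputSchema(schema):
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap


@outputSchema("relationships:{t:(target:chararray, candidate:chararray)}")
def candidates(source, targets):
    # The input tuple is spread over the arguments: `source` is a string,
    # `targets` is a bag, represented here as a list of 1-field tuples.
    # The result is a bag of (target, candidate) tuples, as a plain list.
    return [(target, source) for (target,) in targets]


print(candidates("a", [("b",), ("c",)]))
```

The function body only touches Python lists, tuples and strings; no Pig 
classes are referenced.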

Then the script defines a main() function that will be the main program 
executed on the client.
Inside main(), the Python program has access to a global pig variable that 
provides two methods (for now) and is designed to be an equivalent of PigServer:
 - List<ExecJob> executeScript(String script)
   executes a Pig script in-lined in Python
 - deleteFile(String filename)
   deletes a file
This looks a little bit like the JDBC approach where you "query" Pig and then 
can process the data.
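
A rough sketch of what a main() using these two methods might look like. The 
StubPig class is a hypothetical stand-in for the global pig variable so that 
the snippet runs without Pig; the file paths and the Pig Latin statements are 
made up for illustration.

```python
class StubPig(object):
    """Hypothetical stand-in for the global `pig` object; it only records calls."""

    def __init__(self):
        self.calls = []

    def executeScript(self, script):
        self.calls.append(("executeScript", script))
        return []  # the real method is described as returning List<ExecJob>

    def deleteFile(self, filename):
        self.calls.append(("deleteFile", filename))


pig = StubPig()  # in the real runtime this global is provided for you


def main():
    out = "output/copy"  # hypothetical output location
    pig.deleteFile(out)  # clear the output of a previous run
    jobs = pig.executeScript("""
        A = LOAD 'input/data' AS (a:chararray, b:chararray);
        STORE A INTO '%s';
    """ % out)
    return jobs


main()
```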

Also, you can embed Python expressions in the Pig statements using ${ ... }
example: ${n - 1}
They get executed in the current scope and replaced in the script. 
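
One way such a substitution could work is sketched below. This is an 
assumption about the mechanism, not the code from the attachment: the 
substitute helper is hypothetical, and it would not handle expressions that 
themselves contain a closing brace.

```python
import re

# Matches ${ ... } and captures the expression text (non-greedy).
_EXPR = re.compile(r"\$\{(.*?)\}")


def substitute(script, scope):
    """Replace each ${expr} in `script` with str() of expr evaluated in `scope`."""
    return _EXPR.sub(lambda m: str(eval(m.group(1), {}, scope)), script)


n = 3
print(substitute("STORE A INTO 'out_${n - 1}';", {"n": n}))
# prints: STORE A INTO 'out_2';
```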

To run the example (assuming javac, jar and java are in your PATH):
 - tar xzvf pyg.tgz
 - add pig-0.6.0-core.jar to the lib folder
 - ./makejar.sh
 - ./runme.sh

It runs the following:
org.apache.pig.pyg.Pyg local tc.py

tc.py is a Python script that performs a transitive closure on a list of 
relations using an iterative algorithm. It defines python functions
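
For readers unfamiliar with the technique, here is the iterative idea in plain 
Python, without Pig. This is only a sketch of the general algorithm (join the 
relation with itself until no new pairs appear), not the contents of tc.py; 
the function name is made up.

```python
def transitive_closure(pairs):
    """Return the transitive closure of a collection of (source, target) pairs."""
    closure = set(pairs)
    while True:
        # Self-join: (a, b) and (b, d) together imply (a, d).
        new = set((a, d) for (a, b) in closure for (c, d) in closure if b == c)
        if new <= closure:
            # Fixed point reached: the last iteration added nothing.
            return closure
        closure |= new


print(sorted(transitive_closure([("a", "b"), ("b", "c"), ("c", "d")])))
```

In a Pig-driven version, each pass of the loop would presumably correspond to 
one call to Pig, which is the "multiple calls to pig" use case.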

Limitations:
 - you cannot include other Python scripts, but this should be doable.
 - I haven't spent much time testing performance. I suspect the Pig<->Python 
type conversion is a little slow, as it creates many new objects. It could 
possibly be improved by making the Pig objects implement the Python interfaces.

(the attachment contains jython.jar 2.5.0 for simplicity)

Best regards, Julien

> UDFs in scripting languages
> ---------------------------
>
>                 Key: PIG-928
>                 URL: https://issues.apache.org/jira/browse/PIG-928
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Alan Gates
>         Attachments: package.zip, pyg.tgz, scripting.tgz, scripting.tgz
>
>
> It should be possible to write UDFs in scripting languages such as python, 
> ruby, etc.  This frees users from needing to compile Java, generate a jar, 
> etc.  It also opens Pig to programmers who prefer scripting languages over 
> Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
