[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Le Dem updated PIG-928:
------------------------------

    Attachment: pyg.tgz

Hi,

I'm attaching something I implemented last year. I cleaned it up and updated the dependency to Pig 0.6.0 for the occasion. There's probably some overlap with previous posts; sorry about the late submission.

Here is my approach. I wanted to make a couple of things easier:
- writing programs that require multiple calls to Pig
- UDFs
- parameter passing to Pig

So I integrated Pig with Jython so that the whole program (main program, UDFs, Pig scripts) can live in one Python script.
Example: python/tc.py in the attachment.

The script defines Python functions that are automatically available to Pig as UDFs. The @outputSchema decorator is an easy way to specify the output schema of a UDF.
Example (see script):
@outputSchema("relationships:{t:(target:chararray, candidate:chararray)}")

Also notice that the UDFs use the standard Python constructs: tuple, dictionary and list. They are converted to Pig constructs on the fly. This makes defining UDFs in Python very easy. Notice that a UDF takes a list of arguments, not a tuple: the input tuple is automatically mapped to the arguments.

The script then defines a main() function that is the main program executed on the client. In main(), the Python program has access to a global variable, pig, that provides two methods (for now) and is designed to be an equivalent of PigServer:
- List<ExecJob> executeScript(String script): executes a Pig script in-lined in Python
- deleteFile(String filename): deletes a file

This looks a little like the JDBC approach, where you "query" Pig and can then process the data.

You can also embed Python expressions in the Pig statements using ${ ... }
Example: ${n - 1}
They are executed in the current scope and replaced in the script.
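For readers without the attachment at hand, here is a rough plain-Python sketch of the two mechanisms described above: a decorator that records an output schema on a function, and ${ ... } substitution evaluated against a scope. This is not the code in pyg.tgz. The decorator body, the attribute name it sets, and the names pair_up and substitute are assumptions made for illustration only.

```python
import re


def outputSchema(schema):
    """Hypothetical stand-in: record a Pig schema string on the function."""
    def wrap(func):
        # Assumption: the schema is stored as a plain attribute on the function.
        func.outputSchema = schema
        return func
    return wrap


@outputSchema("relationships:{t:(target:chararray, candidate:chararray)}")
def pair_up(target, candidates):
    # The UDF receives its inputs as separate arguments and returns
    # ordinary Python lists and tuples.
    return [(target, c) for c in candidates]


# Matches ${ ... } and captures the expression between the braces.
_EXPR = re.compile(r"\$\{([^}]*)\}")


def substitute(script, scope):
    """Replace each ${expr} in script with str() of expr evaluated in scope."""
    return _EXPR.sub(lambda m: str(eval(m.group(1), {}, scope)), script)


if __name__ == "__main__":
    n = 3
    print(pair_up.outputSchema)
    print(substitute("STORE r INTO 'out_${n - 1}';", {"n": n}))
```

In the real integration the substituted script would be passed to pig.executeScript(...); here it is only printed, since the pig variable exists only inside the Jython runner.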
To run the example (assuming javac, jar and java are in your PATH):
- tar xzvf pyg.tgz
- add pig-0.6.0-core.jar to the lib folder
- ./makejar.sh
- ./runme.sh

It runs the following:
org.apache.pig.pyg.Pyg local tc.py

tc.py is a Python script that computes the transitive closure of a list of relations using an iterative algorithm. It defines Python functions.

Limitations:
- You cannot include other Python scripts, but this should be doable.
- I haven't spent much time testing performance. I suspect the Pig<->Python type conversion is a little slow, as it creates many new objects. It could possibly be improved by making the Pig objects implement the Python interfaces.

(The attachment contains jython.jar 2.5.0 for simplicity.)

Best regards,
Julien

> UDFs in scripting languages
> ---------------------------
>
> Key: PIG-928
> URL: https://issues.apache.org/jira/browse/PIG-928
> Project: Pig
> Issue Type: New Feature
> Reporter: Alan Gates
> Attachments: package.zip, pyg.tgz, scripting.tgz, scripting.tgz
>
>
> It should be possible to write UDFs in scripting languages such as python,
> ruby, etc. This frees users from needing to compile Java, generate a jar,
> etc. It also opens Pig to programmers who prefer scripting languages over
> Java.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.