[ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12842062#action_12842062
 ] 

Woody Anderson commented on PIG-928:
------------------------------------

Java reflection is very doable, it's kind of a pain i guess, but you could 
definitely do it. I think using BeanShell might be a way to use java syntax if 
you want to, but jython and jruby also are quite good at allowing you to call 
java code very easily and naturally.
What kind of reflection system are you thinking of? passing a string as input 
to some function? or finding some way to assume you can make certain method 
calls on the objects that represent the various data objects in pig? e.g. 
$0.split("."), assuming $0 is a chararray/string.
or are you thinking something that equates to:
def splitter java.util.regex.Pattern("\.");
A = foreach B generate splitter.split($0);

to have it perform at 'peak', you'd need to wrap the reflection into the 
constructor and cache the java.lang.reflect.Method object.
it wouldn't be too hard to write (the assumed impl uses constructor args to 
determine the correct Method via reflection):
def split org.apache.pig.scripting.Eval('reflect', 'java.util.regex.Pattern', 
'split', "\.", 'String', 'b:{tt:(t:chararray)}');
A = foreach B generate split($0);
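
As a rough illustration of the "resolve it in the constructor" idea, here is a 
minimal plain-Java sketch. CachedReflectCall is a hypothetical name, not part 
of Pig or of the attached code, and it leaves out everything Pig-specific 
(EvalFunc, tuples, the output schema); it only shows the Method being looked 
up once and kept in a final field so exec() does nothing but invoke it.

```java
import java.lang.reflect.Method;

// Hypothetical helper (not a Pig class): the reflective lookup happens once,
// in the constructor, and the resulting Method is cached in a final field.
final class CachedReflectCall {
    private final Object target;  // e.g. a compiled java.util.regex.Pattern
    private final Method method;  // resolved once, reused for every call

    CachedReflectCall(Object target, String methodName, Class<?>... argTypes)
            throws NoSuchMethodException {
        this.target = target;
        // getMethod only finds public methods with exactly these parameter types.
        this.method = target.getClass().getMethod(methodName, argTypes);
    }

    // Stands in for exec(): no lookup here, just the invocation.
    Object exec(Object... args) throws ReflectiveOperationException {
        return method.invoke(target, args);
    }
}
```

With a target of Pattern.compile("\\.") and the method "split" taking a 
CharSequence, exec("a.b.c") would return the String[] produced by 
Pattern.split.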

to be more 'generic' but less performant, you could do it more like this (the 
assumed impl uses less info to simply reflect a particular object):
def split org.apache.pig.scripting.Eval('reflect', 'java.util.regex.Pattern', 
'split', "\.");
A = foreach B generate split('split', $0);

the issue here is that each invocation has to determine the correct Method 
object (after the first it's probably highly cacheable), also since the method 
might change as a result of a different name or different args, the lookup 
might also produce a different output schema. At any rate, i think you could 
write reasonably performant caching code for this solution, but it'd be more 
complicated and a tad slower than the former approach.
Mainly i've tried in all of my impls to do as little as possible in the exec() 
method, and try to make most objects in use final and immutable (e.g. build 
them all in the constructor).
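
For the more generic variant, a sketch of the per-invocation lookup with a 
cache might look like the following. Again this is plain Java with a 
hypothetical class name (GenericReflectCall), not Pig code. It assumes 
non-null arguments and does not try to match primitive parameter types, and it 
ignores the output-schema question entirely.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper (not a Pig class): the method is chosen per call,
// but each (name, argument types) combination is only resolved once.
final class GenericReflectCall {
    private final Object target;
    private final Map<List<Object>, Method> cache = new ConcurrentHashMap<>();

    GenericReflectCall(Object target) {
        this.target = target;
    }

    Object exec(String methodName, Object... args) throws ReflectiveOperationException {
        // Cache key: the method name followed by the runtime class of each argument.
        List<Object> key = new ArrayList<>();
        key.add(methodName);
        for (Object arg : args) {
            key.add(arg.getClass());
        }
        Method m = cache.get(key);
        if (m == null) {
            m = findCompatible(methodName, args);  // slow path, first call only
            cache.put(key, m);
        }
        return m.invoke(target, args);
    }

    // Linear scan over the public methods for one whose parameters accept the args.
    private Method findCompatible(String name, Object[] args) throws NoSuchMethodException {
        for (Method candidate : target.getClass().getMethods()) {
            if (!candidate.getName().equals(name)
                    || candidate.getParameterCount() != args.length) {
                continue;
            }
            Class<?>[] params = candidate.getParameterTypes();
            boolean matches = true;
            for (int i = 0; i < params.length; i++) {
                if (!params[i].isInstance(args[i])) {
                    matches = false;
                    break;
                }
            }
            if (matches) {
                return candidate;
            }
        }
        throw new NoSuchMethodException(name);
    }
}
```

The extra work per call is building the key and one map lookup, which is where 
the "a bit slower than the former approach" cost comes from in this sketch.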

you could of course go so far as to delay the creation of the actual Pattern 
object (i.e. where you first present the split pattern "\."). Again, it lends 
itself to performance degrading coding patterns, but if you're careful with 
your actions, i think you could get most of it back with appropriately cached 
objects. Doing this in a completely generic fashion.. i'll think about it i 
guess, i think there's more overhead here than in the other approaches, but if 
your lib function is more than 'split', the overhead might not be noticeable. 
Of course, you could implement each of these abstraction levels and use them 
judiciously.
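
To make the "delay the creation of the Pattern" option concrete, here is a 
small sketch under the same caveats: LazySplitter is a hypothetical name, not 
Pig code, and the compile counter exists only to make the caching visible. The 
regex arrives with each call, and the compiled Pattern is reused as long as 
the regex string does not change. It is not thread-safe as written.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not a Pig class): the Pattern is built on first use
// and rebuilt only when a different regex string shows up.
final class LazySplitter {
    private String lastRegex;     // regex string behind the cached Pattern
    private Pattern lastPattern;  // cached compiled Pattern
    private int compileCount;     // illustration only: how often we compiled

    String[] exec(String regex, String input) {
        if (!regex.equals(lastRegex)) {
            lastPattern = Pattern.compile(regex);
            lastRegex = regex;
            compileCount++;
        }
        return lastPattern.split(input);
    }

    int compileCount() {
        return compileCount;
    }
}
```

If the regex is effectively constant across rows, the per-call overhead is one 
string comparison; if it changes on every row, this degrades to compiling each 
time, which is the kind of pattern to watch out for.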

anyway, there are a lot of options here, are these in line with what you were 
thinking?

> UDFs in scripting languages
> ---------------------------
>
>                 Key: PIG-928
>                 URL: https://issues.apache.org/jira/browse/PIG-928
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Alan Gates
>         Attachments: package.zip, scripting.tgz, scripting.tgz
>
>
> It should be possible to write UDFs in scripting languages such as python, 
> ruby, etc.  This frees users from needing to compile Java, generate a jar, 
> etc.  It also opens Pig to programmers who prefer scripting languages over 
> Java.
