[
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841599#action_12841599
]
Woody Anderson commented on PIG-928:
------------------------------------
@Ashutosh
I don't think there is _any_ measurable overhead to the reflection mechanism in
the example I provided. The objects are allocated "a few" times due to the
schema-interrogation logic of Pig (something that might deserve an entirely
separate bug thread of discussion, as I have no idea why X copies of a UDF have
to be allocated for this).
When it comes time to run (i.e. where it really counts), there is a single
invocation of the factory pattern followed by a "huge" (data-set-derived) number
of calls to that function. The UDF that is called is fully built and fully
initialized with final variables etc., facilitating maximally streamlined
execution.
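To make that shape concrete, here is a minimal sketch of the build-once /
call-many pattern (the names ScriptUdf, ScriptUdfFactory and ScriptUdfRegistry
are hypothetical illustrations, not classes from the attached patch):
-----
import java.util.HashMap;
import java.util.Map;

interface ScriptUdf {
    // Hot path: called once per input tuple.
    Object call(Object... args) throws Exception;
}

interface ScriptUdfFactory {
    // Any reflection / engine setup happens here, once per UDF definition.
    ScriptUdf build(String functionName, String source) throws Exception;
}

final class ScriptUdfRegistry {
    private static final Map<String, ScriptUdfFactory> FACTORIES =
            new HashMap<String, ScriptUdfFactory>();

    static void register(String language, ScriptUdfFactory factory) {
        FACTORIES.put(language.toLowerCase(), factory);
    }

    // Consulted a single time when the plan is built; the returned ScriptUdf
    // can hold its engine and function reference in final fields, so the
    // per-tuple cost is just ScriptUdf.call(...).
    static ScriptUdf create(String language, String functionName, String source)
            throws Exception {
        ScriptUdfFactory factory = FACTORIES.get(language.toLowerCase());
        if (factory == null) {
            throw new IllegalArgumentException("no factory registered for: " + language);
        }
        return factory.build(functionName, source);
    }
}
-----
The point is simply that the per-language lookup happens once, while the
per-tuple path is an ordinary virtual call on a fully-constructed object.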
There are certainly things one could question about the approach I took, but
language-selection overhead is not one of them. If you have profiling numbers
that suggest otherwise, I'd be suitably surprised.
A secondary point to the whole idea of needing script-language support beyond,
say, BSF or javax.script is type coercion. BSF/javax.script is not usable in a
drop-in manner: each engine unfortunately consumes and produces objects in its
own object model. If either of these frameworks had bothered to mandate
converting input/output to java.util types, things would at least be easier,
because we could then convert from those to DataBag/Tuple in a unified manner,
but this isn't the case. Thus conversion must be implemented per engine, at
which point a direct PyArray -> Tuple conversion is more appropriate, for
performance reasons, than PyArray -> List -> Tuple.
But even for rudimentary correctness, type conversion must be implemented for
each engine, at which point a wrapping pattern that selects an appropriate
function factory is necessary anyway.
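To make the per-engine point concrete, here is a hedged sketch (not the attached
code) of what a direct Jython-to-Pig conversion could look like, assuming
Jython's PyObject API and Pig's TupleFactory; an equivalent converter has to
exist for each engine, because each one hands back its own object model:
-----
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.python.core.PyObject;

// Hypothetical helper: converts a Jython sequence (e.g. a PyArray/PyList)
// straight into a Pig Tuple, skipping the PyArray -> java.util.List -> Tuple hop.
public final class JythonToPig {
    private static final TupleFactory TUPLE_FACTORY = TupleFactory.getInstance();

    public static Tuple toTuple(PyObject seq) throws Exception {
        int size = seq.__len__();
        Tuple t = TUPLE_FACTORY.newTuple(size);
        for (int i = 0; i < size; i++) {
            // __tojava__ unwraps PyString/PyInteger/... into plain Java objects.
            Object value = seq.__getitem__(i).__tojava__(Object.class);
            t.set(i, value);
        }
        return t;
    }
}
-----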
@Alan/@Dmitriy
Orthogonal to the above point: the idea of trying to support multiple script
languages vs. a few. I am personally not of the same mind as you guys, I guess.
I think there is a near-zero 'overhead' performance cost for supporting some
unspecified language. Languages continually evolve, and new languages emerge
that utilize the JVM better and better. I certainly agree that, at this time,
Jython and JRuby seem the best. However, to say that Clojure or JavaScript, or
whatever, are not going to move forward and potentially become more effectively
integrated with the JVM is a bit premature.
I would make the sacrifice if the ability to support multiple languages was
actually that hard, or had an actual serious performance cost.
I just don't think those two issues are real.
The performance costs come from the individual scripting engines' features with
respect to byte-code compilation, function referencing, string manipulation,
execution caching, etc., and their type-coercion complexities.
That is completely different from the cost of Pig supporting multiple languages.
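As an illustration (again a sketch, not code from the patch): whether an engine
can pre-compile and cache a script body is a per-engine property that
javax.script already exposes through Compilable, and exploiting it is orthogonal
to how many languages Pig chooses to list as supported:
-----
import javax.script.Compilable;
import javax.script.CompiledScript;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

// Sketch: cache the compiled form of a script body when the engine supports it.
// Engines that don't implement Compilable fall back to re-evaluating the source.
public final class EngineCachingExample {
    public static Object run(String language, String source) throws ScriptException {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName(language);
        if (engine == null) {
            throw new IllegalArgumentException("no engine registered for: " + language);
        }
        if (engine instanceof Compilable) {
            CompiledScript compiled = ((Compilable) engine).compile(source);
            return compiled.eval();   // repeated calls can reuse the compiled form
        }
        return engine.eval(source);   // per-call parse/compile cost stays with the engine
    }
}
-----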
Also, supporting multiple languages is not that hard. Arnab has thought about
this, as have I. I think his ideas, while not perfect, offer a good avenue of
exploration and a way forward that integrates Pig with any script language.
Importantly, it puts those languages inside Pig rather than the other way
around, and it allows for multiple interpreter contexts and even multiple
languages.
I'll quote Arnab's quick description here:
-----
DEFINE CMD `SCRIPTBLOCK` script('javascript')
This is identical to the commandline streaming syntax, and follows gracefully
in the style of the "ship" and "cache" keywords.
Thus your javascript example becomes
DEFINE JSBlock `
    function split(a) {
        return a.split(" ");
    }
` script('JAVASCRIPT');
Note the use of backticks is consistent with the current syntax, and is
unlikely to occur in common scripts, so it saves us the escaping. Also
it allows newlines in the code.
The goal is to create namespaces -- you can now call your function as
"JSBlock.split(a)". This allows us to have multiple functions in one block.
-----
This idea, coupled with the ability to register files and directories directly
(e.g. register foo.py;), provides the ability to load code into an arbitrary
namespace/interpreter scope, load it for an arbitrary language, etc.
And the invocation syntax is nice and clean: Block.foo() calls a function named
foo in the interpreter.
To allow the easy invocation syntax to perform well, we would need to cause it
to execute in the same way as:
define spig_split
    org.apache.pig.scripting.Eval('jython','split','b:{tt:(t:chararray)}');
I don't see that as a particularly difficult modification of the function
rationalization logic of Pig. Actually, I think it's a general improvement, as
it cuts down on object allocations.
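As a rough illustration of what an Eval-style wrapper could do under the hood
(a hedged sketch using javax.script rather than a per-engine binding, not the
actual org.apache.pig.scripting.Eval from the patch; the schema-string argument
is left out, and the script source is taken as a constructor string only to keep
the example self-contained):
-----
import java.io.IOException;
import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hedged sketch of an Eval-style wrapper UDF. In the scheme discussed above the
// script text would come from "register foo.py;" or a DEFINE ... `SCRIPTBLOCK`.
public class ScriptEval extends EvalFunc<Object> {
    private final String language;
    private final String functionName;
    private final String source;
    private transient Invocable invocable;   // built once per task, then reused

    public ScriptEval(String language, String functionName, String source) {
        this.language = language;
        this.functionName = functionName;
        this.source = source;
    }

    @Override
    public Object exec(Tuple input) throws IOException {
        try {
            if (invocable == null) {
                ScriptEngine engine = new ScriptEngineManager().getEngineByName(language);
                if (engine == null) {
                    throw new IllegalArgumentException("no JSR-223 engine for: " + language);
                }
                engine.eval(source);            // load the user's function definitions
                invocable = (Invocable) engine; // most JSR-223 engines implement Invocable
            }
            // Hot path: one function call per tuple, no per-call engine lookup.
            // A real implementation would also coerce the engine's native result
            // into Pig types here (see the conversion discussion above).
            return invocable.invokeFunction(functionName, input.getAll().toArray());
        } catch (Exception e) {
            throw new IOException("script UDF '" + functionName + "' failed", e);
        }
    }
}
-----
With that shape, Block.foo(a) could be rewritten during function resolution into
such a wrapper, which gives exactly the "single factory invocation followed by a
huge number of calls" behavior described earlier.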
In the event that this methodology is adopted, you are still free to write
projects that stuff Pig inside Python or Ruby, etc. But Pig itself remains an
environment that plays well with multiple script engines.
Conclusion:
I see it as quite achievable to support any given language with near-zero
overhead above that language's script engine.
I think it's quite doable to do this in a flexible model that allows languages
to be mixed together, even within the same script.
I think that, overall, this is highly preferable to a single-language or
otherwise finite-language situation (though I advocate possibly auto-supporting
Jython/JRuby).
> UDFs in scripting languages
> ---------------------------
>
> Key: PIG-928
> URL: https://issues.apache.org/jira/browse/PIG-928
> Project: Pig
> Issue Type: New Feature
> Reporter: Alan Gates
> Attachments: package.zip, scripting.tgz, scripting.tgz
>
>
> It should be possible to write UDFs in scripting languages such as python,
> ruby, etc. This frees users from needing to compile Java, generate a jar,
> etc. It also opens Pig to programmers who prefer scripting languages over
> Java.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.