[
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841599#action_12841599
]
Woody Anderson commented on PIG-928:
------------------------------------
@Ashutosh
I don't think there is _any_ measurable overhead to the reflection mechanism in
the example I provided. The objects are allocated "a few" times due to the
schema-interrogation logic of Pig (something that might deserve an entirely
separate bug thread of discussion, as I have no idea why X copies of a UDF have
to be allocated for this).
When it comes time to run (i.e. where it really counts), there is a single
invocation of the factory pattern followed by a "huge" (data-set-derived) number
of calls to that function. The UDF that is called is fully built and fully
initialized with final variables etc., facilitating maximally streamlined
execution.
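To make that shape concrete, here is a minimal sketch of the build-once /
call-many pattern (the names ScriptUdf, ScriptUdfFactory and ScriptUdfRegistry
are hypothetical illustrations, not classes from the attached patch):
-----
import java.util.HashMap;
import java.util.Map;

interface ScriptUdf {
    // Hot path: called once per input tuple.
    Object call(Object... args) throws Exception;
}

interface ScriptUdfFactory {
    // Any reflection / engine setup happens here, once per UDF definition.
    ScriptUdf build(String functionName, String source) throws Exception;
}

final class ScriptUdfRegistry {
    private static final Map<String, ScriptUdfFactory> FACTORIES =
            new HashMap<String, ScriptUdfFactory>();

    static void register(String language, ScriptUdfFactory factory) {
        FACTORIES.put(language.toLowerCase(), factory);
    }

    // Consulted a single time when the plan is built; the returned ScriptUdf
    // can hold its engine and function reference in final fields, so the
    // per-tuple cost is just ScriptUdf.call(...).
    static ScriptUdf create(String language, String functionName, String source)
            throws Exception {
        ScriptUdfFactory factory = FACTORIES.get(language.toLowerCase());
        if (factory == null) {
            throw new IllegalArgumentException("no factory registered for: " + language);
        }
        return factory.build(functionName, source);
    }
}
-----
The point is simply that the per-language lookup happens once, while the
per-tuple path is an ordinary virtual call on a fully-constructed object.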
There are certainly things one could question about the approach I took, but
language-selection overhead is not one of them. If you have profiling numbers
that suggest otherwise, I'd be suitably surprised.
A secondary point to the whole idea of needing script-language support beyond,
say, BSF or javax.script is type coercion. BSF/javax.script is not usable in a
drop-in manner: each engine unfortunately consumes and produces objects in its
own object model. If either of these frameworks had bothered to mandate
converting input/output to java.util types, things would at least be easier,
because we could then convert from those to DataBag/Tuple in a unified manner,
but this isn't the case. Thus conversion must be implemented per engine, at
which point a direct PyArray -> Tuple conversion is more appropriate, for
performance reasons, than PyArray -> List -> Tuple.
But even for rudimentary correctness, type conversion must be implemented for
each engine, at which point a wrapping pattern that selects an appropriate
function factory is necessary anyway.
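To make the per-engine point concrete, here is a hedged sketch (not the attached
code) of what a direct Jython-to-Pig conversion could look like, assuming
Jython's PyObject API and Pig's TupleFactory; an equivalent converter has to
exist for each engine, because each one hands back its own object model:
-----
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.python.core.PyObject;

// Hypothetical helper: converts a Jython sequence (e.g. a PyArray/PyList)
// straight into a Pig Tuple, skipping the PyArray -> java.util.List -> Tuple hop.
public final class JythonToPig {
    private static final TupleFactory TUPLE_FACTORY = TupleFactory.getInstance();

    public static Tuple toTuple(PyObject seq) throws Exception {
        int size = seq.__len__();
        Tuple t = TUPLE_FACTORY.newTuple(size);
        for (int i = 0; i < size; i++) {
            // __tojava__ unwraps PyString/PyInteger/... into plain Java objects.
            Object value = seq.__getitem__(i).__tojava__(Object.class);
            t.set(i, value);
        }
        return t;
    }
}
-----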
@Alan/@Dmitriy
Orthogonal to the above point: the idea of trying to support multiple script
languages vs. a few. I am personally not of the same mind as you guys, I guess.
I think there is a near-zero 'overhead' performance cost for supporting some
unspecified language. Languages continually evolve, and new languages emerge
that utilize the JVM better and better. I certainly agree that, at this time,
Jython and JRuby seem the best. However, to say that Clojure or JavaScript, or
whatever, are not going to move forward and potentially become more effectively
integrated with the JVM is a bit premature.
I would make the sacrifice if the ability to support multiple languages was
actually that hard, or had an actual serious performance cost.
I just don't think those two issues are real.
The performance costs come from the individual scripting engines' features with
respect to byte-code compilation, function referencing, string manipulation,
execution caching, etc., and their type-coercion complexities.
That is completely different from the cost of Pig supporting multiple languages.
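As an illustration (again a sketch, not code from the patch): whether an engine
can pre-compile and cache a script body is a per-engine property that
javax.script already exposes through Compilable, and exploiting it is orthogonal
to how many languages Pig chooses to list as supported:
-----
import javax.script.Compilable;
import javax.script.CompiledScript;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

// Sketch: cache the compiled form of a script body when the engine supports it.
// Engines that don't implement Compilable fall back to re-evaluating the source.
public final class EngineCachingExample {
    public static Object run(String language, String source) throws ScriptException {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName(language);
        if (engine == null) {
            throw new IllegalArgumentException("no engine registered for: " + language);
        }
        if (engine instanceof Compilable) {
            CompiledScript compiled = ((Compilable) engine).compile(source);
            return compiled.eval();   // repeated calls can reuse the compiled form
        }
        return engine.eval(source);   // per-call parse/compile cost stays with the engine
    }
}
-----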
Also, supporting multiple languages is not that hard. Arnab has thought about
this, as have I. I think his ideas, while not perfect, offer a good avenue of
exploration and a way forward that integrates Pig with any script language.
Importantly, it puts those languages inside Pig rather than the other way
around, and it allows for multiple interpreter contexts and even multiple
languages.
I'll quote Arnab's quick description here:
-----
DEFINE CMD `SCRIPTBLOCK` script('javascript')
This is identical to the commandline streaming syntax, and follows gracefully
in the style of the "ship" and "cache" keywords.
Thus your javascript example becomes
DEFINE JSBlock `
    function split(a) {
        return a.split(" ");
    }
` script('JAVASCRIPT');
Note the use of backticks is consistent with the current syntax, and is
unlikely to occur in common scripts, so it saves us the escaping. Also
it allows newlines in the code.
The goal is to create namespaces -- you can now call your function as
"JSBlock.split(a)". This allows us to have multiple functions in one block.
-----
This idea, coupled with the ability to register files and directories directly
(e.g. register foo.py;), provides the ability to load code into an arbitrary
namespace/interpreter scope, load it for an arbitrary language, etc.
And the invocation syntax is nice and clean: Block.foo() calls a function named
foo in the interpreter.
To allow the easy invocation syntax to perform well, we would need to cause it
to execute in the same way as:
define spig_split
    org.apache.pig.scripting.Eval('jython','split','b:{tt:(t:chararray)}');
I don't see that as a particularly difficult modification of the function
rationalization logic of Pig. Actually, I think it's a general improvement, as
it cuts down on object allocations.
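As a rough illustration of what an Eval-style wrapper could do under the hood
(a hedged sketch using javax.script rather than a per-engine binding, not the
actual org.apache.pig.scripting.Eval from the patch; the schema-string argument
is left out, and the script source is taken as a constructor string only to keep
the example self-contained):
-----
import java.io.IOException;
import javax.script.Invocable;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hedged sketch of an Eval-style wrapper UDF. In the scheme discussed above the
// script text would come from "register foo.py;" or a DEFINE ... `SCRIPTBLOCK`.
public class ScriptEval extends EvalFunc<Object> {
    private final String language;
    private final String functionName;
    private final String source;
    private transient Invocable invocable;   // built once per task, then reused

    public ScriptEval(String language, String functionName, String source) {
        this.language = language;
        this.functionName = functionName;
        this.source = source;
    }

    @Override
    public Object exec(Tuple input) throws IOException {
        try {
            if (invocable == null) {
                ScriptEngine engine = new ScriptEngineManager().getEngineByName(language);
                if (engine == null) {
                    throw new IllegalArgumentException("no JSR-223 engine for: " + language);
                }
                engine.eval(source);            // load the user's function definitions
                invocable = (Invocable) engine; // most JSR-223 engines implement Invocable
            }
            // Hot path: one function call per tuple, no per-call engine lookup.
            // A real implementation would also coerce the engine's native result
            // into Pig types here (see the conversion discussion above).
            return invocable.invokeFunction(functionName, input.getAll().toArray());
        } catch (Exception e) {
            throw new IOException("script UDF '" + functionName + "' failed", e);
        }
    }
}
-----
With that shape, Block.foo(a) could be rewritten during function resolution into
such a wrapper, which gives exactly the "single factory invocation followed by a
huge number of calls" behavior described earlier.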
In the event that this methodology is adopted, you are still free to write
projects that stuff Pig inside Python or Ruby, etc. But Pig itself remains an
environment that plays well with multiple script engines.
Conclusion:
I see it as quite achievable to support any given language with near-zero
overhead above that language's script engine.
I think it's quite doable to do this in a flexible model that allows languages
to be mixed together, even within the same script.
I think that, overall, this is highly preferable to a single-language or
otherwise finite-language situation (though I advocate possibly auto-supporting
Jython/JRuby).
> UDFs in scripting languages
> ---------------------------
>
> Key: PIG-928
> URL: https://issues.apache.org/jira/browse/PIG-928
> Project: Pig
> Issue Type: New Feature
> Reporter: Alan Gates
> Attachments: package.zip, scripting.tgz, scripting.tgz
>
>
> It should be possible to write UDFs in scripting languages such as python,
> ruby, etc. This frees users from needing to compile Java, generate a jar,
> etc. It also opens Pig to programmers who prefer scripting languages over
> Java.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.