[
https://issues.apache.org/jira/browse/TAJO-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491320#comment-14491320
]
ASF GitHub Bot commented on TAJO-1344:
--------------------------------------
Github user jihoonson commented on the pull request:
https://github.com/apache/tajo/pull/526#issuecomment-91983935
Ok. I think that this patch is ready for review.
To address @hyunsik's comment, I added a class called ```FunctionInvoke```.
This class describes how the functions are executed.
To execute Python scripts, I used an external UDF controller that is
responsible for running them, as discussed in the comments above.
When a submitted query involves one or more Python UDFs, several UDF
controllers are launched to compute them. Input/output tuples are transmitted
via stdio. This approach may hurt performance, but I think that is
unavoidable without using Jython.
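As a rough illustration of the stdio exchange described above (not the code in this patch), a controller loop might look like the sketch below. The wire format (one tab-separated tuple of integers per line), the `serve` helper, and the `add_one` UDF are all assumptions made for the example.

```python
import io


def add_one(x):
    """Hypothetical UDF used only for this illustration."""
    return x + 1


def serve(udf, reader, writer):
    """Read one tab-separated input tuple per line and write one result per line.

    The line-oriented integer format is an assumption; the real controller
    may use a different encoding.
    """
    for line in reader:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        args = [int(field) for field in line.split("\t")]
        writer.write(f"{udf(*args)}\n")
        writer.flush()  # flush so the parent process sees each result promptly


# In a real controller, reader/writer would be sys.stdin/sys.stdout.
# In-memory streams are used here so the sketch runs on its own.
out = io.StringIO()
serve(add_one, io.StringIO("1\n41\n"), out)
```

Flushing after every tuple keeps the parent from blocking on a buffered pipe, which is also one source of the per-tuple overhead mentioned above.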
Currently, a controller is executed for each Python function. That is,
if a query involves 5 Python functions, even if some of them are the same, at
least 5 different controllers are executed during query processing. I chose
this architecture for its simplicity.
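The trade-off can be sketched with a small count, assuming a query is represented simply as a list of UDF names (a hypothetical representation). `controllers_per_call` mirrors the behavior described above; `controllers_if_shared` is a hypothetical alternative that this patch does not implement.

```python
def controllers_per_call(udf_calls):
    """One controller per function occurrence, as described in this comment."""
    return len(udf_calls)


def controllers_if_shared(udf_calls):
    """Hypothetical alternative: one controller per distinct function."""
    return len(set(udf_calls))


# Five UDF occurrences, three of which refer to the same function.
calls = ["f", "g", "f", "h", "f"]
per_call = controllers_per_call(calls)
shared = controllers_if_shared(calls)
```

Here `per_call` is 5 and `shared` is 3; the simpler design pays for the two extra processes.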
Here are some highlights of the changes.
* ```AnyDatum``` is used to support Python's dynamic typing.
* ```PythonScriptEngine``` is responsible for maintaining the external
controller process. To reduce overhead, the controller should be forked only
when UDFs are actually evaluated. In this patch, there are three points where
the controller is forked.
    * Constant folding optimization in Tajo master: During constant folding,
      some UDFs can be evaluated. If necessary, controllers are forked and
      destroyed immediately after evaluation.
    * Non-from query execution in Tajo master: If the query involves Python
      UDFs, controllers are forked during query processing.
    * Task execution in worker: If the plan of a stage involves Python UDFs,
      controllers are forked when a task starts up and destroyed when it shuts
      down. For simplicity, I chose this architecture rather than sharing
      controllers among multiple tasks via ```ExecutionBlockSharedResource```.
* Refactoring the ```EvalNode::bind()``` function. This function now
receives an ```EvalContext``` in addition to the ```Schema```. The
```EvalContext``` can carry information that is only available at runtime,
such as the ```ScriptEngine``` started by each task.
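The per-task lifecycle and the extended `bind()` can be sketched together as follows. This is an illustrative Python model of the design, not the Java code in the patch: the class bodies, the method names `start`/`call`/`shutdown`, the order of the `bind()` parameters, and the doubling controller are all assumptions.

```python
import subprocess
import sys

# Child program standing in for a UDF controller: doubles each integer on stdin.
CONTROLLER_SRC = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(str(int(line) * 2) + '\\n')\n"
    "    sys.stdout.flush()\n"
)


class ScriptEngine:
    """Hypothetical stand-in for PythonScriptEngine: owns one controller process."""

    def __init__(self):
        self.proc = None

    def start(self):
        # Fork the external controller and talk to it over pipes.
        self.proc = subprocess.Popen(
            [sys.executable, "-c", CONTROLLER_SRC],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

    def call(self, value):
        self.proc.stdin.write(f"{value}\n")
        self.proc.stdin.flush()
        return int(self.proc.stdout.readline())

    def shutdown(self):
        self.proc.stdin.close()  # EOF ends the controller's read loop
        self.proc.wait()
        self.proc.stdout.close()
        self.proc = None


class EvalContext:
    """Runtime information handed to bind(); here only the script engine."""

    def __init__(self, engine):
        self.engine = engine


class PythonUdfEval:
    """Simplified eval node that applies the controller to one column."""

    def __init__(self, column):
        self.column = column

    def bind(self, ctx, schema):
        # The schema resolves the column; the context supplies the running engine.
        self.index = schema.index(self.column)
        self.engine = ctx.engine
        return self

    def eval(self, row):
        return self.engine.call(row[self.index])


def run_task(rows, schema, column):
    engine = ScriptEngine()
    engine.start()  # controller is forked when the task starts up
    try:
        node = PythonUdfEval(column).bind(EvalContext(engine), schema)
        return [node.eval(row) for row in rows]
    finally:
        engine.shutdown()  # controller is destroyed when the task shuts down


results = run_task([(1, 10), (2, 20)], ["id", "score"], "score")
```

Passing the engine through the context keeps eval nodes free of process management, at the cost of one controller per task instead of a shared one.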
For reviewers: I apologize for the large patch, but many of the changes are
just the refactoring of the bind() function and the renaming of some functions.
Thanks.
> Python UDF support
> ------------------
>
> Key: TAJO-1344
> URL: https://issues.apache.org/jira/browse/TAJO-1344
> Project: Tajo
> Issue Type: New Feature
> Components: function/udf
> Reporter: Hyunsik Choi
> Assignee: Jihoon Son
> Fix For: 0.11.0
>
> Attachments: TAJO-1344.patch, TAJO-1344_2.patch, TAJO-1344_3.patch,
> TAJO-1344_4.patch
>
>
> Python has abundant users and third-party libraries. The language is widely
> used in data analytics. So, it would be great if Tajo supported Python UDFs.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)