Github user jihoonson commented on the pull request:
https://github.com/apache/tajo/pull/526#issuecomment-91983935
Ok. I think that this patch is ready for review.
To address @hyunsik's comment, I added a class called ```FunctionInvoke```.
This class describes how the functions are executed.
On executing Python scripts, I used the approach of using an external UDF
controller that is responsible for executing python scripts as commented above.
When a submitted query involves one or more python UDFs, several UDF
controllers are executed to compute UDFs. Input/output tuples are transmitted
via stdio. This approach may have an issue on performance, but I think it is
inevitable without using Jython.
Currently, the controller is executed for each Python functions. That is,
if a query involves 5 Python functions even some of them are same, at least 5
different controllers are executed during query processing. I chose this
architecture due to its simplicity.
Here are some highlights of changes.
* ```AnyDatum``` is used to support Python's dynamic typing.
* ```PythonScriptEngine``` is responsible for maintaining the external
controller process. To reduce overhead, the controller should be forked only
when UDFs are actually evaluated. In this patch, there are three points where
the controller is forked.
* Constant folding optimization in Tajo master: During constant folding,
some UDFs can be evaluated. If necessary, controllers are forked and
immediately destroyed after evaluation.
* Non-from query execution in Tajo master. If the query involves Python
UDFs, controllers are forked during query processing.
* Task execution in worker: If the plan of a stage involves Python UDFs,
controllers are forked (destroyed) when a task starts up (shuts down). Due to
the simplicity, I chose this architecture rather than sharing controllers among
multiple tasks via ```ExecutionBlockSharedResource```.
* Refactoring the ```EvalNode::bind()``` function. This function now
receives ```EvalContext``` in addition to ```Schema```. ```EvalContext``` can
contain some information given at runtime such as ```ScriptEngine``` started by
each task.
For reviewers, I apologize for a large patch. But many changes are related
to just refactoring of the bind() function and renaming some functions.
Thanks.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---