Github user jihoonson commented on the pull request:

    https://github.com/apache/tajo/pull/526#issuecomment-91983935
  
    Ok. I think that this patch is ready for review.
    
    To address @hyunsik's comment, I added a class called ```FunctionInvoke```. 
This class describes how the functions are executed.
    
    On executing Python scripts, I used the approach of using an external UDF 
controller that is responsible for executing python scripts as commented above. 
When a submitted query involves one or more python UDFs, several UDF 
controllers are executed to compute UDFs. Input/output tuples are transmitted 
via stdio. This approach may have an issue on performance, but I think it is 
inevitable without using Jython.
    
    Currently, the controller is executed for each Python functions. That is, 
if a query involves 5 Python functions even some of them are same, at least 5 
different controllers are executed during query processing. I chose this 
architecture due to its simplicity. 
    
    Here are some highlights of changes.
    * ```AnyDatum``` is used to support Python's dynamic typing. 
    * ```PythonScriptEngine``` is responsible for maintaining the external 
controller process. To reduce overhead, the controller should be forked only 
when UDFs are actually evaluated. In this patch, there are three points where 
the controller is forked.
     * Constant folding optimization in Tajo master: During constant folding, 
some UDFs can be evaluated. If necessary, controllers are forked and 
immediately destroyed after evaluation.
     * Non-from query execution in Tajo master. If the query involves Python 
UDFs, controllers are forked during query processing.
     * Task execution in worker: If the plan of a stage involves Python UDFs, 
controllers are forked (destroyed) when a task starts up (shuts down). Due to 
the simplicity, I chose this architecture rather than sharing controllers among 
multiple tasks via ```ExecutionBlockSharedResource```.
    * Refactoring the ```EvalNode::bind()``` function. This function now 
receives ```EvalContext``` in addition to ```Schema```. ```EvalContext``` can 
contain some information given at runtime such as ```ScriptEngine``` started by 
each task.
    
    For reviewers, I apologize for a large patch. But many changes are related 
to just refactoring of the bind() function and renaming some functions. 
    Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

Reply via email to