Hello All,

I would like to submit the design for the Python execution operator before I 
raise the pull request so that I can refine the implementation based on 
feedback. Could you please provide feedback on the design if any and I will 
raise the PR accordingly. 

- This operator is for the JIRA ticket raised here 
https://issues.apache.org/jira/browse/APEXMALHAR-2260 
<https://issues.apache.org/jira/browse/APEXMALHAR-2260>
- The operator embeds a python interpreter in the operator JVM process space 
and is not external to the JVM.
- The implementation is proposing the use of Java Embedded Python ( JEP ) given 
here https://github.com/ninia/jep <https://github.com/ninia/jep>
- The JEP engine is under zlib/libpng license. Since this is an approved 
license under https://www.apache.org/legal/resolved.html#category-a 
<https://www.apache.org/legal/resolved.html#category-a> I am assuming it is ok 
for the community to approve the inclusion of this library  
- Python integration is a messy piece due to the nature of dynamic libraries. 
All python libraries need to be natively installed. This also means we will not 
be able bundle python libraries and dependencies as part of the build into the 
target JVM container. Hence this operator has the current limitation of the 
python binaries installed through an external process on all of the YARN nodes 
for now.
- The JEP maven dependency jar in the POM is a JNI wrapper around the dynamic 
library that is installed externally to the Apex installation process on all of 
the YARN nodes.
- Hope to take up https://issues.apache.org/jira/browse/APEXCORE-796 
<https://issues.apache.org/jira/browse/APEXCORE-796> to solve this issue in the 
future.
- The python operator implementation can be extended to py4J based 
implementation ( as opposed to in-memory model like JEP ) in the future if 
required be. JEP is the implementation based on an in-memory design pattern.
- The python operator allows for 4 major API patterns
    - Execute a method call by accepting parameters to pass to the interpreter
    - Execute a python script as given in a file path
    - Evaluate an expression and allows for passing of variables between the 
java code and the python in-memory interpreter bridge
    - A handy method wherein a series of instructions can be passed in one 
single java call ( executed as a sequence of python eval instructions under the 
hood ) 
- Automatic garbage collection of the variables that are passed from java code 
to the in memory python interpreter
- Support for all major python libraries. Tensorflow, Keras, Scikit, xgboost. 
Preliminary tests for these libraries seem to work as per code here : 
https://github.com/ananthc/sampleapps/tree/master/apache-apex/apexjvmpython 
<https://github.com/ananthc/sampleapps/tree/master/apache-apex/apexjvmpython> 
- The implementation allows for SLA based execution model. i.e. the operator is 
given a chance to execute the python code and if not complete within a time 
out, the operator code returns back null.
- A tuple that has become a straggler as per previous point will automatically 
be drained off to a different port so that downstream operators can still 
consume the straggler if they want to when the results arrive.
- Because of the nature of python being an interpreter and if a previous tuple 
is being still processed, there is chance of a back pressure pattern building 
up very quickly. Hence this operator works on the concept of a worker pool. The 
Python operator uses a configurable number of worker thread each of which embed 
the Python interpreter within their processing space. i.e. it is in fact a 
collection of python ink memory interpreters inside the Python operator 
implementation.
- The operator chooses one of the threads at runtime basing on their busy state 
thus allowing for back-pressure issues to be resolved automatically.
- There is a first class support for Numpy in JEP. Java arrays would be 
convertible to the Python Numpy arrays and vice versa and share the same memory 
addresses for efficiency reasons. 
- The base operator implements dynamic partitioning based on a thread 
starvation policy. At each checkpoint, it checks how much percentage of the 
requests resulted in starved threads and if the starvation exceeds a configured 
percentage, a new instance of the operator is provisioned for every such 
instance of the operator
- The operator provides the notion of a worker execution mode. There are two 
worker modes that are passed in each of the above calls from the user. ALL or 
ANY.  Because python interpreter is state based engine, a newly dynamically 
partitioned operator might not be in the exact state of the remaining 
operators. Hence the operator has this notion of worker execution mode. Any 
call ( any of the 4 calls mentioned above ) called with ALL execution mode will 
be executed on all the workers of the worker thread pool as well as the 
dynamically portioned instance whenever such an instance is provisioned.  
- The base operator implementation has a method that can be overridden to 
implement the logic that needs to be executed for each tuple. The base operator 
default implementation is a simple NO-OP.
- The operator automatically picks up the least busy of the thread pool worker 
which has JEP embedded in it to execute the call. 
- The JEP based installation will not support non Cpython modules. All of the 
major python libraries are cpython based and hence I believe this is of a 
lesser concern. If we hit a roadblock when a new python library being a 
non-Cpython based library needs to be run, then we could implement the 
ApexPythonEngine interface to something like Py4J which involves interprocess 
communication. 
- The python operator requires the user to set the library path 
java.library.path for the operator to make use of the dynamic libraries of the 
corresponding platform. This has to be passed in as the JVM options. Failing to 
do so will result in the operator failing to load the interpreter properly. 
- The supported python versions are 2.7, 3.3 , 3.4 , 3.5 and 3.6. Numpy >= 1.7 
is supported.
- There is no support for virtual environments yet. In case of multiple python 
versions on the node, to include the right python version for the apex 
operator, ensure that the environment variables and the dynamic library path 
are set appropriately. This is a workaround and I hope APEXCORE-796 will solve 
this issue as well. 


Regards,
Ananth 

Reply via email to