Github user justinuang commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-141117878
The solution with the iterator wrapper was the first approach I prototyped
(http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html).
It's dangerous because there is buffering at many levels, and any combination
of those buffers filling up can leave us in a deadlock:
- NEW: the ForkingIterator's LinkedBlockingDeque
- batching the rows before pickling them
- os buffers on both sides
- pyspark.serializers.BatchedSerializer
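A toy model of those layers shows the danger (the layer names mirror the list above, but the capacities and the forward-on-full behavior are purely illustrative, not Spark's actual buffer sizes): rows can sit stranded in an intermediate buffer while the reader at the far end has seen nothing, so a blocking read there stalls.

```python
def push(layers, row):
    """Append a row to the first layer; each batching layer forwards its
    contents to the next layer only once it is full (illustrative model)."""
    layers[0]["buf"].append(row)
    for here, there in zip(layers, layers[1:]):
        if len(here["buf"]) >= here["cap"]:
            there["buf"].extend(here["buf"])
            here["buf"].clear()

# Hypothetical capacities for the buffering layers between the JVM writer
# and the Python reader.
layers = [
    {"name": "java pickle batch", "cap": 4, "buf": []},
    {"name": "os pipe", "cap": 8, "buf": []},
    {"name": "BatchedSerializer", "cap": 4, "buf": []},
]
for row in range(5):
    push(layers, row)

# Every row written so far is still stuck in some intermediate buffer; the
# reader past the last layer has received nothing, so it would block.
stranded = sum(len(layer["buf"]) for layer in layers)
```

If the writer is simultaneously blocked on a full buffer of its own, neither side can make progress without an explicit flush.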
We can avoid deadlock by being very disciplined. For example, the
ForkingIterator could always check whether the LinkedBlockingDeque is full,
and if so:
Java
- flush the java pickling buffer
- send a flush command to the python process
- os.flush the java side
Python
- flush BatchedSerializer
- os.flush()
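A minimal single-process sketch of that discipline (all names here are hypothetical stand-ins: a `queue.Queue` for the LinkedBlockingDeque, and a counter in place of the actual flush commands sent across the pipe):

```python
import queue

class ForkingIterator:
    """Sketch: fork rows into a bounded buffer, but never block on it.

    When the buffer is full we trigger the flush protocol instead of
    blocking, so rows stuck in intermediate buffers get pushed through.
    """

    def __init__(self, source, capacity=4):
        self.source = source
        # Stand-in for the LinkedBlockingDeque.
        self.buffer = queue.Queue(maxsize=capacity)
        self.flushes = 0

    def _flush_downstream(self):
        # In the real system this would flush the Java pickling buffer,
        # send a flush command to the Python worker (which flushes its
        # BatchedSerializer), and os.flush both sides of the pipe.
        self.flushes += 1
        while not self.buffer.empty():
            self.buffer.get_nowait()

    def __iter__(self):
        for row in self.source:
            if self.buffer.full():
                self._flush_downstream()
            self.buffer.put_nowait(row)
            yield row

it = ForkingIterator(range(10), capacity=4)
out = list(it)
```

The point is only the shape of the check: the producer never issues a blocking put while downstream buffers may be holding the data the consumer is waiting for.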
I'm not sure that the performance regression for a single UDF is going to
hit many people. For one, most upstreams are not a range() call, which never
has to go back to disk and deserialize. My personal opinion is that blocking
performance shouldn't be the reason we reject this approach; the added
complexity should be.
If we want a quick fix that is safe, I would be in favor of passing the
row through, which is indeed slower, but better than deadlocking or computing
the upstream twice. It's just that the current system is unacceptable.
Maybe we should also consider a complete architecture shift: keep the
batching system, but use thrift to serialize the Scala types to a
language-agnostic format and to handle the blocking RPC. Then PySpark and
SparkR could share the same simple UDF architecture. The main drawback is
that I'm not sure how we would support broadcast variables or
aggregators, but should those even be supported with UDFs?
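The batching-RPC idea could look roughly like this sketch, using JSON as a stand-in for thrift (the function names, wire format, and batch shape are all illustrative, not a proposed API):

```python
import json

def encode_batch(rows):
    """Serialize a batch of rows to a language-agnostic wire format
    (JSON here; the proposal above would use thrift)."""
    return json.dumps(rows).encode("utf-8")

def call_udf(batch_bytes, udf):
    """Blocking RPC: decode a batch, apply the UDF, re-encode the results.
    Any language with a thrift binding could implement this same loop."""
    rows = json.loads(batch_bytes.decode("utf-8"))
    return encode_batch([udf(r) for r in rows])

result = json.loads(call_udf(encode_batch([1, 2, 3]), lambda x: x * 2))
# result == [2, 4, 6]
```

Because the whole exchange is one request/response per batch, there is no pipelined buffering to deadlock on, and PySpark and SparkR would only differ in their local implementation of `call_udf`.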