GitHub user rxin commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-141004821
OK I finally understood what's going on here. If I understand your
intention correctly, you are assuming:
1. The 1st read goes into Python
2. The 2nd read can keep up and won't lag too far behind the 1st read
So the memory consumption is somewhat bounded.
If you are doing this, can't you simplify this a lot by removing all the
synchronization, the owner thing, and the RDD, and just doing the following in
the mapPartitions call:
1. consume the input iterator, adding each record to two queues
2. have Python read from one queue
3. have the 2nd read (zip) read from the 2nd queue (see the sketch below)
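Roughly, a minimal sketch of what I have in mind, assuming a hypothetical
`evalInPython` helper that stands in for the actual round trip to the Python
worker:

```scala
import java.util.concurrent.LinkedBlockingQueue

// Sketch only: `evalInPython` is a hypothetical stand-in for feeding
// records to the Python worker and reading back its output.
def zipWithPython[T, U](input: Iterator[T])(
    evalInPython: Iterator[T] => Iterator[U]): Iterator[(T, U)] = {
  val toPython = new LinkedBlockingQueue[Option[T]]()
  val toZip    = new LinkedBlockingQueue[T]()

  // 1. Consume the input iterator once, feeding both queues;
  //    None marks the end of the partition for the Python side.
  new Thread("feeder") {
    setDaemon(true)
    override def run(): Unit = {
      input.foreach { rec => toPython.put(Some(rec)); toZip.put(rec) }
      toPython.put(None)
    }
  }.start()

  // 2. Python consumes records from the first queue ...
  val pythonIn  = Iterator.continually(toPython.take()).takeWhile(_.isDefined).map(_.get)
  val pythonOut = evalInPython(pythonIn)

  // 3. ... and the zip side drains the second queue as results arrive.
  pythonOut.map(out => (toZip.take(), out))
}
```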
You mentioned earlier there might be a chance of deadlock. I don't
understand why it would deadlock. But even if it could deadlock with the above
approach, you can change it slightly to achieve the same thing you have now by
doing the following in the mapPartitions call:
1. create an iterator wrapper that adds each consumed record to a blocking
queue
2. for each output record from Python, drain an element from the queue (see
the sketch below)
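Again a rough sketch under the same hypothetical `evalInPython` helper:

```scala
import java.util.concurrent.LinkedBlockingQueue

// Sketch only: `evalInPython` is again a hypothetical stand-in for the
// Python round trip.
def zipViaWrapper[T, U](input: Iterator[T])(
    evalInPython: Iterator[T] => Iterator[U]): Iterator[(T, U)] = {
  val consumed = new LinkedBlockingQueue[T]()

  // 1. Wrap the input iterator so every record handed to Python is
  //    also remembered in the queue.
  val wrapped = input.map { rec => consumed.put(rec); rec }

  // 2. For each output record from Python, drain one element; take()
  //    blocks until the matching input has actually been consumed.
  evalInPython(wrapped).map(out => (consumed.take(), out))
}
```

Note that no extra thread is needed here, and the queue only holds records
that are in flight between Python consuming them and emitting the
corresponding output, which gives you exactly the bounded memory consumption
assumed above.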