GitHub user rxin commented on the pull request:
https://github.com/apache/spark/pull/8662#issuecomment-141004821
OK I finally understood what's going on here. If I understand your
intention correctly, you are assuming:
1. The 1st read goes into Python
2. The 2nd read can keep up and won't lag too far behind the 1st read
So the memory consumption is somewhat bounded.
If you are doing this, can't you simplify this a lot by removing all the
synchronization, the owner thing, and the RDD, and just doing the following in
the mapPartitions call:
1. consume the input iterator, adding each record to two queues
2. have Python read from one queue
3. have the 2nd read (zip) read from the 2nd queue (see the sketch below)
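Roughly, a minimal sketch of what I have in mind, assuming a hypothetical
`evalInPython` helper that stands in for the actual round trip to the Python
worker:

```scala
import java.util.concurrent.LinkedBlockingQueue

// Sketch only: `evalInPython` is a hypothetical stand-in for feeding
// records to the Python worker and reading back its output.
def zipWithPython[T, U](input: Iterator[T])(
    evalInPython: Iterator[T] => Iterator[U]): Iterator[(T, U)] = {
  val toPython = new LinkedBlockingQueue[Option[T]]()
  val toZip    = new LinkedBlockingQueue[T]()

  // 1. Consume the input iterator once, feeding both queues;
  //    None marks the end of the partition for the Python side.
  new Thread("feeder") {
    setDaemon(true)
    override def run(): Unit = {
      input.foreach { rec => toPython.put(Some(rec)); toZip.put(rec) }
      toPython.put(None)
    }
  }.start()

  // 2. Python consumes records from the first queue ...
  val pythonIn  = Iterator.continually(toPython.take()).takeWhile(_.isDefined).map(_.get)
  val pythonOut = evalInPython(pythonIn)

  // 3. ... and the zip side drains the second queue as results arrive.
  pythonOut.map(out => (toZip.take(), out))
}
```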
You mentioned earlier there might be a chance of deadlock. I don't
understand why it would deadlock. But even if it could deadlock with the above
approach, you can change it slightly to achieve the same thing you have now by
doing the following in the mapPartitions call:
1. create an iterator wrapper that adds each consumed record to a blocking
queue
2. for each output record from Python, drain an element from the queue (see
the sketch below)
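Again a rough sketch under the same hypothetical `evalInPython` helper:

```scala
import java.util.concurrent.LinkedBlockingQueue

// Sketch only: `evalInPython` is again a hypothetical stand-in for the
// Python round trip.
def zipViaWrapper[T, U](input: Iterator[T])(
    evalInPython: Iterator[T] => Iterator[U]): Iterator[(T, U)] = {
  val consumed = new LinkedBlockingQueue[T]()

  // 1. Wrap the input iterator so every record handed to Python is
  //    also remembered in the queue.
  val wrapped = input.map { rec => consumed.put(rec); rec }

  // 2. For each output record from Python, drain one element; take()
  //    blocks until the matching input has actually been consumed.
  evalInPython(wrapped).map(out => (consumed.take(), out))
}
```

Note that no extra thread is needed here, and the queue only holds records
that are in flight between Python consuming them and emitting the
corresponding output, which gives you exactly the bounded memory consumption
assumed above.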