Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/9241#issuecomment-150999439
Hey @davies,
This patch makes sense to me at a high level, but I have a few questions:
- Could you add a description to this pull request and share your list of
remaining TODOs with me (maybe by posting it as a checklist in the PR
description)?
- Can you comment on the thread-safety concerns here? My current
understanding is that we don't have to worry about memory-manager-triggered
spills racing with other interactions on the spillables because the iterator
model provides some implicit synchronization. In order to convince ourselves
that this is safe in all cases, however, I'd like to think through two
corner cases:
  - What happens if a single task contains multiple threads? Currently,
  this can happen in PythonRDD, PipedRDD, and a couple of other places. All of
  these cases are situations where we have a writer or reader thread for
  interacting with an external process. Although we have separate threads, they
  are somewhat synchronous / coupled via their interaction with the external
  process. This could be tricky, though, so I'd like to talk through some
  examples to make sure we've covered all of the tricky cases.
  - What happens if operator B is in the middle of processing a `next()`
  call on its iterator, which calls its parent's `next()` method, which requires
  memory to grow, which triggers a spill that drains memory from A and
  de-allocates or spills data structures that it's relying on? Do we have to make
  any distinctions between asking an upstream operator to spill versus a
  downstream one?
- Are there any risks of deadlocks with the extra synchronization added
here?
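
A minimal sketch of the second corner case, written in Python for brevity. Every name here (`Pool`, `Parent`, `Child`, `next_unsafe`, `next_safe`) is hypothetical and none of it is Spark's actual API. It assumes one reading of the scenario: the operator that is forced to spill is the same one that is in the middle of `next()`, and the spill happens re-entrantly on the same thread.

```python
class Pool:
    """Hypothetical shared memory pool that asks other consumers to spill."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.consumers = []

    def acquire(self, requester, n):
        # Ask every consumer except the requester to spill until n units fit.
        for consumer in self.consumers:
            if self.used + n <= self.capacity:
                break
            if consumer is not requester:
                self.used -= consumer.spill()
        if self.used + n > self.capacity:
            raise MemoryError("pool exhausted")
        self.used += n


class Parent:
    """Hypothetical upstream operator: each next() needs one more unit."""

    def __init__(self, pool, rows):
        self.pool = pool
        self.rows = iter(rows)
        pool.consumers.append(self)

    def spill(self):
        return 0  # nothing spillable in this sketch

    def next(self):
        self.pool.acquire(self, 1)  # may force other consumers to spill
        return next(self.rows)


class Child:
    """Hypothetical downstream operator with an in-memory buffer."""

    def __init__(self, pool, parent, buffered):
        self.pool = pool
        self.parent = parent
        self.buffer = list(buffered)
        self.spilled = []  # stands in for data written to disk
        pool.acquire(self, len(self.buffer))
        pool.consumers.append(self)

    def spill(self):
        freed = len(self.buffer)
        self.spilled.extend(self.buffer)
        self.buffer = []
        return freed

    def next_unsafe(self):
        idx = len(self.buffer) - 1    # position computed before the upstream call
        row = self.parent.next()      # can call self.spill() re-entrantly
        return row, self.buffer[idx]  # stale index if the buffer was spilled

    def next_safe(self):
        row = self.parent.next()
        # Re-resolve where the data lives only after the upstream call returns.
        source = self.buffer if self.buffer else self.spilled
        return row, source[-1]
```

With a pool of capacity 2 and a two-row buffer in `Child`, the parent's request for one more unit forces `Child.spill()` while `next_unsafe()` is still on the stack, so the stale index fails. `next_safe()` avoids that in this toy model only because it reads no operator state across the upstream call; whether the real operators can guarantee the same is the open question.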
I'm going to focus on merging my memory manager unification patch tonight
so that we can start rebasing this tomorrow.