GitHub user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/10705#issuecomment-186411074
@zsxwing, one problem is avoiding double-release of locks. Consider
the case where you read a block from the block manager, do a bunch of pipelined
filters and maps on it, and then do a `take` or `limit`: the `take` or `limit`
operator doesn't know the ids of the blocks being read and thus
doesn't know which blocks to register with the task completion callback. Even
if we did know which blocks to register, we'd then have to guard against
double-releasing a block that `CompletionIterator` had already freed (which
happens when the `take()` ends up consuming the entire iterator).
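To make the hazard concrete, here's a minimal sketch (`BlockLock` is a hypothetical stand-in, not a Spark class, and the `CompletionIterator` here is a simplified analogue of Spark's):

```scala
// Hypothetical stand-in for a per-block read lock; release() flags a
// second call instead of failing silently.
class BlockLock(val blockId: String) {
  private var released = false
  def release(): Unit = {
    if (released) println(s"BUG: double release of lock on $blockId")
    else released = true
  }
}

// Simplified analogue of Spark's CompletionIterator: fires a callback
// once the wrapped iterator is exhausted.
class CompletionIterator[A](sub: Iterator[A], completion: () => Unit)
    extends Iterator[A] {
  private var completed = false
  def hasNext: Boolean = {
    val more = sub.hasNext
    if (!more && !completed) { completed = true; completion() }
    more
  }
  def next(): A = sub.next()
}

object DoubleReleaseDemo extends App {
  val lock = new BlockLock("rdd_0_0")
  val iter = new CompletionIterator[Int](Iterator(1, 2, 3), () => lock.release())

  // A take(3) pulls exactly the three available elements...
  val taken = (1 to 3).map(_ => iter.next())
  // ...and the next hasNext probe fires the completion callback,
  // which releases the lock:
  iter.hasNext
  // A task-completion callback that also released this block would
  // now be a double release:
  lock.release() // prints the BUG warning
}
```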
The more principled approach to handling `take()` and `limit()` is to use
an internal iterator interface that supports an explicit `close()` operation
to signal that the rest of the iterator will not be consumed.
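Something like the following sketch (the trait and class names here are illustrative, not a committed API): `close()` is idempotent, so the "fully consumed" path and the "closed early by `take()`/`limit()`" path can both invoke it without double-releasing the lock.

```scala
// Iterator that can be told the remaining elements won't be consumed.
trait ClosableIterator[A] extends Iterator[A] {
  /** Release any underlying resources; safe to call more than once. */
  def close(): Unit
}

// Wraps a block-read iterator and releases the block lock exactly once,
// whether the iterator is drained or closed early.
class BlockReadIterator[A](sub: Iterator[A], releaseLock: () => Unit)
    extends ClosableIterator[A] {
  private var closed = false

  override def hasNext: Boolean = {
    val more = !closed && sub.hasNext
    if (!more) close()  // exhausted (or already closed): release once
    more
  }

  override def next(): A = {
    if (closed) throw new NoSuchElementException("iterator closed")
    sub.next()
  }

  override def close(): Unit = {
    if (!closed) {  // idempotent: the double-release guard lives here
      closed = true
      releaseLock()
    }
  }
}
```

A `take(n)` built against this interface would pull its n elements and then call `close()`; a task-completion callback could still call `close()` on the same iterator as a safety net, and the guard makes the second call a no-op.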