Github user andrewor14 commented on the pull request:
https://github.com/apache/spark/pull/6990#issuecomment-120515683
@dibbhatt Thanks for submitting the fix. I spoke offline with @tdas about
this issue and came to the following conclusions.
**Problem summary.** The issue seems to be a disconnect between what the
`BlockManagerMaster` knows and what the `ReceiverTracker` knows. If a block
fails to unroll, the receiver tracker will never know about the block and will
not include it in a future computation. In the mean time, however, the block
may be replicated and take up space on other executors even though it will
never be used.
**Implications for Spark core.** For Spark core, however, it is reasonable
to replicate a block even if it fails to unroll. Just because there is not
enough memory to cache this block on this executor doesn't mean the same is
true on a different executor. This is all best effort, but a future computation
of the RDD will still benefit from having the cached block somewhere. (Note:
the existing code doesn't actually do this yet because `CacheManager` has its
own unrolling logic. But replication even when unrolling fails is something we
want in the future for normal RDD caching.)
**Alternative fix.** The right fix in Spark streaming would be to have the
`ReceiverTracker` just read its blocks from the `BlockManagerMaster`. This
simplifies the two divergent block reporting code paths. Since the
`BlockManagerMaster` is notified of replicated blocks, the replication here
will also help mitigate data loss in the case of `MEMORY_ONLY_*`.
**TL;DR.** This patch removes a small feature from block manager that,
though not currently used, is desirable in the future. However, the underlying
issue is not caused by a bug in the block manager, but an incorrect assumption
in the `ReceiverTracker` that doesn't take into account replication. The
correct way forward would be to fix this in Spark streaming by refactoring the
`ReceiverTracker` to depend on `BlockManagerMaster`.
In short, we should consider an alternative fix that is longer-term and
more correct.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]