Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/6352#issuecomment-104847514
Oh, one other thought: maybe a good exercise would be to attempt to write
the Scaladoc comment for `getMissingParentStages` which describes, in prose,
the basic high-level algorithm for finding missing parent stages. I can help
with this tomorrow. Even if you don't end up modifying
`getMissingParentStages`, I'd love to submit a new PR that just comments /
explains the existing code in order to make this easier to understand in the
future.
To help me build some intuition for understanding your optimization here:
It looks like this only save us from performing `getCacheLocs` lookups in
cases where we're traversing backwards through a long chain of narrow
dependencies. I don't think that this is necessarily safe. Imagine that we
have a lineage graph which looks something like this:
```
âââââ shuffle âââââ âââââ
â A ââ â â â ââ B ââââââ C ââââ
âââââ âââââ âââââ â
âââââ
ââââ E â
âââââ â âââââ
â D ââââ
âââââ
```
Here, `E` has one-to-one dependencies on `C` and `D`. `C` is derived from
`A` by performing a shuffle and then a map. If we're trying to determine which
ancestor stages need to be computed in order to compute `E`, we need to figure
out whether the shuffle `A -> B` should be performed. If the RDD `C`, which
has only one ancestor via a narrow dependency, is cached, then we won't need to
compute `A`, even if it has some unavailable output partitions. The same goes
for `B`: if `B` is 100% cached, then we can avoid the shuffle on `A`. Based on
this, I don't think that we can make a local decision to skip the caching check
based on the structure of the RDD graph. However, we _might_ be able to skip /
optimize this check based on RDDs' storage levels: in long chains of narrow
dependencies, most RDDs probably _aren't_ cached, so adding a simple `if
StorageLevel = None return Seq.fill(numPartitions)(Nil)` check to
`getCacheLocs` might be safe / sufficient.
Someone more familiar with StorageLevel / caching semantics should
double-check this reasoning to make sure that I'm not overlooking any
corner-cases when RDDs' storage levels change due to unpersist / cache /
persist calls.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]