Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/6352#issuecomment-104847514
  
    Oh, one other thought: maybe a good exercise would be to attempt to write 
the Scaladoc comment for `getMissingParentStages` which describes, in prose, 
the basic high-level algorithm for finding missing parent stages.  I can help 
with this tomorrow.  Even if you don't end up modifying 
`getMissingParentStages`, I'd love to submit a new PR that just comments / 
explains the existing code in order to make this easier to understand in the 
future.
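
    To make that concrete, here's a rough, self-contained sketch of the traversal I have in mind (using simplified stand-in types rather than the real `DAGScheduler` / `Stage` / `Dependency` classes, so treat all of the names as made up):

    ```scala
    import scala.collection.mutable

    // Simplified stand-ins for Spark's internal scheduler types.
    sealed trait Dep { def parent: Rdd }
    case class NarrowDep(parent: Rdd) extends Dep
    case class ShuffleDep(parent: Rdd) extends Dep
    case class Rdd(id: Int, deps: Seq[Dep])
    case class Stage(id: Int, rdd: Rdd, isAvailable: Boolean)

    // DFS backwards through the lineage of `stage`'s RDD.  Narrow dependencies are
    // followed within the same stage; each shuffle dependency marks a stage
    // boundary, and the map stage behind it is "missing" if its output isn't
    // already available.  A fully cached RDD cuts the traversal short, since none
    // of its ancestors need to be recomputed.
    def missingParentStages(
        stage: Stage,
        stageForShuffle: ShuffleDep => Stage,
        isFullyCached: Rdd => Boolean): Seq[Stage] = {
      val missing = mutable.HashSet[Stage]()
      val visited = mutable.HashSet[Rdd]()

      def visit(rdd: Rdd): Unit = {
        if (!visited(rdd)) {
          visited += rdd
          if (!isFullyCached(rdd)) {
            rdd.deps.foreach {
              case dep: ShuffleDep =>
                val mapStage = stageForShuffle(dep)
                if (!mapStage.isAvailable) missing += mapStage
              case NarrowDep(parent) =>
                visit(parent)
            }
          }
        }
      }

      visit(stage.rdd)
      missing.toSeq
    }
    ```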
    
    To help me build some intuition for understanding your optimization here:
    
    It looks like this only saves us from performing `getCacheLocs` lookups in 
cases where we're traversing backwards through a long chain of narrow 
dependencies.  I don't think that this is necessarily safe.  Imagine that we 
have a lineage graph which looks something like this:
    
    ```
    ┌───┐ shuffle ┌───┐    ┌───┐
    │ A │◀ ─ ─ ─ ─│ B │◀───│ C │◀─┐
    └───┘         └───┘    └───┘  │  ┌───┐
                                  ├──│ E │
                           ┌───┐  │  └───┘
                           │ D │◀─┘
                           └───┘
    ```
    
    Here, `E` has one-to-one dependencies on `C` and `D`.  `C` is derived from 
`A` by performing a shuffle and then a map.  If we're trying to determine which 
ancestor stages need to be computed in order to compute `E`, we need to figure 
out whether the shuffle `A -> B` should be performed.  If the RDD `C`, which 
has only one ancestor via a narrow dependency, is cached, then we won't need to 
compute `A`, even if it has some unavailable output partitions.  The same goes 
for `B`: if `B` is 100% cached, then we can avoid the shuffle on `A`.  Based on 
this, I don't think that we can make a local decision to skip the caching check 
based on the structure of the RDD graph.  However, we _might_ be able to skip / 
optimize this check based on RDDs' storage levels: in long chains of narrow 
dependencies, most RDDs probably _aren't_ cached, so adding a simple `if (rdd.getStorageLevel == StorageLevel.NONE) return Seq.fill(numPartitions)(Nil)` check to `getCacheLocs` might be safe / sufficient.
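
    Very roughly, the check I'm imagining would look something like this (the helper name and the block-manager lookup hook are invented for the example; the real `getCacheLocs` lives in `DAGScheduler` and does its own bookkeeping):

    ```scala
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // Hypothetical short-circuit around the cache-location lookup: an RDD whose
    // storage level is NONE should not have any cached partitions (this is exactly
    // the assumption I'd like someone to double-check below), so there would be
    // nothing to ask the block manager about.  `lookupFromBlockManager` stands in
    // for whatever the scheduler actually does today.
    def cacheLocsFor(
        rdd: RDD[_],
        lookupFromBlockManager: RDD[_] => Seq[Seq[String]]): Seq[Seq[String]] = {
      if (rdd.getStorageLevel == StorageLevel.NONE) {
        // One empty preferred-location list per partition.
        Seq.fill(rdd.partitions.length)(Nil)
      } else {
        lookupFromBlockManager(rdd)
      }
    }
    ```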
    
    Someone more familiar with StorageLevel / caching semantics should 
double-check this reasoning to make sure that I'm not overlooking any 
corner-cases when RDDs' storage levels change due to unpersist / cache / 
persist calls.
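
    For reference, the kind of storage-level transitions I'm worried about look like this (a toy driver program, nothing to do with this PR's code):

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // An RDD's reported storage level changes as the user persists / unpersists it,
    // so any scheduler-side shortcut keyed off StorageLevel.NONE has to stay
    // correct across these transitions.
    object StorageLevelTransitions {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setMaster("local[2]").setAppName("storage-levels"))
        val rdd = sc.parallelize(1 to 1000)

        println(rdd.getStorageLevel == StorageLevel.NONE)  // true: never persisted
        rdd.persist(StorageLevel.MEMORY_ONLY)
        println(rdd.getStorageLevel == StorageLevel.NONE)  // false: marked for caching
        rdd.count()                                        // materializes the cached blocks
        rdd.unpersist()
        println(rdd.getStorageLevel == StorageLevel.NONE)  // true again: blocks removed

        sc.stop()
      }
    }
    ```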

