Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/17297
  
    @kayousterhout - I understand your concern, and I agree that canceling the 
running tasks is definitely the simpler approach, but it is very inefficient 
for large jobs whose tasks can run for hours. In our environment, where fetch 
failures are common, this change not only improves job performance when a fetch 
failure occurs, it also helps reliability. If we cancel all running reducers, we 
might end up in a state where jobs make no progress at all under frequent fetch 
failures, because they just flip-flop between two stages.
    
    Comparing this approach to how Hadoop handles fetch failures: Hadoop does 
not fail any reducer when it detects that a map output is missing. The reducers 
simply continue processing output from the other mappers while the missing 
output is recomputed concurrently. This gives Hadoop a big edge over Spark for 
long-running jobs that hit multiple fetch failures. This change is one step 
towards making Spark robust against fetch failures; eventually we would want 
the Hadoop model, where no task is failed when a map output goes missing.
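    
    To make the intended behavior concrete, here is a minimal sketch in plain 
Scala (not Spark's actual DAGScheduler API; `ReducerState` and `onFetchFailure` 
are hypothetical names) of the bookkeeping this model implies: on a fetch 
failure, only the missing map output is resubmitted for recomputation, and the 
running reducers stay alive and keep consuming the outputs that are still 
available.
    
```scala
object FetchFailureSketch {
  // Which map outputs each running reducer still has to fetch.
  final case class ReducerState(id: Int, pendingMaps: Set[Int])

  // Hypothetical handler: resubmit only the failed map output and keep
  // every running reducer alive instead of cancelling the whole stage.
  def onFetchFailure(
      failedMap: Int,
      reducers: Seq[ReducerState]): (Set[Int], Seq[ReducerState]) = {
    val mapsToRecompute = Set(failedMap)  // recompute just the lost output
    val stillRunning    = reducers        // nothing gets cancelled
    (mapsToRecompute, stillRunning)
  }

  def main(args: Array[String]): Unit = {
    val reducers = Seq(ReducerState(0, Set(0, 1, 2)), ReducerState(1, Set(0, 1, 2)))
    val (recompute, alive) = onFetchFailure(failedMap = 1, reducers = reducers)
    println(s"maps to recompute: $recompute; reducers still running: ${alive.map(_.id)}")
  }
}
```
    
    The contrast with canceling is that, under the current behavior, those 
reducers would be resubmitted along with the map stage, which is exactly the 
flip-flop described above.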
    
    Regarding the approach, please let me know if you can think of a way to 
reduce the complexity of this change.
    
    cc @markhamstra, @rxin, @sameeragarwal

