GitHub user sarutak opened a pull request:

    https://github.com/apache/spark/pull/1383

    [SPARK-1667] Jobs never finish successfully once a bucket file goes missing

    If a job executes a shuffle, bucket files are created in a temporary
    directory (named like spark-local-*).
    When those bucket files go missing, caused by a disk failure or any other
    reason, jobs can no longer execute any shuffle that has the same shuffle
    id as the missing files.
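
    For illustration (not part of the patch), here is a minimal job that
    triggers a shuffle; the map side of the reduceByKey stage writes bucket
    files under a spark-local-* directory on each Executor. The app name and
    input path are hypothetical.

        import org.apache.spark.{SparkConf, SparkContext}

        val sc = new SparkContext(new SparkConf().setAppName("shuffle-example"))
        val counts = sc.textFile("hdfs:///tmp/input")  // hypothetical path
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)  // shuffle boundary: bucket files written here
        counts.count()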
    
    I think that when an Executor cannot read bucket files from its local
    directory (spark-local-*), it should abort and be marked as lost.
    In this case, the Executor that holds the bucket files throws a
    FileNotFoundException, so we should handle IOException as fatal in
    Utils.scala so that the Executor aborts.
    
    After I modified the code as follows, an Executor fetching bucket files
    from the failed Executor was able to retry the fetch from another
    Executor.
    
    
        import java.io.IOException

        import scala.util.control.{ControlThrowable, NonFatal}

        def isFatalError(e: Throwable): Boolean = {
          e match {
            case _: IOException =>
              // Treat I/O errors (e.g. a missing bucket file) as fatal so
              // the Executor aborts and can be marked as lost.
              true
            case NonFatal(_) | _: InterruptedException | _: NotImplementedError | _: ControlThrowable =>
              false
            case _ =>
              true
          }
        }
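
    As a sketch of how this is used (simplified, not the actual executor
    code), a task-running loop can consult isFatalError to decide between
    letting the scheduler retry a task and taking the whole Executor down so
    it is marked as lost; runTask and the exit policy below are illustrative.

        // Simplified sketch; runTask and the exit policy are hypothetical.
        def runTask(task: () => Unit): Unit = {
          try {
            task()
          } catch {
            case t: Throwable if isFatalError(t) =>
              // Fatal (now including IOException from a missing bucket
              // file): kill the Executor so the driver marks it as lost and
              // fetching Executors can retry from another Executor.
              System.exit(1)
            case t: Throwable =>
              // Non-fatal: report the failure and let the task be retried.
              println(s"Task failed, will be retried: $t")
          }
        }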


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sarutak/spark SPARK-1667

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1383.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1383
    
----
commit 1fc1ba05d21d9732e8c80273d13792a2879c8f1f
Author: Kousuke Saruta <[email protected]>
Date:   2014-07-12T01:08:26Z

    Modify for SPARK-1667

commit c2044d6a53e7767b89c46f92ad5cec4635602bac
Author: Kousuke Saruta <[email protected]>
Date:   2014-07-12T02:31:55Z

    Modified Utils.scala to handle IOException as fatal

----

