[GitHub] spark pull request: [SPARK-3293] yarn's web show "SUCCEEDED" when ...

tgravescs Tue, 23 Sep 2014 19:36:43 -0700

Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2311#discussion_r17951431
  
    --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
    @@ -91,7 +94,11 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments,
             if (sc != null) {
               logInfo("Invoking sc stop from shutdown hook")
               sc.stop()
    -          finish(FinalApplicationStatus.SUCCEEDED)
    +        }
    +
    +        // Shuts down the AM.
    +        if (!finished) {
    --- End diff --
    
    Yeah, unless we want to add more logic to figure out what shouldn't be 
retried, the easy thing is to just retry on any failure. Obviously if the 
machine dies this code won't be running; it's more that something weird 
happens that causes it to crash or exit badly.
    
    There are actually some potential issues with rerunning the AM, though. One 
is what we refer to as split brain: one AM loses its connection to the RM but 
keeps running, so the RM starts a second AM, and both write to the same output 
dir and cause issues with the output data. I filed a jira for this to try to 
handle it in the Spark AM.
    
    The second occurs if the first run had committed its output and we rerun it 
when we shouldn't. The reason we don't want that to happen is to prevent data 
corruption. Many times in MR one job will start once another's output is 
committed, so if that output were to get changed out from under it by a rerun 
of the AM, it could lose data. I'm not sure that same kind of check is as easy 
with Spark.
    
    MR handles both of those cases. Obviously if your MR job is writing to some 
other service, is using a custom file output, or has some other side effects, 
it's up to the user to guarantee that it can be rerun.
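    
    As a rough illustration of the second case only, a check before a rerun 
could look something like the sketch below. `RerunGuard`, `outputCommitted` 
and `shouldRerun` are made-up names, not anything in Spark or this PR, and the 
sketch assumes the output lives on a local path and that a committed job 
leaves a `_SUCCESS` marker the way Hadoop's FileOutputCommitter does by 
default. It does nothing for the split brain case.

```scala
import java.nio.file.{Files, Path}

// Hypothetical helper, not part of Spark: decides whether an AM re-attempt
// may run, based on a commit marker in the output directory.
object RerunGuard {
  // Assumption: a committed job leaves a _SUCCESS file in its output dir,
  // as Hadoop's FileOutputCommitter does by default.
  val CommitMarker = "_SUCCESS"

  // True if a previous attempt already committed its output.
  def outputCommitted(outputDir: Path): Boolean =
    Files.exists(outputDir.resolve(CommitMarker))

  // Rerun only when nothing has been committed yet, so a downstream job
  // never sees committed output change underneath it.
  def shouldRerun(outputDir: Path): Boolean =
    !outputCommitted(outputDir)
}
```

    A real version would have to go through the Hadoop FileSystem API rather 
than java.nio, and it would only cover jobs that commit through a marker like 
this one.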
    
    I'm assuming it's the user's responsibility with Spark, since Spark can rerun 
tasks/stages on failure. Any input on that, Matei?


