Github user tnachen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4984#discussion_r30004596
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala ---
    @@ -130,7 +142,7 @@ private[spark] class CoarseMesosSchedulerBackend(
           command.setValue(
             "%s \"%s\" org.apache.spark.executor.CoarseGrainedExecutorBackend"
               .format(prefixEnv, runScript) +
    -        s" --driver-url $driverUrl" +
    +        s" --driver-url $driverURL" +
             s" --executor-id ${offer.getSlaveId.getValue}" +
    --- End diff --
    
    After testing this patch with a real Mesos cluster, I got a bunch of errors when I kill an executor, start it again, and try to recompute a shuffle job, such as:
    
    15/05/10 18:15:55 WARN TaskSetManager: Lost task 11.0 in stage 3.0 (TID 53, 10.70.15.58): FetchFailed(null, shuffleId=0, mapId=-1, reduceId=11, message=
    org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
        at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:389)
        at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:386)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
        at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:385)
        at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:172)
        at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
        at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
        at org.apache.spark.scheduler.Task.run(Task.scala:70)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    )
    
    I hit a similar error in the past when I worked on this feature, and Aaron Davidson told me that for the external shuffle service to work, every newly launched executor must have a new executor ID and shouldn't reuse the one from before. That's why, in my closed PR (3681), I introduced a sparkExecutorId method that concatenates the slave ID and task ID to form a new executor ID.
    I think you also need to make every newly launched executor's ID unique so the external shuffle client/service interaction works, rather than simply using the slave ID.
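    
    For illustration, here is a minimal sketch of that approach (the method name sparkExecutorId comes from my PR; the exact signature and separator here are just assumptions for the sketch, not the code from PR 3681):
    
        // Sketch: derive a per-launch executor ID by concatenating the Mesos
        // slave ID with the Mesos task ID, so a relaunched executor on the
        // same slave never reuses an earlier executor's ID.
        // (Separator and signature are illustrative, not the exact PR code.)
        def sparkExecutorId(slaveId: String, taskId: String): String =
          s"$slaveId/$taskId"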

