vastian180 commented on PR #3341:
URL: https://github.com/apache/celeborn/pull/3341#issuecomment-2991321296

   case 1: During deserialization of the GetReducerFileGroupResponse broadcast, 
ExecutorLostFailure happend because Container was preempted,  leads to 
reporting a fetch failure.
   ```
   25/06/16 08:39:21 INFO Executor task launch worker for task 30724 
SparkUtils: Deserializing GetReducerFileGroupResponse broadcast for shuffle: 0
   25/06/16 08:39:21 INFO Executor task launch worker for task 30724 
TorrentBroadcast: Started reading broadcast variable 7 with 3 pieces (estimated 
total size 12.0 MiB)
   ......
   
   25/06/16 08:39:21 ERROR Executor task launch worker for task 30724 
SparkUtils: Failed to deserialize GetReducerFileGroupResponse for shuffle: 0
   java.io.IOException: org.apache.spark.SparkException: Exception thrown in 
awaitResult: 
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1387)
        at 
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:226)
        at 
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:103)
        at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
        at 
org.apache.spark.shuffle.celeborn.SparkUtils.lambda$deserializeGetReducerFileGroupResponse$4(SparkUtils.java:600)
        at org.apache.celeborn.common.util.KeyLock.withLock(KeyLock.scala:65)
        at 
org.apache.spark.shuffle.celeborn.SparkUtils.deserializeGetReducerFileGroupResponse(SparkUtils.java:585)
        at 
org.apache.spark.shuffle.celeborn.CelebornShuffleReader$$anon$5.apply(CelebornShuffleReader.scala:485)
        at 
org.apache.spark.shuffle.celeborn.CelebornShuffleReader$$anon$5.apply(CelebornShuffleReader.scala:480)
        at 
org.apache.celeborn.client.ShuffleClient.deserializeReducerFileGroupResponse(ShuffleClient.java:321)
        at 
org.apache.celeborn.client.ShuffleClientImpl.loadFileGroupInternal(ShuffleClientImpl.java:1876)
        at 
org.apache.celeborn.client.ShuffleClientImpl.lambda$updateFileGroup$9(ShuffleClientImpl.java:1935)
        at 
java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
        at 
org.apache.celeborn.client.ShuffleClientImpl.updateFileGroup(ShuffleClientImpl.java:1931)
        at 
org.apache.spark.shuffle.celeborn.CelebornShuffleReader.read(CelebornShuffleReader.scala:119)
        at 
org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:225)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
        at org.apache.spark.scheduler.Task.run(Task.scala:130)
        at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:477)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1428)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:480)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
   Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult: 
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:302)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
        at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:105)
        at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:89)
        at 
org.apache.spark.storage.BlockManagerMaster.getLocationsAndStatus(BlockManagerMaster.scala:93)
        at 
org.apache.spark.storage.BlockManager.getRemoteBlock(BlockManager.scala:1179)
        at 
org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:1341)
        at 
org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBlocks$1(TorrentBroadcast.scala:180)
        at 
scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at 
org.apache.spark.broadcast.TorrentBroadcast.readBlocks(TorrentBroadcast.scala:169)
        at 
org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$4(TorrentBroadcast.scala:253)
        at scala.Option.getOrElse(Option.scala:189)
        at 
org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$2(TorrentBroadcast.scala:231)
        at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
        at 
org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$1(TorrentBroadcast.scala:226)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1380)
        ... 40 more
   Caused by: java.util.concurrent.RejectedExecutionException: Task 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@739bc42 
rejected from 
java.util.concurrent.ScheduledThreadPoolExecutor@66c2c5b0[Terminated, pool size 
= 0, active threads = 0, queued tasks = 0, completed tasks = 0]
        at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
        at 
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:326)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533)
        at 
org.apache.spark.rpc.netty.NettyRpcEnv.askAbortable(NettyRpcEnv.scala:264)
        at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.askAbortable(NettyRpcEnv.scala:552)
        at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:556)
        at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:104)
        ... 54 more
   
   25/06/16 08:39:21 ERROR Executor task launch worker for task 30723 
ShuffleClientImpl: Exception raised while call GetReducerFileGroup for 0.
   org.apache.celeborn.common.exception.CelebornIOException: Failed to get 
GetReducerFileGroupResponse broadcast for shuffle: 0
        at 
org.apache.celeborn.client.ShuffleClientImpl.loadFileGroupInternal(ShuffleClientImpl.java:1878)
   ......
   
   25/06/16 08:39:21 WARN Executor task launch worker for task 30724 
CelebornShuffleReader: Handle fetch exceptions for 0-0
   org.apache.celeborn.common.exception.CelebornIOException: Failed to load 
file group of shuffle 0 partition 4643! Failed to get 
GetReducerFileGroupResponse broadcast for shuffle: 0
        at 
org.apache.celeborn.client.ShuffleClientImpl.updateFileGroup(ShuffleClientImpl.java:1943)
        at 
org.apache.spark.shuffle.celeborn.CelebornShuffleReader.read(CelebornShuffleReader.scala:119)
   ......
   Caused by: org.apache.celeborn.common.exception.CelebornIOException: Failed 
to get GetReducerFileGroupResponse broadcast for shuffle: 0
        at 
org.apache.celeborn.client.ShuffleClientImpl.loadFileGroupInternal(ShuffleClientImpl.java:1878)
        at 
org.apache.celeborn.client.ShuffleClientImpl.lambda$updateFileGroup$9(ShuffleClientImpl.java:1935)
        at 
java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
        at 
org.apache.celeborn.client.ShuffleClientImpl.updateFileGroup(ShuffleClientImpl.java:1931)
        ... 27 more
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@celeborn.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to