buska88 commented on PR #3090:
URL: https://github.com/apache/celeborn/pull/3090#issuecomment-2854107945

   ```
   25/04/29 21:42:07 INFO dispatcher-CoarseGrainedScheduler TaskSetManager: 
Starting task 3928.1 in stage 23.0 (TID 29935, zw06-data-hdp-dn29550.mt, 
executor 5432, partition 3928, PROCESS_LOCAL, 8328 bytes)
   25/04/29 21:42:07 WARN task-result-getter-3 TaskSetManager: Lost task 1434.0 
in stage 23.0 (TID 26608, zw06-data-hdp-dn29550.mt, executor 5432): 
java.lang.OutOfMemoryError: GC overhead limit exceeded
   
   25/04/29 21:42:07 INFO task-result-getter-3 TaskSetManager: Handle failed 
task, add task to pendingTasks, task 1434.0 in stage 23.0 (TID 26608, 
zw06-data-hdp-dn29550.mt, executor 5432)
   25/04/29 21:42:07 INFO Reporter ApplicationMaster: AppMaster: 
targetNumExecutors=1400, pendingAllocate=0, runningExecutors=1400. 
   25/04/29 21:42:07 INFO dispatcher-BlockManagerMaster 
BlockManagerMasterEndpoint: Registering block manager 
zw06-data-hdp-dn27371.mt:26633 with 2004.6 MiB RAM, BlockManagerId(6004, 
zw06-data-hdp-dn27371.mt, 26633, None)
   25/04/29 21:42:07 INFO dispatcher-CoarseGrainedScheduler TaskSetManager: 
Starting task 1434.1 in stage 23.0 (TID 29936, zw06-data-hdp-dn27371.mt, 
executor 6004, partition 1434, PROCESS_LOCAL, 8328 bytes)
   25/04/29 21:42:07 INFO dispatcher-CoarseGrainedScheduler TaskSetManager: 
Starting task 3929.1 in stage 23.0 (TID 29937, zw06-data-hdp-dn27371.mt, 
executor 6004, partition 3929, PROCESS_LOCAL, 8328 bytes)
   25/04/29 21:42:07 INFO dispatcher-CoarseGrainedScheduler TaskSetManager: 
Starting task 3927.1 in stage 23.0 (TID 29938, zw06-data-hdp-dn27371.mt, 
executor 6004, partition 3927, PROCESS_LOCAL, 8328 bytes)
   25/04/29 21:42:08 INFO dispatcher-CoarseGrainedScheduler 
YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 5432.
   25/04/29 21:42:08 INFO Reporter ApplicationMaster: AppMaster: 
targetNumExecutors=1400, pendingAllocate=0, runningExecutors=1400. 
   25/04/29 21:42:08 INFO dag-scheduler-event-loop DAGScheduler: Executor lost: 
5432 (epoch 6)
   25/04/29 21:42:08 INFO dispatcher-BlockManagerMaster 
BlockManagerMasterEndpoint: Trying to remove executor 5432 from 
BlockManagerMaster.
   25/04/29 21:42:08 INFO dispatcher-BlockManagerMaster 
BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5432, 
zw06-data-hdp-dn29550.mt, 17839, None)
   25/04/29 21:42:08 INFO dag-scheduler-event-loop BlockManagerMaster: Removed 
5432 successfully in removeExecutor
   25/04/29 21:42:08 INFO dispatcher-CoarseGrainedScheduler 
YarnSchedulerBackend$YarnDriverEndpoint: Registered executor 
NettyRpcEndpointRef(spark-client://Executor) (10.119.69.11:58434) with ID 5996
   25/04/29 21:42:08 INFO spark-listener-group-executorManagement 
ExecutorMonitor: New executor 5996 has registered (new total is 1367)
   25/04/29 21:42:08 INFO Reporter ApplicationMaster: AppMaster: 
targetNumExecutors=1400, pendingAllocate=0, runningExecutors=1400. 
   25/04/29 21:42:08 INFO dispatcher-BlockManagerMaster 
BlockManagerMasterEndpoint: Registering block manager 
zw06-data-hdp-dn29769.mt:22001 with 2004.6 MiB RAM, BlockManagerId(5996, 
zw06-data-hdp-dn29769.mt, 22001, None)
   25/04/29 21:42:08 INFO dispatcher-CoarseGrainedScheduler TaskSetManager: 
Starting task 135.1 in stage 23.0 (TID 29939, zw06-data-hdp-dn29769.mt, 
executor 5996, partition 135, PROCESS_LOCAL, 8328 bytes)
   25/04/29 21:42:08 INFO dispatcher-CoarseGrainedScheduler TaskSetManager: 
Starting task 136.1 in stage 23.0 (TID 29940, zw06-data-hdp-dn29769.mt, 
executor 5996, partition 136, PROCESS_LOCAL, 8328 bytes)
   25/04/29 21:42:08 INFO dispatcher-CoarseGrainedScheduler TaskSetManager: 
Starting task 137.1 in stage 23.0 (TID 29941, zw06-data-hdp-dn29769.mt, 
executor 5996, partition 137, PROCESS_LOCAL, 8328 bytes)
   25/04/29 21:42:08 INFO dispatcher-CoarseGrainedScheduler 
YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 4786.
   25/04/29 21:42:08 INFO Reporter ApplicationMaster: AppMaster: 
targetNumExecutors=1400, pendingAllocate=0, runningExecutors=1400. 
   25/04/29 21:42:08 INFO dag-scheduler-event-loop DAGScheduler: Executor lost: 
4786 (epoch 6)
   25/04/29 21:42:08 INFO dispatcher-BlockManagerMaster 
BlockManagerMasterEndpoint: Trying to remove executor 4786 from 
BlockManagerMaster.
   25/04/29 21:42:08 INFO dispatcher-BlockManagerMaster 
BlockManagerMasterEndpoint: Removing block manager BlockManagerId(4786, 
zw06-data-hdp-dn33950.mt, 35009, None)
   25/04/29 21:42:08 INFO dag-scheduler-event-loop BlockManagerMaster: Removed 
4786 successfully in removeExecutor
   25/04/29 21:42:08 INFO Reporter ApplicationMaster: AppMaster: 
targetNumExecutors=1400, pendingAllocate=0, runningExecutors=1400. 
   25/04/29 21:42:08 INFO dispatcher-CoarseGrainedScheduler 
YarnSchedulerBackend$YarnDriverEndpoint: Registered executor 
NettyRpcEndpointRef(spark-client://Executor) (10.119.150.13:51114) with ID 5991
   25/04/29 21:42:08 INFO spark-listener-group-executorManagement 
ExecutorMonitor: New executor 5991 has registered (new total is 1368)
   25/04/29 21:42:08 ERROR celeborn-dispatcher-110 SparkUtils: Can not get 
TaskSetManager for taskId: 29935
   25/04/29 21:42:08 INFO celeborn-dispatcher-110 LifecycleManager: handle 
fetch failure for appShuffleId 5 shuffleId 5
   ```
   Found a new case for this PR. Task 29935 was launched on executor 5432. 
When executor 5432 was lost, `SparkUtils` could not get the `TaskSetManager` 
by task ID, because `taskSet.removeRunningTask(tid)` had already been called. 
So this PR seems useful for this case: the task can throw a fetch failure 
exception and rerun the stage.
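
   To make the sequence in the log easier to follow, here is a toy model of it. Every name in it (`ToyScheduler`, `Decision`, `onFetchFailure`) is hypothetical and only for illustration; none of this is Spark's or Celeborn's real code. It assumes that running tasks are indexed by task ID, that losing an executor removes its running tasks from that index, and that a failed lookup is treated as "report the fetch failure".

   ```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Toy model only: hypothetical stand-ins, not Spark or Celeborn APIs.
public class LostTaskLookupSketch {

    /** Stand-in for a scheduler that indexes running tasks by task id. */
    static final class ToyScheduler {
        private final Map<Long, String> runningTaskToExecutor = new HashMap<>();

        void launch(long taskId, String executorId) {
            runningTaskToExecutor.put(taskId, executorId);
        }

        /** Losing an executor drops every running task that was placed on it. */
        void executorLost(String executorId) {
            runningTaskToExecutor.values().removeIf(executorId::equals);
        }

        /** Empty when the task is no longer tracked as running. */
        Optional<String> lookup(long taskId) {
            return Optional.ofNullable(runningTaskToExecutor.get(taskId));
        }
    }

    /** Hypothetical outcomes of handling a fetch failure for one task. */
    enum Decision { INSPECT_TASK_SET, REPORT_FETCH_FAILURE }

    /** Assumed policy: if the task cannot be found, report the fetch failure. */
    static Decision onFetchFailure(ToyScheduler scheduler, long taskId) {
        return scheduler.lookup(taskId).isPresent()
                ? Decision.INSPECT_TASK_SET
                : Decision.REPORT_FETCH_FAILURE;
    }

    public static void main(String[] args) {
        ToyScheduler scheduler = new ToyScheduler();
        scheduler.launch(29935L, "5432"); // task 3928.1 in the log
        scheduler.launch(29936L, "6004"); // task 1434.1 in the log

        System.out.println("before loss: " + onFetchFailure(scheduler, 29935L));
        scheduler.executorLost("5432");
        System.out.println("after loss:  " + onFetchFailure(scheduler, 29935L));
    }
}
   ```

   In this model the lookup for TID 29935 stops succeeding once executor 5432 is removed, while TID 29936 on executor 6004 is unaffected. That mirrors the "Can not get TaskSetManager" error followed by the fetch failure handling in the log.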


