buska88 commented on PR #3090: URL: https://github.com/apache/celeborn/pull/3090#issuecomment-2854107945
```
25/04/29 21:42:07 INFO dispatcher-CoarseGrainedScheduler TaskSetManager: Starting task 3928.1 in stage 23.0 (TID 29935, zw06-data-hdp-dn29550.mt, executor 5432, partition 3928, PROCESS_LOCAL, 8328 bytes)
25/04/29 21:42:07 WARN task-result-getter-3 TaskSetManager: Lost task 1434.0 in stage 23.0 (TID 26608, zw06-data-hdp-dn29550.mt, executor 5432): java.lang.OutOfMemoryError: GC overhead limit exceeded
25/04/29 21:42:07 INFO task-result-getter-3 TaskSetManager: Handle failed task, add task to pendingTasks, task 1434.0 in stage 23.0 (TID 26608, zw06-data-hdp-dn29550.mt, executor 5432)
25/04/29 21:42:07 INFO Reporter ApplicationMaster: AppMaster: targetNumExecutors=1400, pendingAllocate=0, runningExecutors=1400.
25/04/29 21:42:07 INFO dispatcher-BlockManagerMaster BlockManagerMasterEndpoint: Registering block manager zw06-data-hdp-dn27371.mt:26633 with 2004.6 MiB RAM, BlockManagerId(6004, zw06-data-hdp-dn27371.mt, 26633, None)
25/04/29 21:42:07 INFO dispatcher-CoarseGrainedScheduler TaskSetManager: Starting task 1434.1 in stage 23.0 (TID 29936, zw06-data-hdp-dn27371.mt, executor 6004, partition 1434, PROCESS_LOCAL, 8328 bytes)
25/04/29 21:42:07 INFO dispatcher-CoarseGrainedScheduler TaskSetManager: Starting task 3929.1 in stage 23.0 (TID 29937, zw06-data-hdp-dn27371.mt, executor 6004, partition 3929, PROCESS_LOCAL, 8328 bytes)
25/04/29 21:42:07 INFO dispatcher-CoarseGrainedScheduler TaskSetManager: Starting task 3927.1 in stage 23.0 (TID 29938, zw06-data-hdp-dn27371.mt, executor 6004, partition 3927, PROCESS_LOCAL, 8328 bytes)
25/04/29 21:42:08 INFO dispatcher-CoarseGrainedScheduler YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 5432.
25/04/29 21:42:08 INFO Reporter ApplicationMaster: AppMaster: targetNumExecutors=1400, pendingAllocate=0, runningExecutors=1400.
25/04/29 21:42:08 INFO dag-scheduler-event-loop DAGScheduler: Executor lost: 5432 (epoch 6)
25/04/29 21:42:08 INFO dispatcher-BlockManagerMaster BlockManagerMasterEndpoint: Trying to remove executor 5432 from BlockManagerMaster.
25/04/29 21:42:08 INFO dispatcher-BlockManagerMaster BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5432, zw06-data-hdp-dn29550.mt, 17839, None)
25/04/29 21:42:08 INFO dag-scheduler-event-loop BlockManagerMaster: Removed 5432 successfully in removeExecutor
25/04/29 21:42:08 INFO dispatcher-CoarseGrainedScheduler YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.119.69.11:58434) with ID 5996
25/04/29 21:42:08 INFO spark-listener-group-executorManagement ExecutorMonitor: New executor 5996 has registered (new total is 1367)
25/04/29 21:42:08 INFO Reporter ApplicationMaster: AppMaster: targetNumExecutors=1400, pendingAllocate=0, runningExecutors=1400.
25/04/29 21:42:08 INFO dispatcher-BlockManagerMaster BlockManagerMasterEndpoint: Registering block manager zw06-data-hdp-dn29769.mt:22001 with 2004.6 MiB RAM, BlockManagerId(5996, zw06-data-hdp-dn29769.mt, 22001, None)
25/04/29 21:42:08 INFO dispatcher-CoarseGrainedScheduler TaskSetManager: Starting task 135.1 in stage 23.0 (TID 29939, zw06-data-hdp-dn29769.mt, executor 5996, partition 135, PROCESS_LOCAL, 8328 bytes)
25/04/29 21:42:08 INFO dispatcher-CoarseGrainedScheduler TaskSetManager: Starting task 136.1 in stage 23.0 (TID 29940, zw06-data-hdp-dn29769.mt, executor 5996, partition 136, PROCESS_LOCAL, 8328 bytes)
25/04/29 21:42:08 INFO dispatcher-CoarseGrainedScheduler TaskSetManager: Starting task 137.1 in stage 23.0 (TID 29941, zw06-data-hdp-dn29769.mt, executor 5996, partition 137, PROCESS_LOCAL, 8328 bytes)
25/04/29 21:42:08 INFO dispatcher-CoarseGrainedScheduler YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 4786.
25/04/29 21:42:08 INFO Reporter ApplicationMaster: AppMaster: targetNumExecutors=1400, pendingAllocate=0, runningExecutors=1400.
25/04/29 21:42:08 INFO dag-scheduler-event-loop DAGScheduler: Executor lost: 4786 (epoch 6)
25/04/29 21:42:08 INFO dispatcher-BlockManagerMaster BlockManagerMasterEndpoint: Trying to remove executor 4786 from BlockManagerMaster.
25/04/29 21:42:08 INFO dispatcher-BlockManagerMaster BlockManagerMasterEndpoint: Removing block manager BlockManagerId(4786, zw06-data-hdp-dn33950.mt, 35009, None)
25/04/29 21:42:08 INFO dag-scheduler-event-loop BlockManagerMaster: Removed 4786 successfully in removeExecutor
25/04/29 21:42:08 INFO Reporter ApplicationMaster: AppMaster: targetNumExecutors=1400, pendingAllocate=0, runningExecutors=1400.
25/04/29 21:42:08 INFO dispatcher-CoarseGrainedScheduler YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.119.150.13:51114) with ID 5991
25/04/29 21:42:08 INFO spark-listener-group-executorManagement ExecutorMonitor: New executor 5991 has registered (new total is 1368)
25/04/29 21:42:08 ERROR celeborn-dispatcher-110 SparkUtils: Can not get TaskSetManager for taskId: 29935
25/04/29 21:42:08 INFO celeborn-dispatcher-110 LifecycleManager: handle fetch failure for appShuffleId 5 shuffleId 5
```

I found a new case for this PR. Task 29935 was launched on executor 5432. When executor 5432 was lost, SparkUtils could not get the TaskSetManager by task ID, because taskSet.removeRunningTask(tid) had already been called. So this PR seems useful for this case: the task can throw a fetch failure exception and the stage can be rerun.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
