taiyang-li commented on issue #5823:
URL: https://github.com/apache/incubator-gluten/issues/5823#issuecomment-2124003903

   Root cause:
   
   1. With Spark speculative execution enabled, the driver kills the current task with the reason "another attempt succeeded".
   ```
   2024/05/22 14:33:47.244 INFO [dispatcher-Executor] spark.executor.Executor: Executor is trying to kill task 91.0 in stage 6.0 (TID 2398), reason: another attempt succeeded
    ``` 
   
   
![image](https://github.com/apache/incubator-gluten/assets/8181003/fa52d76b-924b-408a-a013-8c46b3cb5028)
    
   2. The Java thread is executing ShuffleClientImpl.doPushMergeData when the kill arrives. The kill raises a java.lang.InterruptedException, which the function catches and wraps in a new org.apache.celeborn.common.exception.CelebornIOException.
   
   ``` java
       // do push merged data
       try {
         if (!isPushTargetWorkerExcluded(batches.get(0).loc, wrappedCallback)) {
           if (!testRetryRevive || remainReviveTimes < 1) {
             assert dataClientFactory != null;
             TransportClient client = dataClientFactory.createClient(host, port);
             client.pushMergedData(mergedData, pushDataTimeout, wrappedCallback);
           } else {
             wrappedCallback.onFailure(
                 new CelebornIOException(
                     StatusCode.PUSH_DATA_FAIL_NON_CRITICAL_CAUSE_PRIMARY,
                     new RuntimeException("Mock push merge data failed.")));
           }
         }
       } catch (Exception e) {
         logger.error(
             "Exception raised while pushing merged data for shuffle {} map {} attempt {} partition {} groupedBatch {} batch {} location {}.",
             shuffleId,
             mapId,
             attemptId,
             Arrays.toString(partitionIds),
             groupedBatchId,
             Arrays.toString(batchIds),
             addressPair,
             e);
         wrappedCallback.onFailure(
             new CelebornIOException(StatusCode.PUSH_DATA_CREATE_CONNECTION_FAIL_PRIMARY, e));
       }
   ```
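
The wrapping in step 2 can be reproduced in isolation. The following is a minimal sketch, not Celeborn code: `InterruptWrapDemo`, `blockingPush`, and `pushAndWrap` are hypothetical names, and a plain `IOException` stands in for `CelebornIOException`. It shows that a catch-all `catch (Exception e)` turns the interrupt into an ordinary I/O failure, and that the thread's interrupt flag is already cleared by the time the wrapper is built.

```java
import java.io.IOException;

final class InterruptWrapDemo {
    // Hypothetical stand-in for a blocking push call; not Celeborn's API.
    public static void blockingPush() throws InterruptedException {
        Thread.sleep(10_000);
    }

    // Mirrors the catch-all pattern: every exception, including
    // InterruptedException, is wrapped in an IOException.
    public static IOException pushAndWrap() {
        try {
            blockingPush();
            return null; // no failure
        } catch (Exception e) {
            return new IOException("push failed", e);
        }
    }

    public static void main(String[] args) {
        // Simulate the task kill: set the interrupt flag before the blocking call.
        Thread.currentThread().interrupt();
        IOException wrapped = pushAndWrap();
        // The original exception survives only as the cause of the wrapper.
        System.out.println(wrapped.getCause().getClass().getName()); // java.lang.InterruptedException
        // Thread.sleep cleared the flag when it threw, so the kill is no longer visible here.
        System.out.println(Thread.currentThread().isInterrupted()); // false
    }
}
```

After the wrap, the only remaining trace of the kill is the cause chain of the new exception, so any layer that looks only at the outer exception type sees a plain failure.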
   
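   3. At this point, the C++ code in the ClickHouse backend catches the java.lang.InterruptedException and rethrows it as a DB::Exception. The current thread terminates here, and the task is marked FAILED. This is not the expected behavior: the correct task state is KILLED.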
   
   ```
   libc++abi: terminating due to uncaught exception of type DB::Exception: org.apache.celeborn.common.exception.CelebornIOException: Register shuffle failed for shuffle 6.
   ```
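
One way to tell a kill from a genuine failure is to walk the cause chain of the wrapped exception and look for an `InterruptedException`. The sketch below only illustrates that idea on the JVM side and is not the fix applied in Gluten; `KillClassifier`, `TaskOutcome`, `causedByInterrupt`, and `classify` are hypothetical names, and a plain `IOException` again stands in for `CelebornIOException`.

```java
import java.io.IOException;

final class KillClassifier {
    // Hypothetical outcome labels, named after the task states discussed above.
    enum TaskOutcome { KILLED, FAILED }

    // Returns true if any exception in the cause chain is an InterruptedException.
    // The depth bound guards against cyclic cause chains.
    public static boolean causedByInterrupt(Throwable t) {
        for (int depth = 0; t != null && depth < 32; depth++, t = t.getCause()) {
            if (t instanceof InterruptedException) {
                return true;
            }
        }
        return false;
    }

    // An interrupt anywhere in the chain is treated as a kill; anything else is a failure.
    public static TaskOutcome classify(Throwable t) {
        return causedByInterrupt(t) ? TaskOutcome.KILLED : TaskOutcome.FAILED;
    }

    public static void main(String[] args) {
        Throwable fromKill = new IOException("push failed", new InterruptedException("task killed"));
        Throwable realFailure = new IOException("connection refused");
        System.out.println(classify(fromKill));    // KILLED
        System.out.println(classify(realFailure)); // FAILED
    }
}
```

This check only works while the exception is still a Java `Throwable`; once it has been converted into a C++ exception carrying just the message text, the cause chain is no longer available.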



