FMX commented on code in PR #3070:
URL: https://github.com/apache/celeborn/pull/3070#discussion_r1923385508


##########
client-spark/spark-3/src/main/scala/org/apache/spark/shuffle/celeborn/CelebornShuffleReader.scala:
##########
@@ -369,7 +375,22 @@ class CelebornShuffleReader[K, C](
     }
   }
 
-  private def handleFetchExceptions(
+  @VisibleForTesting
+  def checkAndReportFetchFailureForUpdateFileGroupFailure(
+      celebornShuffleId: Int,
+      ce: Throwable): Unit = {
+    if (ce.getCause != null &&
+      (ce.getCause.isInstanceOf[InterruptedException] || ce.getCause.isInstanceOf[
+        TimeoutException])) {

Review Comment:
   A TimeoutException might happen when something is wrong with either the driver or the executor.
   If the executor is at fault, it may hit a timeout, in which case we want the task to retry itself.
   If the driver is at fault, throwing a fetch failure is unlikely to save the situation.
   So in the current implementation, we treat all timeout exceptions as normal exceptions and let the task retry itself.
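   To illustrate the behavior described above, here is a minimal, hypothetical sketch of the cause check: an exception whose cause is an `InterruptedException` or `TimeoutException` is treated as a normal error (the task retries itself) rather than being reported as a fetch failure. The object and method names below are illustrative, not taken from the PR.

   ```scala
   import java.util.concurrent.TimeoutException

   object FetchFailureHeuristic {
     // Returns true when the wrapped cause indicates a transient condition
     // (interruption or timeout) that the task should retry on its own,
     // instead of reporting a fetch failure to the scheduler.
     def shouldRetryAsNormalException(ce: Throwable): Boolean = {
       val cause = ce.getCause
       cause != null && (cause.isInstanceOf[InterruptedException] ||
         cause.isInstanceOf[TimeoutException])
     }
   }
   ```

   Under this heuristic, only exceptions with other (or no) causes would be considered for fetch-failure reporting.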



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
