[GitHub] [spark] jerrypeng commented on a diff in pull request #38517: [SPARK-39591][SS] Async Progress Tracking

GitBox Mon, 19 Dec 2022 13:08:21 -0800


jerrypeng commented on code in PR #38517:
URL: https://github.com/apache/spark/pull/38517#discussion_r1052636003



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala:
##########
@@ -342,17 +342,14 @@ class MicroBatchExecution(
         isCurrentBatchConstructed = true
         availableOffsets = nextOffsets.toStreamProgress(sources)
         /* Initialize committed offsets to a committed batch, which at this
-         * is the second latest batch id in the offset log. */
-        if (latestBatchId != 0) {
-          val secondLatestOffsets = offsetLog.get(latestBatchId - 1).getOrElse {
-            logError(s"The offset log for batch ${latestBatchId - 1} doesn't exist, " +
-              s"which is required to restart the query from the latest batch $latestBatchId " +
-              "from the offset log. Please ensure there are two subsequent offset logs " +
-              "available for the latest batch via manually deleting the offset file(s). " +
-              "Please also ensure the latest batch for commit log is equal or one batch " +
-              "earlier than the latest batch for offset log.")
-            throw new IllegalStateException(s"batch ${latestBatchId - 1} doesn't exist")
-          }
+         * is the second latest batch id in the offset log.

Review Comment:
   This logic does not guarantee exactly-once behavior. It is merely a sanity
   check to guard against bugs, so removing it does not break exactly-once
   behavior.
   
   
   > Wouldn't it be more serious problem than supporting switch? If we really
   > want to support switching, can we only support switch for the case when
   > checkpoint interval is disabled, so that we don't make change on normal
   > microbatch execution which could lead to break on fault tolerance semantic?
   
   I don't quite follow. How would this work? The framework would have to
   somehow remember the settings of the previous run. We would need to add
   metadata to the offsets to determine which of them were written while async
   progress tracking was in use. We don't have this kind of functionality in
   Spark today, and I'm not convinced it is worthwhile to implement it for this
   use case.




