HeartSaVioR commented on code in PR #38517:
URL: https://github.com/apache/spark/pull/38517#discussion_r1052721653


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala:
##########
@@ -342,17 +342,14 @@ class MicroBatchExecution(
         isCurrentBatchConstructed = true
         availableOffsets = nextOffsets.toStreamProgress(sources)
         /* Initialize committed offsets to a committed batch, which at this
-         * is the second latest batch id in the offset log. */
-        if (latestBatchId != 0) {
-          val secondLatestOffsets = offsetLog.get(latestBatchId - 1).getOrElse {
-            logError(s"The offset log for batch ${latestBatchId - 1} doesn't exist, " +
-              s"which is required to restart the query from the latest batch $latestBatchId " +
-              "from the offset log. Please ensure there are two subsequent offset logs " +
-              "available for the latest batch via manually deleting the offset file(s). " +
-              "Please also ensure the latest batch for commit log is equal or one batch " +
-              "earlier than the latest batch for offset log.")
-            throw new IllegalStateException(s"batch ${latestBatchId - 1} doesn't exist")
-          }
+         * is the second latest batch id in the offset log.

Review Comment:
   This logic can affect the offset range of a microbatch. As the test you added
   shows, even with the async progress tracking flag off, a normal processing
   trigger can technically roll back multiple microbatches and compose their
   offsets into one. This breaks an assumption behind exactly-once semantics:
   every microbatch must plan its offset range before execution, and the range
   must not change once planned.
   
   This is why async progress tracking cannot work as is with the Delta sink and
   stateful operators. We blocked this for async progress tracking, but we are
   accidentally exposing it to the "normal" processing trigger.
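
   To make the concern concrete, here is a minimal sketch of how the restarted
   batch's range can change. It is a simplified model, not the code in
   MicroBatchExecution: offsets are plain Longs, both logs are in-memory maps,
   ranges are assumed to be (start, end], and every name in it
   (OffsetRangeSketch, plannedRange, rangeFromLastCommit) is hypothetical.

```scala
// Hypothetical, simplified model of the offset log and commit log.
// None of these names are Spark APIs; offsets are reduced to plain Longs.
object OffsetRangeSketch {
  /** Offset range (start, end] planned for one microbatch. */
  final case class OffsetRange(start: Long, end: Long)

  /** Range as originally planned: from the previous batch's end offset
    * (second latest entry in the offset log) to this batch's end offset. */
  def plannedRange(offsetLog: Map[Long, Long], batchId: Long): OffsetRange = {
    val start =
      if (batchId == 0L) 0L
      else offsetLog.getOrElse(
        batchId - 1,
        throw new IllegalStateException(s"batch ${batchId - 1} doesn't exist"))
    OffsetRange(start, offsetLog(batchId))
  }

  /** Range obtained when the start is taken from the last committed batch
    * instead, which may be several batches behind the offset log. */
  def rangeFromLastCommit(
      offsetLog: Map[Long, Long],
      lastCommitted: Option[Long],
      batchId: Long): OffsetRange = {
    val start = lastCommitted.map(id => offsetLog(id)).getOrElse(0L)
    OffsetRange(start, offsetLog(batchId))
  }

  def main(args: Array[String]): Unit = {
    // Offset log has batches 0..3, but only batches 0..1 were committed.
    val offsetLog = Map(0L -> 10L, 1L -> 20L, 2L -> 30L, 3L -> 40L)
    val lastCommitted: Option[Long] = Some(1L)

    // Batch 3 was planned to cover offsets after 30 up to 40.
    println(plannedRange(offsetLog, 3L))
    // Starting from the last commit, it would cover offsets after 20 up to 40,
    // i.e. the ranges of batches 2 and 3 composed into one.
    println(rangeFromLastCommit(offsetLog, lastCommitted, 3L))
  }
}
```

   In this model, the second range differs from what batch 3 originally
   planned, which is the situation a sink or stateful operator relying on
   fixed per-batch ranges cannot tolerate.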



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

