gaborgsomogyi commented on a change in pull request #27649: [SPARK-30900][SS] FileStreamSource: Avoid reading compact metadata log twice if the query restarts from compact batch
URL: https://github.com/apache/spark/pull/27649#discussion_r398594049
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSourceLog.scala
 ##########
 @@ -122,8 +123,35 @@ class FileStreamSourceLog(
     }
     batches
   }
+
+  def restore(): Array[FileEntry] = {
+    val files = allFiles()
+
+    // When restarting the query, there is a case which the query starts from compaction batch,
+    // and the batch has source metadata file to read. One case is that the previous query
+    // succeeded to read from inputs, but not finalized the batch for various reasons.
+    // The below code finds the latest compaction batch, and put entries for the batch into the
+    // file entry cache which would avoid reading compact batch file twice.
+    // It doesn't know about offset / commit metadata in checkpoint so doesn't know which exactly
+    // batch to start from, but in practice, only couple of latest batches are candidates to
+    // be started. We leverage the fact to skip calculation if possible.
+    files.lastOption.foreach { lastEntry =>
+      val latestBatchId = lastEntry.batchId
+      val latestCompactedBatchId = getAllValidBatches(latestBatchId, compactInterval)(0)
+      if (latestCompactedBatchId > 0 &&
+          (latestBatchId - latestCompactedBatchId) < PREV_NUM_BATCHES_TO_READ_IN_RESTORE) {
 
 Review comment:
   Maybe add a comment explaining why this heuristic is useful.
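
   To make the condition in the hunk above concrete, here is a minimal standalone sketch of what it appears to check. It is not the PR's code. It assumes batch ids start at 0 and that a batch is a compaction batch when `(batchId + 1)` is a multiple of `compactInterval`. The names `RestoreHeuristicSketch`, `latestCompactionBatchId`, `shouldPreloadCache`, and `maxDistance` are hypothetical; `maxDistance` stands in for `PREV_NUM_BATCHES_TO_READ_IN_RESTORE`, whose value is not shown in this hunk.

```scala
object RestoreHeuristicSketch {
  // Assumption: batch ids start at 0 and a batch is a compaction batch when
  // (batchId + 1) is a multiple of compactInterval. Returns 0 when no
  // compaction batch exists at or before latestBatchId.
  def latestCompactionBatchId(latestBatchId: Long, compactInterval: Int): Long =
    math.max(0L, (latestBatchId + 1) / compactInterval * compactInterval - 1)

  // Mirrors the condition in the diff: only preload the cache when a
  // compaction batch exists and lies within maxDistance batches of the
  // latest batch. maxDistance is a hypothetical stand-in for
  // PREV_NUM_BATCHES_TO_READ_IN_RESTORE.
  def shouldPreloadCache(
      latestBatchId: Long,
      compactInterval: Int,
      maxDistance: Int): Boolean = {
    val compacted = latestCompactionBatchId(latestBatchId, compactInterval)
    compacted > 0 && (latestBatchId - compacted) < maxDistance
  }
}
```

   Under these assumptions, with a compact interval of 10 and a window of 2, a latest batch of 20 qualifies (latest compaction batch 19, distance 1), while a latest batch of 25 does not (distance 6). A query restarting at batch 25 would presumably not start from the compaction batch, so preloading its entries would be wasted work.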
