gaborgsomogyi commented on a change in pull request #27649: [SPARK-30900][SS]
FileStreamSource: Avoid reading compact metadata log twice if the query
restarts from compact batch
URL: https://github.com/apache/spark/pull/27649#discussion_r398594049
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSourceLog.scala
##########
@@ -122,8 +123,35 @@ class FileStreamSourceLog(
}
batches
}
+
+ def restore(): Array[FileEntry] = {
+ val files = allFiles()
+
+ // When restarting the query, there is a case which the query starts from compaction batch,
+ // and the batch has source metadata file to read. One case is that the previous query
+ // succeeded to read from inputs, but not finalized the batch for various reasons.
+ // The below code finds the latest compaction batch, and put entries for the batch into the
+ // file entry cache which would avoid reading compact batch file twice.
+ // It doesn't know about offset / commit metadata in checkpoint so doesn't know which exactly
+ // batch to start from, but in practice, only couple of latest batches are candidates to
+ // be started. We leverage the fact to skip calculation if possible.
+ files.lastOption.foreach { lastEntry =>
+ val latestBatchId = lastEntry.batchId
+ val latestCompactedBatchId = getAllValidBatches(latestBatchId, compactInterval)(0)
+ if (latestCompactedBatchId > 0 &&
+ (latestBatchId - latestCompactedBatchId) < PREV_NUM_BATCHES_TO_READ_IN_RESTORE) {
Review comment:
Maybe add a comment here explaining why this heuristic is useful.
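
For context, the condition in the hunk can be read as the standalone predicate below. This is only a rough sketch, not the PR's code: the object and method names are hypothetical, the threshold value of 2 is an assumption (the hunk does not show the actual value of PREV_NUM_BATCHES_TO_READ_IN_RESTORE), and the compaction rule (0-based batch ids, with every batch whose id + 1 is a multiple of compactInterval being a compaction batch) is an assumption about what `getAllValidBatches(...)(0)` returns.

```scala
object RestoreHeuristicSketch {
  // Hypothetical threshold; the real PREV_NUM_BATCHES_TO_READ_IN_RESTORE
  // value is not visible in the hunk.
  val PrevNumBatchesToReadInRestore: Long = 2L

  // Assumed stand-in for getAllValidBatches(batchId, compactInterval)(0):
  // the id of the latest compaction batch at or before batchId, or 0 when
  // no compaction batch exists yet.
  def latestCompactionBatchId(batchId: Long, compactInterval: Int): Long =
    math.max(0L, (batchId + 1) / compactInterval * compactInterval - 1)

  // True when the latest batch is close enough to the latest compaction
  // batch that a restart is likely to read the compact file again, so
  // caching its entries up front is expected to pay off.
  def shouldPreloadCompactBatch(latestBatchId: Long, compactInterval: Int): Boolean = {
    val compacted = latestCompactionBatchId(latestBatchId, compactInterval)
    compacted > 0 && (latestBatchId - compacted) < PrevNumBatchesToReadInRestore
  }
}
```

Under these assumptions, with compactInterval = 10 the predicate holds when the latest batch is 19 or 20 and fails at 21 and at 5. That matches the reasoning in the code comment: only a couple of the latest batches are candidates to be restarted, so the compact file is worth caching only when the compaction batch is among them.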