HeartSaVioR opened a new pull request #27649: URL: https://github.com/apache/spark/pull/27649
### What changes were proposed in this pull request?

This patch addresses the case where the compact metadata file is read twice in FileStreamSource when a query restarts.

When a query restarts, it may start from a compaction batch that has a source metadata file to read. One such case is when the previous run succeeded in reading from the inputs but did not finalize the batch for some reason.

The patch finds the latest compaction batch when restoring from the metadata log and puts the entries for that batch into the file entry cache, which avoids reading the compact batch file twice.

FileStreamSourceLog doesn't know about the offset / commit metadata in the checkpoint, so it doesn't know exactly which batch the query will start from. In practice, however, only the latest couple of batches are candidates for the starting batch on restart. The patch leverages this fact to skip the calculation when possible.

### Why are the changes needed?

Spark incurs an unnecessary cost by reading the compact metadata file twice in some cases, which may not be negligible when the query has already processed a huge number of files.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New UT.
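
To illustrate the idea of filling the file entry cache while restoring, here is a minimal toy sketch in Python. It is not the code in this PR: `ToySourceLog`, `warm_cache`, `max_lag`, and the in-memory `files` dict are hypothetical stand-ins. The compaction rule `(batch_id + 1) % interval == 0`, the interval of 3, and the rule that the cache is filled only when the latest batch is within `max_lag` of the latest compaction batch are assumptions made for the demo.

```python
from typing import Dict, List

COMPACT_INTERVAL = 3  # assumed small interval so the demo stays short


def is_compaction_batch(batch_id: int, interval: int = COMPACT_INTERVAL) -> bool:
    # Assumed rule: every `interval`-th batch holds the compacted log.
    return (batch_id + 1) % interval == 0


class ToySourceLog:
    """Toy stand-in for a file source metadata log (hypothetical, not Spark's class)."""

    def __init__(self, files: Dict[int, List[str]], warm_cache: bool, max_lag: int = 2):
        self.files = files                    # batch id -> entries stored in that batch file
        self.cache: Dict[int, List[str]] = {}  # file entry cache
        self.reads: Dict[int, int] = {}        # number of times each batch file was read
        self.restored = self._restore(warm_cache, max_lag)

    def _read_file(self, batch_id: int) -> List[str]:
        # Simulates the expensive read of one metadata file and counts it.
        self.reads[batch_id] = self.reads.get(batch_id, 0) + 1
        return list(self.files[batch_id])

    def _restore(self, warm_cache: bool, max_lag: int) -> List[str]:
        # Rebuild the list of known entries, as a restarting source would.
        if not self.files:
            return []
        latest = max(self.files)
        compact = max((b for b in self.files if is_compaction_batch(b)), default=None)
        if compact is None:
            return [e for b in sorted(self.files) for e in self._read_file(b)]
        entries = self._read_file(compact)
        # Only the latest couple of batches can be the restart point, so the
        # compact batch is cached only when it is close to the latest batch.
        if warm_cache and latest - compact < max_lag:
            self.cache[compact] = entries
        seen = list(entries)
        for b in range(compact + 1, latest + 1):
            seen.extend(self._read_file(b))
        return seen

    def get(self, batch_id: int) -> List[str]:
        # Serve from the cache when possible; otherwise read the file again.
        if batch_id in self.cache:
            return self.cache[batch_id]
        return self._read_file(batch_id)


# Batch 2 is the compaction batch and holds all entries up to that point.
files = {0: ["a"], 1: ["b"], 2: ["a", "b", "c"], 3: ["d"]}

cold = ToySourceLog(files, warm_cache=False)
cold.get(2)  # the restarted query starts from the compaction batch
warm = ToySourceLog(files, warm_cache=True)
warm.get(2)
print(cold.reads[2], warm.reads[2])
```

In this toy setup the compact batch file is read twice without the warm-up (once during restore, once when the batch is requested) and once with it. The sketch also skips details such as filtering entries by batch id.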
