HeartSaVioR opened a new pull request #27649:
URL: https://github.com/apache/spark/pull/27649


   ### What changes were proposed in this pull request?
   
   This patch addresses the case where the compact metadata file is read twice in 
FileStreamSource when a query restarts.
   
   When a query restarts, it may start from a compaction batch whose source 
metadata file already exists and has to be read. One such case is when the 
previous run succeeded in reading from the inputs but did not finalize the 
batch for some reason.
   
   The patch finds the latest compaction batch when restoring from the metadata 
log and puts the entries for that batch into the file entry cache, which avoids 
reading the compact batch file twice.
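   
   The idea can be illustrated with a minimal, language-neutral sketch in Python. 
This is not the code in the patch: `ToySourceLog`, `restore`, `get`, and 
`reads_on_restart` are hypothetical names, the on-disk log is simulated with a 
dict, and the rule for identifying a compaction batch is an assumption made for 
the example.

```python
from typing import Dict, List


def is_compaction_batch(batch_id: int, interval: int) -> bool:
    # Assumed rule for this sketch: batch ids interval-1, 2*interval-1, ...
    # are compaction batches.
    return (batch_id + 1) % interval == 0


class ToySourceLog:
    """Toy stand-in for a compactible metadata log (hypothetical, not Spark's class)."""

    def __init__(self, stored: Dict[int, List[str]], interval: int):
        self.stored = stored              # simulates metadata files on disk
        self.interval = interval
        self.cache: Dict[int, List[str]] = {}
        self.disk_reads: List[int] = []   # records every simulated file read

    def _read_file(self, batch_id: int) -> List[str]:
        self.disk_reads.append(batch_id)
        return list(self.stored[batch_id])

    def restore(self, preload_cache: bool) -> List[str]:
        """Read the latest compaction batch once; optionally keep it in the cache."""
        ids = [b for b in self.stored if is_compaction_batch(b, self.interval)]
        if not ids:
            return []
        latest = max(ids)
        entries = self._read_file(latest)
        if preload_cache:
            self.cache[latest] = entries
        return entries

    def get(self, batch_id: int) -> List[str]:
        """Serve a batch from the cache if present, otherwise read the file."""
        if batch_id in self.cache:
            return self.cache[batch_id]
        return self._read_file(batch_id)


def reads_on_restart(preload_cache: bool) -> List[int]:
    # Batch 2 is the compaction batch when interval == 3.
    stored = {0: ["a"], 1: ["b"], 2: ["a", "b", "c"]}
    log = ToySourceLog(stored, interval=3)
    log.restore(preload_cache)   # restoring state reads the compact file
    log.get(2)                   # the restarted query begins at batch 2
    return log.disk_reads
```

   In this sketch, `reads_on_restart(False)` records two reads of batch 2, while 
`reads_on_restart(True)` records only one, because the second access is served 
from the cache.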
   
   FileStreamSourceLog doesn't know about the offset / commit metadata in the 
checkpoint, so it doesn't know exactly which batch to start from. In practice, 
only a couple of the latest batches are candidates to start from when a query 
restarts. This patch leverages that fact to skip the calculation if possible.
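   
   A rough sketch of such a check, again in Python and under stated assumptions: 
the function names, the `lookback` parameter, and its default of 2 are 
hypothetical and only mirror the "couple of the latest batches" wording above. 
The compaction-batch rule is the same assumed rule, `(batch_id + 1) % interval == 0`.

```python
from typing import Optional


def latest_compaction_batch(latest_batch_id: int, interval: int) -> Optional[int]:
    """Largest compaction batch id <= latest_batch_id, or None if there is none yet."""
    candidate = ((latest_batch_id + 1) // interval) * interval - 1
    return candidate if candidate >= 0 else None


def should_preload(latest_batch_id: int, interval: int, lookback: int = 2) -> bool:
    """Preload only when the compaction batch is among the last `lookback` batches."""
    compact_id = latest_compaction_batch(latest_batch_id, interval)
    if compact_id is None:
        return False
    # If the compaction batch is further back, a restart cannot begin from it,
    # so reading and caching its entries would be wasted work.
    return latest_batch_id - compact_id < lookback
```

   With `interval=10`, this returns `True` when the latest batch is 9 or 10 and 
`False` when it is 8 or 11.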
   
   ### Why are the changes needed?
   
   Spark incurs the unnecessary cost of reading the compact metadata file twice in 
some cases, which may not be negligible when the query has processed a huge 
number of files so far.
   
   ### Does this PR introduce any user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New UT.




