HeartSaVioR opened a new pull request #23840: [SPARK-24295][SS] Add option to 
retain only last batch in file stream sink metadata
URL: https://github.com/apache/spark/pull/23840
 
 
   ## What changes were proposed in this pull request?
   
   This patch proposes adding an option to the file stream sink to retain only 
the last batch in the file log (metadata). This helps when a query outputs many 
files per batch, in which case compacting the metadata files into one can add 
non-trivial overhead.
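
To give a rough sense of that overhead, here is a minimal sketch of a simplified cost model, not Spark code. It assumes that every `compact_interval`-th batch rewrites an entry for every file written so far, and that other batches write only their own entries. The function names and the interval of 10 are assumptions made for illustration.

```python
def entries_written_with_compaction(num_batches, files_per_batch, compact_interval=10):
    """Total log entries written across all batches in the simplified model."""
    total = 0
    for batch_id in range(num_batches):
        if (batch_id + 1) % compact_interval == 0:
            # Compaction batch: the compact file holds an entry for every
            # file written so far, including the current batch.
            total += (batch_id + 1) * files_per_batch
        else:
            # Regular batch: only this batch's entries are written.
            total += files_per_batch
    return total


def entries_written_last_batch_only(num_batches, files_per_batch):
    """Total log entries written when each batch writes only its own entries."""
    return num_batches * files_per_batch
```

In this model the compaction cost grows quadratically with the number of batches, while retaining only the last batch keeps the cost linear.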
   
   Please refer to [the comment in the JIRA 
issue](https://issues.apache.org/jira/browse/SPARK-24295?focusedCommentId=16545577&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16545577)
 for more details on the overhead that the current file stream sink metadata 
and the file stream source metadata file index can add to high-volume, 
long-running queries.
   
   Because this patch purges old batches and retains only the last batch in the 
metadata, the metadata file index fails to construct the list of files when 
this option is enabled, and as a result the file (stream) source cannot read 
the output directory. To re-enable reading from the output directory, this 
patch also proposes adding an option to the file (stream) source that ignores 
the metadata when reading a directory. With this option, end users can also 
choose whichever is faster, the in-memory file index or the metadata file 
index, once the metadata file grows large.
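
The interaction between the two options can be illustrated with a toy model in plain Python. This is a sketch under stated assumptions, not the patch's implementation: the metadata log is modeled as a dict from batch id to file names, and `list_output_files`, `ignore_metadata`, and `demo` are hypothetical names chosen for this example.

```python
import os
import tempfile


def list_output_files(output_dir, metadata_log, ignore_metadata=False):
    """Return the data files a reader would see in output_dir (toy model)."""
    if ignore_metadata:
        # Directory-listing path: skip the sink metadata entirely.
        return sorted(
            name for name in os.listdir(output_dir)
            if os.path.isfile(os.path.join(output_dir, name))
        )
    # Metadata path: in this model, every batch from 0 to the latest
    # must be present in the log to reconstruct the file list.
    latest = max(metadata_log)
    missing = [b for b in range(latest + 1) if b not in metadata_log]
    if missing:
        raise ValueError(f"metadata log is missing batches {missing}")
    return sorted(f for b in range(latest + 1) for f in metadata_log[b])


def demo():
    with tempfile.TemporaryDirectory() as out:
        full_log = {}
        for batch_id in range(3):
            name = f"part-{batch_id}.txt"
            with open(os.path.join(out, name), "w") as fh:
                fh.write("data")
            full_log[batch_id] = [name]
        # Keep only the last batch, as the sink option would.
        last_only = {2: full_log[2]}
        via_listing = list_output_files(out, last_only, ignore_metadata=True)
        try:
            list_output_files(out, last_only)
            metadata_ok = True
        except ValueError:
            metadata_ok = False
        return via_listing, metadata_ok
```

With only the last batch retained, the metadata path cannot rebuild the full file list in this model, while the directory-listing path still finds every file.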
   
   ## How was this patch tested?
   
   Added unit tests.

