HeartSaVioR opened a new pull request #23840: [SPARK-24295][SS] Add option to retain only last batch in file stream sink metadata
URL: https://github.com/apache/spark/pull/23840

## What changes were proposed in this pull request?

This patch proposes adding an option to the file stream sink to retain only the last batch in the file log (metadata). This helps when a query outputs many files per batch, in which case compacting the metadata files into one can bring non-trivial overhead.

Please refer to [the comment in JIRA issue](https://issues.apache.org/jira/browse/SPARK-24295?focusedCommentId=16545577&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16545577) for more details on the overhead that the current file stream sink metadata and the file stream source metadata file index can bring to high-volume, long-running queries.

Because this patch purges old batches and retains only the last batch in the metadata, the metadata file index fails to construct the list of files when this option is enabled, and as a result the file (stream) source cannot read the output directory. To re-enable reading from the output directory, this patch also proposes adding an option to the file (stream) source that ignores metadata information when reading a directory. With this option, end users can also choose the faster of the in-memory file index and the metadata file index when the metadata file grows much larger.

## How was this patch tested?

Added unit tests.
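
The trade-off described above can be illustrated with a small toy model. This is only a sketch, not Spark's implementation: `ToySinkLog`, `compact_interval`, `retain_only_last`, and `simulate` are hypothetical names invented for illustration, and the model assumes that a compaction rewrites every entry seen so far into a single log file. It counts how many entries get written to the log in total, and what a metadata-based file index would see afterwards.

```python
from dataclasses import dataclass, field


@dataclass
class ToySinkLog:
    """Toy stand-in for a file sink metadata log (not Spark's implementation)."""
    compact_interval: int = 3        # hypothetical: every 3rd batch is a compaction
    retain_only_last: bool = False   # hypothetical flag modelling the proposed option
    batches: dict = field(default_factory=dict)  # batch id -> list of file names
    entries_written: int = 0         # total entries written to the log, a proxy for overhead

    def add(self, batch_id, files):
        if self.retain_only_last:
            # Purge all older batches; only the newest batch stays in the log.
            self.batches = {batch_id: list(files)}
        elif (batch_id + 1) % self.compact_interval == 0:
            # Compaction: rewrite every entry seen so far into one log file.
            merged = [f for b in sorted(self.batches) for f in self.batches[b]]
            self.batches = {batch_id: merged + list(files)}
        else:
            self.batches[batch_id] = list(files)
        self.entries_written += len(self.batches[batch_id])

    def all_files(self):
        """The file list a metadata-based file index could reconstruct."""
        return [f for b in sorted(self.batches) for f in self.batches[b]]


def simulate(log, n_batches, files_per_batch):
    """Feed batches into the log; return every file 'written' to the output directory."""
    directory = []
    for batch_id in range(n_batches):
        files = [f"part-{batch_id}-{i}" for i in range(files_per_batch)]
        directory.extend(files)
        log.add(batch_id, files)
    return directory


compacting = ToySinkLog()
last_only = ToySinkLog(retain_only_last=True)
directory = simulate(compacting, 6, 2)
simulate(last_only, 6, 2)

print("entries written (compacting):", compacting.entries_written)
print("entries written (last only): ", last_only.entries_written)
print("files visible via metadata (compacting):", len(compacting.all_files()))
print("files visible via metadata (last only): ", len(last_only.all_files()))
print("files in output directory:", len(directory))
```

In this model the compacting log writes more entries as the history grows, while the last-batch-only log writes a constant amount per batch but can no longer list the full output. That gap is why a reader has to fall back to listing the directory itself, which is what the proposed source-side option is meant to allow.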
