HeartSaVioR edited a comment on pull request #28363: URL: https://github.com/apache/spark/pull/28363#issuecomment-735469211
Btw, I'm also concerned (probably more so) about the metadata log growing in FileStreamSource (#28422). Each entry in FileStreamSource's log is much smaller than one in FileStreamSink's, so it's more resilient to the memory issue. But while there are 3rd party alternatives to FileStreamSink (as we all know), there's no alternative to FileStreamSource for reading from files. That means users are forced either to introduce an external process that reduces the number of files, to put less pressure on the metadata log in FileStreamSource, or to use other data sources as the input of SS.

Unlike FileStreamSink, it's not that simple to introduce retention, just because we support `latestFirst`. We didn't need to consider such an option in SPARK-20568 (#22952), but we'll never be able to have a threshold for removing log entries if we keep supporting `latestFirst`: the meaning of `latest` keeps changing and it reads backward without a lower bound (`maxFileAge` is ignored), hence "any" file, even an ancient one, could be read in a later batch.

I've also raised a discussion thread but didn't get any committers' voice.

http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-quot-latestFirst-quot-option-and-metadata-growing-issue-in-File-stream-source-td29853.html

Though I do see some voices wanting FileStreamSource to work just like the Kafka stream source, that is, replacing `latestFirst` with a start offset (last modified time for the file stream source). I think this is the right way to go, unless anyone shows that there are lots of users leveraging `latestFirst` whose use case is not covered by a start offset. WDYT?
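To make the retention problem in the comment concrete, here is a toy Python model. It is only a sketch and not Spark's actual implementation: `ToyFileSource`, its `retention` field, and the `run` helper are hypothetical names made up for illustration, and modification times are plain integers. It assumes a "seen files" log whose entries are purged once they fall more than `retention` behind the newest seen file, and that the same cutoff is honored as a lower bound only when reading forward.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyFileSource:
    """Toy model of a file source with a 'seen files' log (not Spark code)."""
    max_files_per_batch: int
    latest_first: bool
    retention: Optional[int] = None  # hypothetical purge threshold
    seen: Dict[str, int] = field(default_factory=dict)  # path -> mtime

    def _cutoff(self) -> Optional[int]:
        # Entries older than (newest seen mtime - retention) are purgeable.
        if self.retention is None or not self.seen:
            return None
        return max(self.seen.values()) - self.retention

    def next_batch(self, listing: Dict[str, int]) -> List[str]:
        cutoff = self._cutoff()
        candidates = [(p, t) for p, t in listing.items() if p not in self.seen]
        if not self.latest_first and cutoff is not None:
            # Reading forward: the lower bound is honored, so files older
            # than the cutoff are never considered again.
            candidates = [c for c in candidates if c[1] >= cutoff]
        # Reading backward (latest_first): no lower bound is applied.
        candidates.sort(key=lambda c: c[1], reverse=self.latest_first)
        batch = candidates[: self.max_files_per_batch]
        for path, mtime in batch:
            self.seen[path] = mtime
        cutoff = self._cutoff()
        if cutoff is not None:
            # Purge log entries that fell behind the retention threshold.
            self.seen = {p: t for p, t in self.seen.items() if t >= cutoff}
        return [p for p, _ in batch]


def run(source: ToyFileSource, listing: Dict[str, int], batches: int) -> List[str]:
    """Run a fixed number of batches and collect every file that was read."""
    reads: List[str] = []
    for _ in range(batches):
        reads.extend(source.next_batch(listing))
    return reads


listing = {"a": 1, "b": 2, "c": 100, "d": 101}
forward = run(ToyFileSource(2, latest_first=False, retention=10), listing, 3)
backward = run(ToyFileSource(2, latest_first=True, retention=10), listing, 3)
print(forward)   # ['a', 'b', 'c', 'd']
print(backward)  # ['d', 'c', 'b', 'a', 'b', 'a']
```

In this toy setup, reading forward with a lower bound lets the old entries be purged safely, while reading backward purges the entries for `a` and `b` right after they are read and then picks them up again in the next batch. That is the sense in which a retention threshold and `latestFirst` don't fit together.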
