[GitHub] [spark] HeartSaVioR edited a comment on pull request #28363: [SPARK-27188][SS] FileStreamSink: provide a new option to have retention on output files

GitBox Sun, 29 Nov 2020 14:57:57 -0800


HeartSaVioR edited a comment on pull request #28363:
URL: https://github.com/apache/spark/pull/28363#issuecomment-735469211



   Btw, I'm also concerned (probably even more so) about the metadata log 
growing in FileStreamSource (#28422).
   
   Each entry in FileStreamSource's metadata log is much smaller than one in 
FileStreamSink's, so it's more resilient to the memory issue. But while there 
are 3rd party alternatives to FileStreamSink (as we all know), there is no 
alternative to FileStreamSource for reading from files. That means users are 
forced either to introduce an external process that reduces the number of 
files, so that there is less pressure on the metadata log in FileStreamSource, 
or to use other data sources as the input of SS.
   
   Unlike FileStreamSink, it's not that simple to remove log entries, just 
because we support `latestFirst`. We didn't need to consider this option in 
SPARK-20568 (#22952), but we'll never be able to have a threshold for removing 
log entries if we keep supporting `latestFirst`: the meaning of `latest` keeps 
changing and the source reads backward without a lower bound (`maxFileAge` is 
ignored), hence "any" file, even an ancient one, could be read in a later batch.
   
   I've also raised a discussion thread, but didn't get any committers' voices.
   
   
http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-quot-latestFirst-quot-option-and-metadata-growing-issue-in-File-stream-source-td29853.html
   
   That said, I see some voices wanting FileStreamSource to work just like the 
Kafka stream source, that is, to replace `latestFirst` with a start offset (the 
last modified time, for the file stream source). That means we would support 
only forward scanning. I think this is the right way to go, unless someone shows 
that there are lots of users leveraging `latestFirst` and that their use cases 
are not covered by a start offset.
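   
   To make the argument concrete, here is a toy simulation in plain Python. It 
is not Spark code: `simulate`, `ARRIVALS`, `retention` and `apply_lower_bound` 
are hypothetical names, the "log" is just a set of paths, and each batch reads 
exactly one file. It only sketches the reasoning above, assuming that a purge 
threshold would be derived from the newest modification time read so far.

```python
from typing import Dict, List, Set


def simulate(arrivals: List[Dict[str, int]], latest_first: bool,
             retention: int, apply_lower_bound: bool) -> List[str]:
    """Read one file per batch; return the paths in the order they were read.

    arrivals[i] maps path -> modification time for files appearing before batch i.
    Log entries older than (newest mtime read - retention) are purged after
    every batch. apply_lower_bound=True also excludes files older than that
    threshold from the candidates (a maxFileAge-like bound).
    """
    files: Dict[str, int] = {}
    log: Set[str] = set()        # toy stand-in for the source metadata log
    read_order: List[str] = []
    newest_read = 0
    for new_files in arrivals:
        files.update(new_files)
        threshold = newest_read - retention
        candidates = [p for p in files
                      if p not in log
                      and (not apply_lower_bound or files[p] >= threshold)]
        if candidates:
            choose = max if latest_first else min
            pick = choose(candidates, key=files.get)
            read_order.append(pick)
            log.add(pick)
            newest_read = max(newest_read, files[pick])
        # Purge log entries that fall below the retention threshold.
        threshold = newest_read - retention
        log = {p for p in log if files[p] >= threshold}
    return read_order


# Three files exist up front, two more arrive later, then nothing new.
ARRIVALS = [{"a": 1, "b": 2, "c": 3}, {"d": 4}, {"e": 5}, {}, {}]

# Backward reading, log never purged: the ancient file "a" is read last.
no_purge = simulate(ARRIVALS, latest_first=True, retention=100,
                    apply_lower_bound=False)    # ['c', 'd', 'e', 'b', 'a']

# Backward reading, log purged, no lower bound: "c" is read again and again.
purged = simulate(ARRIVALS, latest_first=True, retention=1,
                  apply_lower_bound=False)      # ['c', 'd', 'e', 'c', 'c']

# Forward reading with a lower bound: the purge is safe, no duplicates.
forward = simulate(ARRIVALS, latest_first=False, retention=1,
                   apply_lower_bound=True)      # ['a', 'b', 'c', 'd', 'e']
```

   In this toy model, backward reading without a lower bound either has to keep 
every entry forever or re-reads a file whose entry was purged, whereas forward 
reading from an offset lets old entries be dropped.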
   
   WDYT?


