HeartSaVioR edited a comment on pull request #28363:
URL: https://github.com/apache/spark/pull/28363#issuecomment-735469211


   Btw, I'm also concerned (probably even more so) about the metadata log 
growing in FileStreamSource (#28422).
   
   Each entry in FileStreamSource's metadata log is much smaller than an entry 
in FileStreamSink's, so it is more resilient to the memory issue. But while 
there are 3rd party alternatives to FileStreamSink (as we all know), there is 
no alternative to FileStreamSource for reading from files. That means users are 
forced either to introduce an external process that reduces the number of 
files, to put less pressure on the metadata log in FileStreamSource, or to use 
other data sources for the input of SS.
   
   Unlike FileStreamSink, the fix is not that simple, because we support 
`latestFirst`. We didn't need to consider such an option in SPARK-20568 
(#22952), but we'll never be able to set a threshold for removing log entries 
if we keep supporting `latestFirst`: the meaning of `latest` keeps changing and 
the source reads backward, hence "any" file, even an ancient one, could be read 
in a later batch.
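
   To make the ordering argument concrete, here is a toy simulation. It is not 
Spark code: `simulate`, `arrivals`, and the integer modification times are 
hypothetical stand-ins, and each integer doubles as the file's identity. It 
only models "pick up to N unseen files per batch, sorted by modification time".

```python
def simulate(arrivals, max_files_per_trigger, latest_first):
    """Toy model of a file source picking files per micro-batch.

    arrivals[i] holds the modification times of files that become visible
    before batch i. Returns the modification times read in each batch.
    """
    visible = []
    seen = set()  # stands in for the source's metadata log
    batches = []
    for new_files in arrivals:
        visible.extend(new_files)
        unseen = [m for m in visible if m not in seen]
        # latest_first=True reads newest files first, i.e. "backward".
        unseen.sort(reverse=latest_first)
        batch = unseen[:max_files_per_trigger]
        seen.update(batch)
        batches.append(batch)
    return batches


arrivals = [[1, 2, 3, 4], [5, 6], [7, 8], []]
print(simulate(arrivals, 2, latest_first=True))   # [[4, 3], [6, 5], [8, 7], [2, 1]]
print(simulate(arrivals, 2, latest_first=False))  # [[1, 2], [3, 4], [5, 6], [7, 8]]
```

   With `latest_first=True` the oldest files (1 and 2) are only read in the 
last batch, so in this model no entry of `seen` could have been dropped 
earlier based on age. With oldest-first ordering the minimum modification time 
per batch only moves forward, which is what an age threshold would rely on.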
   
   I've also raised a discussion thread on the dev list, but didn't get any 
response from committers.
   
   
http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-quot-latestFirst-quot-option-and-metadata-growing-issue-in-File-stream-source-td29853.html
   
   That said, I see some voices wanting FileStreamSource to work just like the 
Kafka stream source, that is, to replace `latestFirst` with a start offset 
(last modified time for the file stream source). I think this is the right way 
to go, unless anyone shows that lots of users leverage `latestFirst` and their 
use case is not covered by a start offset.
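
   As a rough sketch of why a start offset would help, the toy model below 
reads files oldest-first from a start offset and purges log entries that fall 
behind a moving threshold. Everything here is hypothetical (the function name, 
the `retention` parameter, and integers standing in for both modification time 
and file identity); it is not how FileStreamSource is implemented.

```python
def run_with_start_offset(arrivals, start_offset, max_files_per_trigger, retention):
    """Toy model: read oldest-first from a start offset and purge old log entries.

    Returns (batches, log_sizes), where log_sizes[i] is the number of
    entries kept in the "seen" log after batch i.
    """
    visible = []
    seen_log = set()
    threshold = start_offset  # files older than this are never read
    batches, log_sizes = [], []
    for new_files in arrivals:
        visible.extend(new_files)
        candidates = sorted(
            m for m in visible if m >= threshold and m not in seen_log
        )
        batch = candidates[:max_files_per_trigger]
        seen_log.update(batch)
        if batch:
            # The threshold only moves forward, trailing the newest file read.
            threshold = max(threshold, batch[-1] - retention)
        # Entries behind the threshold can never be candidates again.
        seen_log = {m for m in seen_log if m >= threshold}
        batches.append(batch)
        log_sizes.append(len(seen_log))
    return batches, log_sizes


arrivals = [[1, 2, 3, 4], [5, 6], [7, 8], []]
batches, log_sizes = run_with_start_offset(
    arrivals, start_offset=3, max_files_per_trigger=2, retention=2
)
print(batches)    # [[3, 4], [5, 6], [7, 8], []]
print(log_sizes)  # [2, 3, 3, 3]
```

   In this model the log size stays bounded by the retention window instead of 
growing with the total number of files ever read. That depends on reading 
forward, which is exactly what `latestFirst` prevents.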
   
   WDYT?




