[GitHub] [spark] gaborgsomogyi commented on pull request #28422: [SPARK-17604][SS] FileStreamSource: provide a new option to have retention on input files

GitBox Tue, 16 Jun 2020 06:16:18 -0700


gaborgsomogyi commented on pull request #28422:
URL: https://github.com/apache/spark/pull/28422#issuecomment-644757113



   I agree, the confusion basically comes from `latestFirst`.
   > But then should we really open the possibility to trace back older files?
   
   I see a use case where it's useful: the query has fallen behind and files 
have piled up. The query must keep up with the incoming data but must also 
process the older files as a side job.
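
   To make that scenario concrete, here is a toy model of it. This is not Spark code and not how `FileStreamSource` is implemented; `plan_batches` and its parameters are hypothetical names, and the per-batch cap only loosely mirrors the idea behind `maxFilesPerTrigger`. It just shows which files each micro-batch would pick when the newest files are taken first versus the oldest first.

```python
def plan_batches(backlog, arrivals, max_files_per_trigger, latest_first):
    """Toy simulation of file selection per micro-batch (hypothetical helper).

    backlog: list of (timestamp, name) files already waiting at start.
    arrivals: list of lists; arrivals[i] holds the files that land
              just before micro-batch i is planned.
    Returns (batches, leftover) where batches is a list of file-name lists.
    """
    pending = list(backlog)
    batches = []
    for new_files in arrivals:
        pending.extend(new_files)
        # Newest first when latest_first is set, oldest first otherwise.
        pending.sort(key=lambda f: f[0], reverse=latest_first)
        # Take at most max_files_per_trigger files for this micro-batch.
        batch = pending[:max_files_per_trigger]
        pending = pending[max_files_per_trigger:]
        batches.append([name for _, name in batch])
    leftover = [name for _, name in pending]
    return batches, leftover


# Three old files have piled up; one new file arrives before each of the
# first two batches, none before the third.
backlog = [(1, "old-1"), (2, "old-2"), (3, "old-3")]
arrivals = [[(4, "new-4")], [(5, "new-5")], []]

newest_first, _ = plan_batches(backlog, arrivals, 2, latest_first=True)
oldest_first, _ = plan_batches(backlog, arrivals, 2, latest_first=False)
```

   In this model, newest-first picks up each new file in the batch right after it arrives and drains the backlog with the remaining capacity, while oldest-first does not reach `new-5` until the backlog is gone.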
   
   > Would we just simply do the thing we do with Kafka's "latest" option, 
which only affects the first batch and no-op in further batches?
   
   I'm not sure how exactly `latestFirst` should behave then. Should it create 
a single gigantic micro-batch which processes all the data and then switch 
back to normal mode?


