HeartSaVioR commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-648763832


   Sorry, I've been busy with my own work and haven't had time to look into the 
details. I still need to find a block of time to do the code review, but I'd 
like to add my voice on the SS (Structured Streaming) case.
   
   > Granted, this context is specific to non-streaming file data sources. I 
was hopeful to find an equivalent perhaps with Structured Streaming but the 
closest I found was latestFirst and maxFileAge which each have their respective 
use cases but does not solve this particular one.
   
   I agree the combination of `latestFirst` and `maxFileAge` solves a specific 
use case, but in general I feel it has a strong disadvantage: the source can 
read files older than those it has already read, so the concept of an "offset" 
cannot be applied and there is no easy way to filter out already-processed 
files (the source is forced to remember all processed files). I'd like to 
replace it with an alternative that has consistent semantics.
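   
   To illustrate the difference, here is a minimal sketch in plain Python. It 
is not Spark's file source implementation; `FileEntry`, `SeenSetSource` and 
`TimestampOffsetSource` are hypothetical names, and the sketch assumes that no 
file ever shows up with a modification time at or below the current offset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    path: str
    mtime: int  # modification time, e.g. epoch millis


class SeenSetSource:
    """Hypothetical source that may read files in any order (e.g. newest first).

    Because an older file can be picked up after a newer one, the only way to
    skip processed files is to remember every path seen so far.
    """

    def __init__(self):
        self.seen = set()

    def next_batch(self, listing):
        new = [f for f in listing if f.path not in self.seen]
        self.seen.update(f.path for f in new)
        return new


class TimestampOffsetSource:
    """Hypothetical source that starts at a timestamp and only moves forward.

    The whole progress state is one number: the largest mtime read so far.
    """

    def __init__(self, start_timestamp):
        # Exclusive lower bound, so files at exactly start_timestamp are read.
        self.offset = start_timestamp - 1

    def next_batch(self, listing):
        new = sorted(
            (f for f in listing if f.mtime > self.offset),
            key=lambda f: (f.mtime, f.path),
        )
        if new:
            self.offset = new[-1].mtime
        return new
```

   With the timestamp-based variant the state to checkpoint is a single value, 
whereas the set in the first variant grows with every file processed (the 
sketch ignores any aging-out of old entries).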
   
   Having the source initially start from a specific timestamp makes more 
sense than that combination. This is behavior that other data sources support 
and that end users have wanted. Some SS-specific behavior would need to be 
applied (see how the start offset in the Kafka data source is applied in both 
batch and streaming), but we have a good reference (the Kafka data source), so 
it should not be that hard to implement correctly.
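   
   For reference, a rough PySpark sketch of how the Kafka source exposes the 
same `startingOffsets` option in batch and streaming. The helper names, and 
the server and topic values passed to them, are hypothetical; `spark` is 
assumed to be an existing `SparkSession` with the Kafka connector available.

```python
def kafka_source_options(servers, topic, starting_offsets="earliest"):
    """Options shared by the batch and streaming Kafka readers."""
    return {
        "kafka.bootstrap.servers": servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


def kafka_batch_df(spark, servers, topic):
    # Batch: the start position is applied on every run of the query.
    opts = kafka_source_options(servers, topic)
    return spark.read.format("kafka").options(**opts).load()


def kafka_stream_df(spark, servers, topic):
    # Streaming: the start position only applies when a new query starts;
    # a restarted query resumes from the offsets stored in its checkpoint.
    opts = kafka_source_options(servers, topic)
    return spark.readStream.format("kafka").options(**opts).load()
```

   The option name is identical in both modes; only the point at which it 
takes effect differs, which is the kind of consistency I'd like the file 
source to have.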
   
   This doesn't mean this PR should implement the SS part. I'd just like to 
see consistent options between batch and streaming, so it would be nice to also 
think about how the option should work in SS.




