HeartSaVioR commented on pull request #28841: URL: https://github.com/apache/spark/pull/28841#issuecomment-648763832
Sorry, I've been working through my own stuff and haven't had time to look into the details. I still need to find a block of time to do the code review, but I'd like to add my voice on the SS case.

> Granted, this context is specific to non-streaming file data sources. I was hopeful to find an equivalent perhaps with Structured Streaming but the closest I found was latestFirst and maxFileAge which each have their respective use cases but does not solve this particular one.

I agree the combination of `latestFirst` and `maxFileAge` solves a specific use case, but in general I feel it has a strong disadvantage: the source can read files older than the ones it read before, so the concept of an "offset" cannot be applied and there's no easy way to filter out already-processed files (the source is forced to remember all processed files). I wish to replace it with an alternative that has consistent semantics.

Starting initially from a specific timestamp makes more sense than that combination. This is behavior that other data sources support, or that end users have wanted. SS-specific behavior would need to be applied (see how the start offset in the Kafka data source is applied in both batch and streaming), but we have a good reference in the Kafka data source, so it should not be that hard to implement correctly.

This doesn't mean this PR should implement the SS part. I'd just like to see consistent options between batch and streaming, so it would be nice to also think about how the option should work in SS.
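To make the "offset" argument above concrete, here is a minimal sketch in plain Python. It is not Spark's file source implementation: `FileEntry`, `read_from_offset`, and `read_latest_first` are hypothetical names, and it assumes every file has a unique modification time. It only illustrates why an oldest-first read can track progress with a single timestamp, while a newest-first read has to remember every processed path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    path: str
    mtime: int  # modification time, e.g. epoch seconds (assumed unique per file)


def read_from_offset(files, offset, max_files):
    """Oldest-first read: every file with mtime <= offset is already done."""
    pending = sorted((f for f in files if f.mtime > offset), key=lambda f: f.mtime)
    batch = pending[:max_files]
    # The new offset is a single number, however many files were processed.
    new_offset = batch[-1].mtime if batch else offset
    return batch, new_offset


def read_latest_first(files, seen, max_files):
    """Newest-first read: later batches can reach back to older files,
    so the only way to skip processed files is to remember each one."""
    pending = sorted(
        (f for f in files if f.path not in seen),
        key=lambda f: f.mtime,
        reverse=True,
    )
    batch = pending[:max_files]
    return batch, seen | {f.path for f in batch}


files = [FileEntry("a", 1), FileEntry("b", 2), FileEntry("c", 3), FileEntry("d", 4)]

# Starting from a timestamp is just an initial offset: files at or before it are skipped.
batch, offset = read_from_offset(files, 1, 2)       # reads b, c
batch, offset = read_from_offset(files, offset, 2)  # reads d

# Newest-first: the second batch contains files older than the first batch.
first, seen = read_latest_first(files, set(), 2)    # reads d, c
second, seen = read_latest_first(files, seen, 2)    # reads b, a
```

In the first function the state carried between batches is one integer, which is what lets a start timestamp behave like a start offset. In the second, `seen` grows with every file ever processed.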
