cchighman commented on pull request #28841: URL: https://github.com/apache/spark/pull/28841#issuecomment-650899177
@HeartSaVioR In effect it's no different from a path glob filter, except that instead of the wildcard matching a file extension, it matches other metadata: the modified date. `pathGlobFilter` doesn't use offset-based semantics either.

What it sounds like you're describing, though, is the ability to use a timestamp so that you can replay some segment of an event-sourced stream that's acting as an append-only transaction log. That would allow much better control over playing back streaming data from files. I believe that would be an awesome feature, but it's not what this PR is trying to achieve.

Here's a clear example of the difference: suppose I'm reading from a folder path containing files from 2008. If I were using offset by timestamp, the timestamp may refer to the point in time when I consumed a particular file, with no relation to when the file itself was last modified. So if my goal were to begin streaming only with files in the path that were modified after 2019, I'd still be consuming older files.

Let me know if my train of thought here is off; I appreciate your patience.

@gengliangwang for comment, as the current implementation followed the guidance for `pathGlobFilter`.
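To make the comparison concrete, here is a minimal sketch in plain Python rather than Spark code. `FileEntry`, `glob_filter`, and `modified_after_filter` are hypothetical names invented for illustration and are not part of this PR or of any Spark API. The assumption is only that both filters are stateless predicates over the file listing: one looks at the path, the other at the file's own last-modified time.

```python
from dataclasses import dataclass
from datetime import datetime
from fnmatch import fnmatch


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for one entry in a file listing (not a Spark class)."""
    path: str
    modified: datetime  # last-modified time of the file itself


def glob_filter(files, pattern):
    """Keep files whose path matches a glob pattern (a filter on the name)."""
    return [f for f in files if fnmatch(f.path, pattern)]


def modified_after_filter(files, threshold):
    """Keep files last modified strictly after the threshold (a filter on metadata)."""
    return [f for f in files if f.modified > threshold]


# Illustrative listing only; these files and dates are made up.
files = [
    FileEntry("events-2008.json", datetime(2008, 3, 1)),
    FileEntry("events-2015.csv", datetime(2015, 7, 9)),
    FileEntry("events-2020.json", datetime(2020, 1, 15)),
]

json_files = glob_filter(files, "*.json")
recent_files = modified_after_filter(files, datetime(2019, 12, 31))
```

Neither filter tracks when a file was consumed, which is the point of the comparison: an offset-by-timestamp feature would need that consumption state, while a modified-date filter only inspects each file's metadata.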
