cchighman commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-650899177


   @HeartSaVioR 
   
   In effect it's no different from a path glob filter, except that instead 
of the wildcard matching a file extension, it matches other 
metadata: the modified date.  `pathGlobFilter` doesn't use offset-based 
semantics, and neither does this filter.
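
   To make the comparison concrete, here's a minimal Python sketch of the two 
filters as plain predicates over a file listing.  This is not the code in the 
PR; `FileEntry`, `glob_filter` and `modified_after_filter` are hypothetical 
names for illustration, and I'm assuming the glob is matched against the file 
name only.

```python
from datetime import datetime
from fnmatch import fnmatchcase
from typing import List, NamedTuple


class FileEntry(NamedTuple):
    """Hypothetical stand-in for one entry in a directory listing."""
    path: str
    modified: datetime  # last-modified time of the file itself


def glob_filter(files: List[FileEntry], pattern: str) -> List[FileEntry]:
    # Keep files whose name (last path component) matches the glob pattern.
    return [f for f in files if fnmatchcase(f.path.rsplit("/", 1)[-1], pattern)]


def modified_after_filter(files: List[FileEntry], threshold: datetime) -> List[FileEntry]:
    # Same shape of predicate, but evaluated on the modified date instead of the name.
    return [f for f in files if f.modified > threshold]
```

   Both functions are stateless: each looks at one file's metadata and says 
keep or skip.  Neither tracks a position in a stream.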
   
   What you're describing, though, sounds like the ability to use a timestamp 
to replay a segment of an event-sourced stream that acts as an 
append-only transaction log.  That would allow much better control over playing 
back streaming data from files.  I believe that would be an awesome feature, but 
it's not what this PR is trying to achieve.
   
   Here's a clear example of the difference: suppose I'm reading from a folder 
path containing files from 2008.  If I were using offset by timestamp, the 
timestamp might refer to the point in time when I consumed a particular file, 
with no relation to when the file itself was last modified.  So if my goal 
were to begin streaming only with files in the path that were modified after 
2019, I'd still be consuming older files.
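
   The same scenario as a small Python sketch.  Everything here is 
hypothetical: `TrackedFile`, its `consumed` field, and both functions are 
made-up names that model the two semantics, not an existing Spark API.

```python
from datetime import datetime
from typing import List, NamedTuple


class TrackedFile(NamedTuple):
    """Hypothetical record: a file plus the time the stream consumed it."""
    path: str
    modified: datetime  # when the file itself was last modified
    consumed: datetime  # assumed: when the streaming query processed it


def replay_from_offset(files: List[TrackedFile], start: datetime) -> List[str]:
    # Offset-by-timestamp semantics: select on when the file was consumed.
    return [f.path for f in files if f.consumed >= start]


def modified_after(files: List[TrackedFile], start: datetime) -> List[str]:
    # Filter semantics: select on the file's own modified date.
    return [f.path for f in files if f.modified > start]


FILES = [
    # An old file that the stream only picked up recently.
    TrackedFile("/logs/2008-report.csv", datetime(2008, 6, 1), datetime(2020, 2, 1)),
    TrackedFile("/logs/2020-report.csv", datetime(2020, 1, 10), datetime(2020, 2, 1)),
]

START = datetime(2019, 12, 31)
by_offset = replay_from_offset(FILES, START)  # includes the 2008 file
by_modified = modified_after(FILES, START)    # excludes the 2008 file
```

   With the consumed-time offset, the 2008 file still comes through because it 
was processed after the start point; with the modified-date filter it is 
skipped.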
   
   Let me know if my train of thought here is off; I appreciate your patience.
   
   @gengliangwang for comment, as the current implementation followed the 
guidance for `pathGlobFilter`.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


