bart-samwel commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-644719642


   The option name `fileModifiedDate` gives no indication that it is a minimum 
modified date. I can imagine use cases for lower bounds, upper bounds, and 
ranges. Covering those requires at least two options, e.g. 
`filesModifiedAfter` and `filesModifiedBefore`.
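
   As a rough sketch of what two independent bounds could mean (plain Python, 
not Spark code): `FileInfo` and `filter_by_modified` are hypothetical names, 
and treating both bounds as exclusive is an assumption, not something the PR 
defines.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple, Optional


class FileInfo(NamedTuple):
    # Hypothetical stand-in for a file listing entry; not a Spark class.
    path: str
    modified: datetime


def filter_by_modified(
    files: Iterable[FileInfo],
    after: Optional[datetime] = None,
    before: Optional[datetime] = None,
) -> List[FileInfo]:
    """Keep files with after < modified < before; a bound of None is ignored."""
    return [
        f
        for f in files
        if (after is None or f.modified > after)
        and (before is None or f.modified < before)
    ]


files = [
    FileInfo("a.csv", datetime(2020, 1, 1)),
    FileInfo("b.csv", datetime(2020, 6, 1)),
    FileInfo("c.csv", datetime(2020, 12, 1)),
]

# Lower bound only, upper bound only, and a range use the same two parameters.
newer = filter_by_modified(files, after=datetime(2020, 3, 1))
older = filter_by_modified(files, before=datetime(2020, 3, 1))
ranged = filter_by_modified(
    files, after=datetime(2020, 3, 1), before=datetime(2020, 9, 1)
)
```

   With two separate parameters, each of the three use cases is expressed by 
setting one bound, the other, or both.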
   
   There's also the option `pathGlobFilter`, which only supports globs, but 
there may be other use cases there as well, e.g. "files with path names 
lexicographically larger than a given file name", or "files with names that, 
after parsing, satisfy some interesting condition".
   
   It seems to me that this is asking for some more generic filtering 
functionality. E.g. something like `.fileFilter(lambda)`, where the lambda 
receives an object argument that has not only the path but also things like the 
modification date. That said, specific options may be pushed down into the data 
source (e.g. S3 supports prefix filters and `start-after`), so it would make 
sense to keep things as options when pushdown might be possible.
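
   To illustrate the generic-filter idea, here is a minimal sketch in plain 
Python rather than the Spark API. `FileEntry` and `file_filter` are 
hypothetical names; the point is only that one predicate over an object with a 
path and a modification time can express the glob, lexicographic, and date 
cases alike.

```python
from datetime import datetime
from fnmatch import fnmatchcase
from typing import Callable, Iterable, List, NamedTuple


class FileEntry(NamedTuple):
    # Hypothetical object handed to the predicate: path plus metadata.
    path: str
    modified: datetime


def file_filter(
    files: Iterable[FileEntry], predicate: Callable[[FileEntry], bool]
) -> List[FileEntry]:
    """Keep the entries for which the caller-supplied predicate returns True."""
    return [f for f in files if predicate(f)]


files = [
    FileEntry("data/part-0001.csv", datetime(2020, 1, 1)),
    FileEntry("data/part-0002.csv", datetime(2020, 6, 1)),
    FileEntry("data/part-0003.json", datetime(2020, 12, 1)),
]

# Glob-style match on the path.
csv_only = file_filter(files, lambda f: fnmatchcase(f.path, "*.csv"))

# Paths lexicographically larger than a given file name.
after_first = file_filter(files, lambda f: f.path > "data/part-0001.csv")

# Minimum modification time (inclusive here, purely by choice).
recent = file_filter(files, lambda f: f.modified >= datetime(2020, 6, 1))
```

   The trade-off is the one noted above: an opaque lambda is flexible, but a 
data source cannot inspect it, so it cannot be pushed down the way a named 
option can.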
   
   On balance, I would suggest using two options, one for the minimum and one 
for the maximum modification date.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
