bart-samwel commented on pull request #28841: URL: https://github.com/apache/spark/pull/28841#issuecomment-644719642
The option name `fileModifiedDate` doesn't convey at all that it's a *minimum* modified date. I can imagine use cases for lower bounds, upper bounds, and ranges, which would require at least two options, e.g. `filesModifiedAfter` and `filesModifiedBefore`.

There's also the option `pathGlobFilter`, which only supports globs, but there too other use cases exist, e.g. "files with path names lexicographically greater than a given file name", or "files with names that, after parsing, satisfy some interesting condition". It seems to me that this is asking for some more generic filtering functionality, e.g. something like `.fileFilter(lambda)`, where the lambda receives an object argument that has not only the path but also things like the modification date. That said, specific options can be pushed down into the data source (e.g. S3 supports prefix filters and `start-from`), so it would make sense to keep things as options where pushdown might be possible.

Weighing these alternatives, I would suggest using two options, for min and max.
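To make the min/max proposal concrete, here is a minimal sketch of the filtering semantics the two suggested options would encode. The helper name `filter_files` and the choice of strict (exclusive) bounds are illustrative assumptions, not what the PR implements:

```python
from datetime import datetime

def filter_files(files, modified_after=None, modified_before=None):
    """Illustrative sketch: keep paths whose modification time falls
    strictly between the two optional bounds (either may be None).

    `files` is a list of (path, mtime) pairs, where mtime is a datetime.
    """
    kept = []
    for path, mtime in files:
        # filesModifiedAfter: skip files at or before the lower bound
        if modified_after is not None and mtime <= modified_after:
            continue
        # filesModifiedBefore: skip files at or after the upper bound
        if modified_before is not None and mtime >= modified_before:
            continue
        kept.append(path)
    return kept
```

With both options set, the result is a range filter; with one set, a pure lower or upper bound, which covers the three use cases (lower bound, upper bound, range) with only two options.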