maropu commented on a change in pull request #28841: URL: https://github.com/apache/spark/pull/28841#discussion_r472591099
##########
File path: docs/sql-data-sources-generic-options.md
##########
@@ -119,3 +119,48 @@ To load all files recursively, you can use:
 {% include_example recursive_file_lookup r/RSparkSQLExample.R %}
 </div>
 </div>
+
+### Modification Time Path Filters
+`modifiedBefore` and `modifiedAfter` are options that can be
+applied together or separately in order to achieve greater
+granularity over which files may load during a Spark batch query.
+
+When the `timeZone` option is present, modified timestamps will be
+interpreted according to the specified zone. When a timezone option
+is not provided, modified timestamps will be interpreted according
+to the default zone specified within the Spark configuration. Without
+any timezone configuration, modified timestamps are interpreted as UTC.
+
+`modifiedBefore` will only allow files having last modified
+timestamps occurring before the specified time to load. For example,
+when`modifiedBefore` has the timestamp `2020-06-01T12:00:00` applied,
+all files modified after that time will not be considered when loading
+from a file data source.
+
+`modifiedAfter` only allows files having last modified timestamps
+occurring after the specified timestamp. For example, when`modifiedAfter`
+has the timestamp `2020-06-01T12:00:00` applied, only files modified after
+this time will be eligible when loading from a file data source. When both
+`modifiedBefore` and `modifiedAfter` are specified together, files having
+last modified timestamps within the resulting time range are the only files
+allowed to load.

Review comment:
How about rephrasing it like this? (IMHO it's okay to use the same description as the other docs in the codebase.):
```
* `modifiedBefore`: an optional timestamp to only include files with modification times occurring before the specified time. The provided timestamp must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
* `modifiedAfter`: an optional timestamp to only include files with modification times occurring after the specified time. The provided timestamp must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)

Note that, when the `timeZone` option is present, `modifiedBefore`/`modifiedAfter` will be interpreted according to the specified zone. When a timezone option is not provided, the timestamps will be interpreted according to the default zone specified within the Spark configuration. Without any timezone configuration, modified timestamps are interpreted as UTC.
```
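
For readers following the thread, the filtering semantics described in the suggested wording can be sketched in plain Python. This is only an illustration under stated assumptions, not Spark's implementation: the helper names `to_epoch_seconds` and `is_eligible` are hypothetical, the comparisons are assumed to be strict (the text says "before" and "after" without addressing the exact boundary), and the zone defaults to UTC as in the "without any timezone configuration" case.

```python
from datetime import datetime, timedelta, timezone, tzinfo
from typing import Optional

# Format named in the review comment: YYYY-MM-DDTHH:mm:ss
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"


def to_epoch_seconds(timestamp: str, zone: tzinfo = timezone.utc) -> float:
    """Interpret a zone-less timestamp string in `zone` (hypothetical helper)."""
    naive = datetime.strptime(timestamp, TIMESTAMP_FORMAT)
    return naive.replace(tzinfo=zone).timestamp()


def is_eligible(
    mtime_epoch: float,
    modified_before: Optional[str] = None,
    modified_after: Optional[str] = None,
    zone: tzinfo = timezone.utc,
) -> bool:
    """Return True if a file with this modification time would be loaded.

    Assumption: both bounds are exclusive. Either bound may be omitted,
    mirroring "applied together or separately".
    """
    if modified_before is not None:
        if not mtime_epoch < to_epoch_seconds(modified_before, zone):
            return False
    if modified_after is not None:
        if not mtime_epoch > to_epoch_seconds(modified_after, zone):
            return False
    return True


if __name__ == "__main__":
    noon = to_epoch_seconds("2020-06-01T12:00:00")
    half_past = noon + 30 * 60
    # Both options together select a time range.
    print(is_eligible(half_past,
                      modified_before="2020-06-01T13:00:00",
                      modified_after="2020-06-01T12:00:00"))
    # The same string names a different instant in a UTC+9 zone.
    print(to_epoch_seconds("2020-06-01T12:00:00", timezone(timedelta(hours=9))))
```

The zone is passed explicitly so the effect of the `timeZone` option is visible: the same timestamp string maps to a different instant, which shifts which files fall inside the window.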