[GitHub] [spark] maropu commented on a change in pull request #28841: [SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source

GitBox Tue, 18 Aug 2020 19:05:53 -0700


maropu commented on a change in pull request #28841:
URL: https://github.com/apache/spark/pull/28841#discussion_r472592405




##########
File path: docs/sql-data-sources-generic-options.md
##########
@@ -119,3 +119,48 @@ To load all files recursively, you can use:
 {% include_example recursive_file_lookup r/RSparkSQLExample.R %}
 </div>
 </div>
+
+### Modification Time Path Filters
+`modifiedBefore` and `modifiedAfter` are options that can be 
+applied together or separately in order to achieve greater
+granularity over which files may load during a Spark batch query.
+
+When the `timeZone` option is present, modified timestamps will be
+interpreted according to the specified zone. When a timezone option
+is not provided, modified timestamps will be interpreted according
+to the default zone specified within the Spark configuration. Without
+any timezone configuration, modified timestamps are interpreted as UTC.
+
+`modifiedBefore` will only allow files having last modified
+timestamps occurring before the specified time to load. For example,
+when `modifiedBefore` has the timestamp `2020-06-01T12:00:00` applied,
+all files modified after that time will not be considered when loading
+from a file data source.
+ 
+`modifiedAfter` only allows files having last modified timestamps
+occurring after the specified timestamp. For example, when `modifiedAfter`
+has the timestamp `2020-06-01T12:00:00` applied, only files modified after 
+this time will be eligible when loading from a file data source. When both
+`modifiedBefore` and `modifiedAfter` are specified together, files having
+last modified timestamps within the resulting time range are the only files
+allowed to load.

Review comment:
   > Note that, when the `timeZone` option is present, `modifiedBefore`/`modifiedAfter`
   > will be interpreted according to the specified zone. When a timezone option
   > is not provided, the timestamps will be interpreted according
   > to the default zone specified within the Spark configuration. Without
   > any timezone configuration, modified timestamps are interpreted as UTC.
   
   Could we say it like this instead?
   ```
   When a timezone option is not provided, the timestamps will be interpreted according
   to the Spark session timezone (`spark.sql.session.timeZone`).
   ```
   
https://github.com/apache/spark/pull/28841/files#diff-2614a5c9164a734b0208806133aa7de9R85-R88
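
To make the semantics under discussion concrete, here is a minimal sketch in plain Python. It does not call Spark. It only models the wording in the quoted docs: the threshold is read in the `timeZone` option's zone if one is given, otherwise in the session zone, otherwise as UTC, and a file is kept only when its last-modified time falls strictly inside the resulting range. The names `resolve_zone`, `to_epoch_millis`, and `select_files` are hypothetical, fixed UTC offsets stand in for named zones, and the strict comparisons are an assumption taken from the words "before" and "after".

```python
from datetime import datetime, timedelta, timezone, tzinfo
from typing import Dict, List, Optional

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def resolve_zone(option_zone: Optional[tzinfo], session_zone: Optional[tzinfo]) -> tzinfo:
    """Pick the zone used to interpret the filter timestamps (assumed order)."""
    if option_zone is not None:      # explicit `timeZone` option wins
        return option_zone
    if session_zone is not None:     # stand-in for spark.sql.session.timeZone
        return session_zone
    return timezone.utc              # fallback described in the quoted docs


def to_epoch_millis(local_timestamp: str, zone: tzinfo) -> int:
    """Read an ISO-8601 local timestamp in `zone` and return epoch milliseconds."""
    aware = datetime.fromisoformat(local_timestamp).replace(tzinfo=zone)
    return (aware - EPOCH) // timedelta(milliseconds=1)


def select_files(
    mtimes: Dict[str, int],
    modified_after: Optional[str] = None,
    modified_before: Optional[str] = None,
    option_zone: Optional[tzinfo] = None,
    session_zone: Optional[tzinfo] = None,
) -> List[str]:
    """Keep paths whose modification time (epoch ms) lies strictly inside the range."""
    zone = resolve_zone(option_zone, session_zone)
    lower = to_epoch_millis(modified_after, zone) if modified_after else None
    upper = to_epoch_millis(modified_before, zone) if modified_before else None
    return [
        path
        for path, mtime in mtimes.items()
        if (lower is None or mtime > lower) and (upper is None or mtime < upper)
    ]


# Two made-up files, modified one hour before and one hour after 12:00 UTC.
files = {"a.csv": 1591009200000, "b.csv": 1591016400000}
utc_minus_7 = timezone(timedelta(hours=-7))

print(select_files(files, modified_after="2020-06-01T12:00:00"))
print(select_files(files, modified_after="2020-06-01T12:00:00", option_zone=utc_minus_7))
```

The same timestamp string selects different files depending on which zone applies, which is why it helps for the docs to name the fallback setting explicitly.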






