HeartSaVioR opened a new pull request #30411:
URL: https://github.com/apache/spark/pull/30411
### What changes were proposed in this pull request?
Two new options, _modifiedBefore_ and _modifiedAfter_, are provided, each
expecting a value in `YYYY-MM-DDTHH:mm:ss` format. _PartitioningAwareFileIndex_
considers these options while listing files, just before applying other
_PathFilters_ such as `pathGlobFilter`. A new PathFilter class was derived to
perform this filtering, and general housekeeping was done around the classes
extending PathFilter for neatness.
It also became apparent that support was needed for applying multiple path
filters at once. Logic was introduced for this purpose and the associated
tests written.
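To illustrate the idea, here is a minimal, self-contained sketch of the filtering described above. The names `FileEntry` and `filter_by_modified` are hypothetical and not part of the actual Spark code; the sketch only shows how files could be kept or dropped based on the two timestamp options.

```python
from datetime import datetime
from typing import List, NamedTuple, Optional

class FileEntry(NamedTuple):
    # Hypothetical stand-in for a listed file's path and modification time.
    path: str
    modified: datetime

def parse_ts(s: str) -> datetime:
    # Parse the 'YYYY-MM-DDTHH:mm:ss' form the options expect.
    return datetime.strptime(s, "%Y-%m-%dT%H:%M:%S")

def filter_by_modified(files: List[FileEntry],
                       modified_after: Optional[str] = None,
                       modified_before: Optional[str] = None) -> List[FileEntry]:
    """Keep files modified strictly after modified_after and strictly
    before modified_before; a missing option imposes no bound."""
    after = parse_ts(modified_after) if modified_after else None
    before = parse_ts(modified_before) if modified_before else None
    return [f for f in files
            if (after is None or f.modified > after)
            and (before is None or f.modified < before)]
```

Passing both options filters to the window between the two timestamps, mirroring the "modified between two dates" example below.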
### Why are the changes needed?
When loading files from a data source, there can often be thousands of files
under a given path. In many cases I've seen, we want to start loading from a
folder path and only consider files whose modification dates fall past a
certain point. This means that out of thousands of candidate files, only the
ones modified after the specified timestamp are read. This saves significant
time automatically and removes considerable complexity from user code that
would otherwise manage this manually.
### Does this PR introduce _any_ user-facing change?
This PR introduces two options that can be used with batch-based Spark file
data sources. A documentation update was made to reflect an example and usage
of the new data source options.
**Example Usages**
_Load all CSV files modified after date:_
`spark.read.format("csv").option("modifiedAfter","2020-06-15T05:00:00").load()`
_Load all CSV files modified before date:_
`spark.read.format("csv").option("modifiedBefore","2020-06-15T05:00:00").load()`
_Load all CSV files modified between two dates:_
`spark.read.format("csv").option("modifiedAfter","2019-01-15T05:00:00").option("modifiedBefore","2020-06-15T05:00:00").load()`
### How was this patch tested?
A handful of unit tests were added to support the positive, negative, and
edge case code paths. It's also live in a handful of our Databricks dev
environments.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]