HeartSaVioR opened a new pull request #30411:
URL: https://github.com/apache/spark/pull/30411


   ### What changes were proposed in this pull request?
   
   Two new options, _modifiedBefore_ and _modifiedAfter_, are provided, each 
expecting a value in `YYYY-MM-DDTHH:mm:ss` format.  _PartitioningAwareFileIndex_ 
considers these options while listing files, just before applying configured 
_PathFilters_ such as `pathGlobFilter`.  A new _PathFilter_ class was derived 
to perform this filtering, and general housekeeping was done around the 
classes extending _PathFilter_.  Since multiple path filters may now apply at 
once, logic was introduced to combine them, and the associated tests were 
written.
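   The modification-time check described above can be sketched as follows. This is a hedged illustration in Python with hypothetical names (the PR's actual implementation is Scala inside _PartitioningAwareFileIndex_); whether each bound is strict or inclusive is an implementation detail assumed strict here:

   ```python
   from datetime import datetime, timezone

   def parse_option_ts(value):
       """Parse a 'YYYY-MM-DDTHH:mm:ss' option value into epoch milliseconds.
       Assumes UTC for simplicity; the real option may honor the session time zone."""
       dt = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
       return int(dt.timestamp() * 1000)

   def accept_file(mtime_ms, modified_after=None, modified_before=None):
       """Return True if a file's modification time passes both optional bounds."""
       if modified_after is not None and mtime_ms <= parse_option_ts(modified_after):
           return False
       if modified_before is not None and mtime_ms >= parse_option_ts(modified_before):
           return False
       return True
   ```

   Files whose listed modification time falls outside the bounds are dropped before any further path filtering runs.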
   
   ### Why are the changes needed?
   
   When loading files from a data source, there can often be thousands of 
files under a given path.  In many cases we want to start loading from a 
folder path and only pick up files with modification dates past a certain 
point.  Out of thousands of candidate files, only those modified after the 
specified timestamp are considered.  This saves significant time and removes 
the complexity of managing this filtering in user code.
   
   ### Does this PR introduce _any_ user-facing change?
   
   This PR introduces two options that can be used with batch-based Spark file 
data sources.  The documentation was updated with examples and usage of the 
new data source options.
   
   **Example Usages**  
   _Load all CSV files modified after date:_ 
   
`spark.read.format("csv").option("modifiedAfter","2020-06-15T05:00:00").load()`
   
   _Load all CSV files modified before date:_
   
`spark.read.format("csv").option("modifiedBefore","2020-06-15T05:00:00").load()`
  
   
   _Load all CSV files modified between two dates:_ 
   
`spark.read.format("csv").option("modifiedAfter","2019-01-15T05:00:00").option("modifiedBefore","2020-06-15T05:00:00").load()`
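   The between-dates example chains both options, so two filters apply to the same listing; the PR description also mentions logic to handle multiple potential path filters.  A minimal sketch of such composition (hypothetical helper, not Spark's actual API):

   ```python
   import fnmatch

   def and_filters(filters):
       """Compose several accept-functions; a file passes only if all accept it.
       Each filter takes a dict describing the file (hypothetical shape)."""
       return lambda file_info: all(f(file_info) for f in filters)

   # Example: combine a glob filter with a modification-time filter.
   glob_csv = lambda fi: fnmatch.fnmatch(fi["path"], "*.csv")
   recent = lambda fi: fi["mtime_ms"] > 1_579_000_000_000  # illustrative cutoff
   combined = and_filters([glob_csv, recent])
   ```

   With no filters configured, `and_filters([])` accepts everything, matching the default behavior when neither option is set.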
   
   ### How was this patch tested?
   
   A handful of unit tests were added covering the positive, negative, and 
edge-case code paths.  The change is also live in a handful of our Databricks 
dev environments.

