[
https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133902#comment-17133902
]
Hyukjin Kwon commented on SPARK-31962:
--------------------------------------
cc [~kabhwan] FYI
> Provide option to load files after a specified date when reading from a
> folder path
> -----------------------------------------------------------------------------------
>
> Key: SPARK-31962
> URL: https://issues.apache.org/jira/browse/SPARK-31962
> Project: Spark
> Issue Type: Improvement
> Components: SQL, Structured Streaming
> Affects Versions: 3.1.0
> Reporter: Christopher Highman
> Priority: Minor
>
> When using structured streaming with a FileDataSource, I've encountered a
> number of occasions where I want to be able to stream from a folder
> containing any number of historical delta files in CSV format. When I start
> reading from a folder, however, I might only care about files that were
> created after a certain time.
> {code:java}
> spark.readStream
>   .option("header", "true")
>   .option("delimiter", "\t")
>   .format("csv")
>   .load("/mnt/Deltas")
> {code}
>
> In
> [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala],
> there is a method, _checkAndGlobPathIfNecessary_, which appears to create an
> in-memory index of files for a given path. There may be a rather clean
> opportunity to consider options here.
> Having the ability to provide an option specifying a timestamp from which to
> begin globbing files would considerably reduce the complexity for a consumer
> who streams from a folder path but has no interest in reading what could be
> thousands of irrelevant files.
> One example could be a "createdFileTime" option accepting a UTC datetime, as below.
> {code:java}
> spark.readStream
>   .option("header", "true")
>   .option("delimiter", "\t")
>   .option("createdFileTime", "2020-05-01 00:00:00")
>   .format("csv")
>   .load("/mnt/Deltas")
> {code}
>
> If this option is specified, the expected behavior would be that files within
> the _"/mnt/Deltas/"_ path are consumed only if they were created at or after
> the specified time, whether for reading files in general or for structured
> streaming.
>
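For illustration only, the "created at or after" rule described in the issue could be sketched in plain Scala against a local directory, without Spark. This is not how Spark implements or would implement the option. The object name CreatedAfterFilter and its methods are hypothetical, the "yyyy-MM-dd HH:mm:ss" pattern is assumed from the example value in the issue, and the local filesystem's creation time stands in for whatever timestamp a real implementation would consult (some filesystems report the modification time there instead).

```scala
import java.io.File
import java.nio.file.{Files, Path}
import java.nio.file.attribute.BasicFileAttributes
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

// Hypothetical helper, not part of Spark.
object CreatedAfterFilter {
  // Assumed format, taken from the example value "2020-05-01 00:00:00".
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  /** Interprets the option value as a UTC datetime and returns epoch milliseconds. */
  def parseUtcMillis(value: String): Long =
    LocalDateTime.parse(value, fmt).toInstant(ZoneOffset.UTC).toEpochMilli

  /** Keeps a file whose timestamp is at or later than the threshold. */
  def isAtOrAfter(fileMillis: Long, thresholdMillis: Long): Boolean =
    fileMillis >= thresholdMillis

  /** Lists regular files directly under `dir` that pass the threshold check. */
  def listFilesAtOrAfter(dir: Path, thresholdMillis: Long): Seq[Path] = {
    // listFiles() returns null if `dir` is not a readable directory.
    val entries = Option(dir.toFile.listFiles()).getOrElse(Array.empty[File])
    entries
      .filter(_.isFile)
      .map(_.toPath)
      .filter { p =>
        val attrs = Files.readAttributes(p, classOf[BasicFileAttributes])
        isAtOrAfter(attrs.creationTime().toMillis, thresholdMillis)
      }
      .toSeq
  }
}
```

A caller would combine the two steps, e.g. CreatedAfterFilter.listFilesAtOrAfter(dir, CreatedAfterFilter.parseUtcMillis("2020-05-01 00:00:00")). Keeping the comparison in a separate pure function makes the inclusive "at or later than" boundary easy to check independently of the filesystem.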