[ 
https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christopher Highman updated SPARK-31962:
----------------------------------------
    Description: 
When using structured streaming with a FileDataSource, I've encountered a 
number of occasions where I want to be able to stream from a folder containing 
any number of historical delta files in CSV format.  When I start reading from 
a folder, however, I might only care about files that were created after a 
certain time.
{code:java}
spark.readStream
     .option("header", "true")
     .option("delimiter", "\t")
     .format("csv")
     .load("/mnt/Deltas")
{code}
 

In 
[https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala],
 there is a method, _checkAndGlobPathIfNecessary_, which appears to create an 
in-memory index of files for a given path.  There may be a rather clean 
opportunity to consider options here.
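
For illustration only, here is a minimal sketch of how such a filter might be 
applied to the globbed paths.  The _filterByCreatedFileTime_ helper and the 
"createdFileTime" option are hypothetical, and HDFS-compatible file systems 
generally expose a modification time rather than a true creation time:
{code:java}
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: drop globbed paths older than the cutoff supplied
// via the proposed "createdFileTime" option (interpreted as UTC).
def filterByCreatedFileTime(
    fs: FileSystem,
    globbedPaths: Seq[Path],
    createdFileTime: Option[String]): Seq[Path] = {
  createdFileTime match {
    case None => globbedPaths
    case Some(ts) =>
      val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
      val cutoffMillis =
        LocalDateTime.parse(ts, fmt).toInstant(ZoneOffset.UTC).toEpochMilli
      globbedPaths.filter { path =>
        // Modification time (epoch millis) is the closest proxy Hadoop
        // offers for a file's creation time.
        fs.getFileStatus(path).getModificationTime >= cutoffMillis
      }
  }
}
{code}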

Having the ability to provide an option specifying a timestamp from which to 
begin globbing files would remove quite a bit of complexity for a consumer who 
streams from a folder path but has no interest in reading what could be 
thousands of files that are not relevant.

One example could be "createdFileTime", accepting a UTC datetime like below.
{code:java}
spark.readStream
     .option("header", "true")
     .option("delimiter", "\t")
     .option("createdFileTime", "2020-05-01 00:00:00")
     .format("csv")
     .load("/mnt/Deltas")
{code}
 

If this option is specified, the expected behavior would be that files within 
the _"/mnt/Deltas/"_ path must have been created at or later than the specified 
time in order to be consumed, whether for reading the files in general or for 
purposes of structured streaming.
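
As a hedged usage sketch built on the hypothetical helper above, filtering the 
contents of _/mnt/Deltas/_ before handing the surviving paths to a reader might 
look like this:
{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical usage: glob the folder, then keep only files written at or
// after 2020-05-01 00:00:00 UTC, mirroring the proposed option's semantics.
val fs = FileSystem.get(new Configuration())
val candidates = fs.globStatus(new Path("/mnt/Deltas/*")).map(_.getPath).toSeq
val recentEnough =
  filterByCreatedFileTime(fs, candidates, Some("2020-05-01 00:00:00"))
{code}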

 

> Provide option to load files after a specified date when reading from a 
> folder path
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-31962
>                 URL: https://issues.apache.org/jira/browse/SPARK-31962
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, Structured Streaming
>    Affects Versions: 3.1.0
>            Reporter: Christopher Highman
>            Priority: Minor
>