Jeff Steinmetz created SPARK-49051:
--------------------------------------
Summary: Provide modifiedAfter and modifiedBefore options when
filtering from a stream source
Key: SPARK-49051
URL: https://issues.apache.org/jira/browse/SPARK-49051
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.4.3
Reporter: Jeff Steinmetz
In the following Jira issue
https://issues.apache.org/jira/browse/SPARK-31962
two new options, *modifiedBefore* and *modifiedAfter*, were introduced for batch reads (for
example, CSV) and eventually merged into version 3.1.1 via PR:
https://issues.apache.org/jira/browse/SPARK-31962
As implemented, batch reads accept these two options, but a streaming read
explicitly does not allow them.
When loading files from a data source as a stream, a given file path can also
contain thousands of files, so the need to filter by modification time applies to both
batch and stream use cases. Note: The Databricks "cloudFiles" AutoLoader
supports these options in a stream.
[https://docs.databricks.com/en/ingestion/auto-loader/options.html#id20]
{{*Suggested Example Usages*}}
{{_Start stream with all CSV files modified after date:_}}
{{spark.readStream.option("modifiedAfter","2020-06-15T05:00:00").option("quote",
'"').option("escape", '"').csv(source_path)}}
{{_Start stream with all CSV files modified before date:_}}
{{spark.readStream.option("modifiedBefore","2020-06-15T05:00:00").option("quote",
'"').option("escape", '"').csv(source_path)}}
{{_Start stream with all CSV files modified between two dates:_}}
{{spark.readStream.option("modifiedAfter","2019-06-15T05:00:00").option("modifiedBefore","2020-06-15T05:00:00").option("quote",
'"').option("escape", '"').csv(source_path)}}
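To make the intended filtering concrete, here is a minimal plain-Python sketch of the selection the two options would apply to candidate files. It is not Spark's implementation: the helper names {{parse_ts}} and {{filter_by_mtime}} are hypothetical, timestamps are assumed to be UTC, and both bounds are assumed to be exclusive.

```python
import os
from datetime import datetime, timezone
from typing import Iterable, List, Optional


def parse_ts(value: str) -> float:
    """Parse 'YYYY-MM-DDTHH:MM:SS' into epoch seconds.

    Assumption: the timestamp is interpreted as UTC.
    """
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc).timestamp()


def filter_by_mtime(
    paths: Iterable[str],
    modified_after: Optional[str] = None,
    modified_before: Optional[str] = None,
) -> List[str]:
    """Keep only paths whose modification time falls inside the given window.

    Assumption: both bounds are exclusive; a bound left as None is not applied.
    """
    lower = parse_ts(modified_after) if modified_after else None
    upper = parse_ts(modified_before) if modified_before else None

    kept = []
    for path in paths:
        mtime = os.path.getmtime(path)
        if lower is not None and mtime <= lower:
            continue  # modified at or before the lower bound
        if upper is not None and mtime >= upper:
            continue  # modified at or after the upper bound
        kept.append(path)
    return kept
```

With both arguments set, this corresponds to the "between two dates" usage above; with only one set, to the "after" or "before" usages.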