[
https://issues.apache.org/jira/browse/SPARK-30915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-30915:
----------------------------------
Affects Version/s: 3.1.0 (was: 3.0.0)
> FileStreamSinkLog: Avoid reading the metadata log file when finding the
> latest batch ID
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-30915
> URL: https://issues.apache.org/jira/browse/SPARK-30915
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.1.0
> Reporter: Jungtaek Lim
> Priority: Major
>
> FileStreamSink.addBatch checks the latest batch ID before writing outputs so
> that it can skip a batch that has already been committed.
> While it is valid to compare the current batch with the latest batch ID, the
> getLatest() method is designed to return both the batch ID and the content,
> which means the latest metadata log file is read and deserialized. This
> introduces heavy latency when the latest batch is a compacted batch.
> We could instead just find the metadata log file for the latest batch ID and
> do only the minimal check, without reading its content.
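
The proposal above can be illustrated with a minimal sketch. This is not
Spark's implementation: it is a standalone Python model that assumes the
metadata log is a directory whose files are named after their batch ID, with
compacted batches carrying a ".compact" suffix. The helpers batch_id_of,
latest_batch_id and should_skip are hypothetical names introduced only for
this example. The point is that the latest batch ID can be derived from file
names alone, so no log file is opened or deserialized.

```python
import os
import tempfile
from typing import Optional

# Assumption for this sketch: compacted batch files end with this suffix.
COMPACT_SUFFIX = ".compact"


def batch_id_of(file_name: str) -> Optional[int]:
    """Parse a batch ID from a file name such as "7" or "9.compact".

    Returns None for anything that is not a batch file (checksums, temp files).
    """
    stem = file_name
    if stem.endswith(COMPACT_SUFFIX):
        stem = stem[: -len(COMPACT_SUFFIX)]
    if stem.isascii() and stem.isdigit():
        return int(stem)
    return None


def latest_batch_id(log_dir: str) -> Optional[int]:
    """Return the highest batch ID in log_dir, looking at file names only."""
    ids = [batch_id_of(name) for name in os.listdir(log_dir)]
    return max((i for i in ids if i is not None), default=None)


def should_skip(log_dir: str, current_batch_id: int) -> bool:
    """True if current_batch_id was already committed (hypothetical helper)."""
    latest = latest_batch_id(log_dir)
    return latest is not None and current_batch_id <= latest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as log_dir:
        # Create empty placeholder files; their content is never read.
        for name in ["0", "1", "9.compact", ".9.compact.crc"]:
            open(os.path.join(log_dir, name), "w").close()
        print(latest_batch_id(log_dir))
        print(should_skip(log_dir, 9), should_skip(log_dir, 10))
```

Because only a directory listing is needed, the cost of the check in this
model does not depend on how large the latest (possibly compacted) log file
is.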