Jungtaek Lim created SPARK-30915:
------------------------------------
Summary: FileStreamSinkLog: Avoid reading the metadata log file
when finding the latest batch ID
Key: SPARK-30915
URL: https://issues.apache.org/jira/browse/SPARK-30915
Project: Spark
Issue Type: Improvement
Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Jungtaek Lim
FileStreamSink.addBatch checks the latest batch ID before writing outputs, so
that it can skip a batch that was already committed.
While comparing the current batch ID with the latest batch ID is valid, the
getLatest() method is designed to return both the batch ID and the content,
which means the latest metadata log file is read and deserialized.
This introduces heavy latency when the latest batch is a compacted batch, since
a compacted log file aggregates the entries of earlier batches and can be large.
We could instead just find the metadata log file for the latest batch ID and do
only the minimal check, without reading its content.
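
The idea can be illustrated outside of Spark with a minimal sketch. It is not
the FileStreamSinkLog implementation; it only assumes that the metadata log
directory holds one file per batch, named by its batch ID, and that compacted
batches carry a ".compact" suffix. The function names (parse_batch_id,
latest_batch_id, should_skip_batch) are hypothetical and exist only for this
example. The latest batch ID is derived from a directory listing alone, so no
log file is opened or deserialized.

```python
import os
from typing import Optional

# Assumption for this sketch: compacted batch files are named "<batchId>.compact".
COMPACT_SUFFIX = ".compact"


def parse_batch_id(file_name: str) -> Optional[int]:
    """Return the batch ID encoded in a file name, or None if it is not a batch file."""
    if file_name.endswith(COMPACT_SUFFIX):
        stem = file_name[: -len(COMPACT_SUFFIX)]
    else:
        stem = file_name
    # Accept plain ASCII digits only; skips checksum files, temp files, etc.
    if stem.isascii() and stem.isdigit():
        return int(stem)
    return None


def latest_batch_id(log_dir: str) -> Optional[int]:
    """Find the highest batch ID from file names only; no file content is read."""
    ids = [parse_batch_id(name) for name in os.listdir(log_dir)]
    ids = [batch_id for batch_id in ids if batch_id is not None]
    return max(ids) if ids else None


def should_skip_batch(log_dir: str, batch_id: int) -> bool:
    """True if batch_id was already committed, i.e. it is not newer than the latest ID."""
    latest = latest_batch_id(log_dir)
    return latest is not None and batch_id <= latest
```

With files named 0, 1 and 2.compact in the directory, latest_batch_id returns 2
regardless of how large 2.compact is, because the cost depends only on the
number of entries in the listing.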