Jungtaek Lim created SPARK-30915:
------------------------------------

             Summary: FileStreamSinkLog: Avoid reading the metadata log file 
when finding the latest batch ID
                 Key: SPARK-30915
                 URL: https://issues.apache.org/jira/browse/SPARK-30915
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 3.0.0
            Reporter: Jungtaek Lim


FileStreamSink.addBatch checks the latest batch ID before writing outputs, so 
that it can skip the write if the batch was already committed.

While it's valid to compare the current batch ID with the latest batch ID, the 
getLatest() method is designed to return both the batch ID and the content, 
which means that the latest metadata log file is read and deserialized. 
This introduces heavy latency when the latest batch is a compacted batch, 
because a compacted log file also carries the entries of earlier batches.

We could just find the metadata log file for the latest batch ID and do only 
the minimal check, without reading its content.
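
To illustrate the idea, here is a minimal sketch in Python (not Spark code) that 
derives the latest batch ID from file names alone. It assumes that each batch 
is stored in a file named after its batch ID, and that compacted batches carry 
a ".compact" suffix. The function names latest_batch_id and should_skip_batch 
are hypothetical and exist only for this example.

```python
import os
import re
from typing import Optional

# Assumed naming scheme: "<batchId>" or "<batchId>.compact".
# Anything else (temp files, checksum files) is ignored.
BATCH_FILE = re.compile(r"(\d+)(?:\.compact)?", re.ASCII)


def batch_id_from_name(name: str) -> Optional[int]:
    """Return the batch ID encoded in a file name, or None if it is not a batch file."""
    match = BATCH_FILE.fullmatch(name)
    return int(match.group(1)) if match else None


def latest_batch_id(log_dir: str) -> Optional[int]:
    """Find the highest batch ID by listing the directory; no file is opened."""
    ids = (batch_id_from_name(name) for name in os.listdir(log_dir))
    return max((i for i in ids if i is not None), default=None)


def should_skip_batch(log_dir: str, batch_id: int) -> bool:
    """A batch is skipped when it is not newer than the latest committed batch."""
    latest = latest_batch_id(log_dir)
    return latest is not None and batch_id <= latest
```

The cost of this check depends on the number of files in the log directory, 
not on the size of the latest (possibly compacted) log file.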



