[
https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16713030#comment-16713030
]
Mihaly Toth commented on SPARK-25331:
-------------------------------------
I have closed my PR. It should probably be documented that users are expected to
read only those files whose names are written to the manifest files.
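The reader-side contract described above can be illustrated with a minimal, Spark-free sketch (all names are hypothetical, not Spark's actual API): orphaned data files from a failed or retried batch may exist on disk, but a reader that filters the directory listing against the manifest never observes them.

```scala
// Hypothetical sketch of manifest-based reading: only files recorded in the
// manifest count as committed. Duplicate or orphaned files left behind by a
// retried batch are ignored by any reader that applies this filter.
def committedFiles(allFilesInDir: Seq[String], manifestEntries: Set[String]): Seq[String] =
  allFilesInDir.filter(manifestEntries.contains)
```

For example, if a retried batch left both `part-0` and `part-0-retry` on disk but only `part-0-retry` was committed to the manifest, the reader sees a single file.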
> Structured Streaming File Sink duplicates records in case of driver failure
> ---------------------------------------------------------------------------
>
> Key: SPARK-25331
> URL: https://issues.apache.org/jira/browse/SPARK-25331
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.3.1
> Reporter: Mihaly Toth
> Priority: Major
>
> Let's assume {{FileStreamSink.addBatch}} is called, an appropriate job has
> been started by {{FileFormatWriter.write}}, and the resulting task sets
> complete, but in the meantime the driver dies. In such a case, repeating
> {{FileStreamSink.addBatch}} will write the data twice: if the driver fails
> after the executors have started processing the job, the processed batch
> ends up written again on the retry.
> Steps needed:
> # call {{FileStreamSink.addBatch}}
> # make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}}
> # call {{FileStreamSink.addBatch}} with the same data
> # make the {{ManifestFileCommitProtocol}} finish its {{commitJob}}
> successfully
> # verify the file output - according to the {{Sink.addBatch}} documentation,
> the RDD should be written only once
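The failure sequence in the steps above can be modeled with a small, self-contained sketch (class and method names here are hypothetical stand-ins, not Spark's actual implementation): task output lands on disk before {{commitJob}} records it in the manifest, so a failed commit followed by a replay leaves duplicate files on disk, while a manifest keyed by batch id still lists each batch's output exactly once.

```scala
import scala.collection.mutable

// Hypothetical model of a manifest commit protocol: data files are written
// first, then commitJob records them in the manifest. If commitJob fails and
// the batch is replayed, the files are physically written a second time, but
// the manifest (keyed by batchId) commits each batch at most once.
class ManifestModel {
  val filesOnDisk = mutable.ArrayBuffer.empty[String]          // every physical write
  private val manifest = mutable.Map.empty[Long, Seq[String]]  // batchId -> committed files

  def addBatch(batchId: Long, records: Seq[String], failCommit: Boolean): Unit = {
    if (manifest.contains(batchId)) return                     // already committed: skip replay
    val written = records.map(r => s"part-$batchId-$r-${filesOnDisk.size}")
    filesOnDisk ++= written                                    // task output lands on disk
    if (failCommit) throw new RuntimeException("driver died before commitJob")
    manifest(batchId) = written                                // commitJob succeeds
  }

  def committedFiles: Seq[String] = manifest.values.flatten.toSeq
}
```

Running steps 1-4 against this model (first call fails during commit, replay succeeds) leaves two files on disk but only one in the manifest, which is exactly the duplication the unit test in the PR checks for at the file level.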
> I have created a wip PR with a unit test:
> https://github.com/apache/spark/pull/22331
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)