hkkolodner commented on issue #25795: [WIP][SPARK-29037][Core] Spark gives duplicate result when an application was killed
URL: https://github.com/apache/spark/pull/25795#issuecomment-533884881

Thanks @steveloughran for mentioning our paper on Stocator. (This is a better reference: [Stocator: Providing High Performance and Fault Tolerance for Apache Spark Over Object Storage](https://ieeexplore.ieee.org/document/8411062).)

It is better to characterize Stocator with respect to commit rather than abort. Stocator only commits; it never aborts. The last commit wins, and any parts that do not belong to it are ignored when the partition is read. Those parts can also be deleted at read time to reclaim space, or the space can be reclaimed by an independent garbage collection process.

This can be achieved by writing a manifest at commit time that lists the parts belonging to the commit, e.g., by extending the _SUCCESS file/object. When the partition is read, only the parts listed in the manifest of the most recent commit are read.

Maybe a similar solution would be easier here than trying to ensure that only one commit happens at a time.
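
To make the manifest idea concrete, here is a minimal Python sketch against a local directory. It is not Stocator's or Spark's actual code. The file naming, the JSON manifest layout, the helper names (`write_part`, `commit`, `read_partition`), and the caller-supplied sequence number used to decide which commit is "most recent" are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_part(partition_dir, attempt_id, part_no, rows):
    """Write one part file; the name embeds the (hypothetical) attempt id."""
    name = f"part-{part_no:05d}-{attempt_id}.txt"
    (partition_dir / name).write_text("\n".join(rows) + "\n")
    return name


def commit(partition_dir, attempt_id, parts, seq):
    """Commit by writing a manifest that lists the parts of this attempt.

    Assumption: `seq` is a monotonically increasing number supplied by the
    caller. It is zero-padded so that name order matches commit order.
    """
    manifest = {"attempt": attempt_id, "seq": seq, "parts": sorted(parts)}
    path = partition_dir / f"_SUCCESS-{seq:010d}-{attempt_id}.json"
    path.write_text(json.dumps(manifest))


def read_partition(partition_dir, reclaim=False):
    """Read only the parts listed in the most recent manifest.

    Parts not in that manifest are ignored, or deleted when `reclaim` is set.
    """
    manifests = sorted(partition_dir.glob("_SUCCESS-*.json"))
    if not manifests:
        return []  # nothing was ever committed
    wanted = set(json.loads(manifests[-1].read_text())["parts"])
    rows = []
    for path in sorted(partition_dir.glob("part-*")):
        if path.name in wanted:
            rows.extend(path.read_text().splitlines())
        elif reclaim:
            path.unlink()  # reclaim space held by a superseded attempt
    return rows


def demo():
    """Two attempts write the same data; only the last commit is visible."""
    with tempfile.TemporaryDirectory() as tmp:
        d = Path(tmp)
        first = [write_part(d, "attempt-a", 0, ["x", "y"])]
        commit(d, "attempt-a", first, seq=1)
        second = [write_part(d, "attempt-b", 0, ["x", "y"])]
        commit(d, "attempt-b", second, seq=2)
        rows = read_partition(d)
        before = len(list(d.glob("part-*")))
        read_partition(d, reclaim=True)
        after = len(list(d.glob("part-*")))
        return rows, before, after
```

In this sketch both attempts leave part files in the same directory, yet a reader sees each row once because it follows only the newest manifest. A real object store would also need an agreed way to order commits, which is simply assumed here via `seq`.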
