hkkolodner commented on issue #25795: [WIP][SPARK-29037][Core] Spark gives 
duplicate result when an application was killed
URL: https://github.com/apache/spark/pull/25795#issuecomment-533884881
 
 
   Thanks @steveloughran for mentioning our paper on Stocator.  (This is a 
better reference -- [Stocator: Providing High Performance and Fault Tolerance 
for Apache Spark Over Object 
Storage](https://ieeexplore.ieee.org/document/8411062).)
   
   It is better to characterize Stocator with respect to commit rather than 
abort.  Stocator only commits.  The last commit wins, and any parts that do 
not belong to it are ignored when the partition is read.  Those stale parts 
can be deleted at read time to reclaim space, or the space can be reclaimed 
by an independent garbage collection process.  This can be achieved by 
writing a manifest at commit time that lists the parts belonging to the 
commit, e.g., by extending the _SUCCESS file/object.  When the partition is 
read, only the parts listed in the manifest of the most recent commit are 
read.  Maybe a similar solution would be easier here, rather than trying to 
ensure that only one commit happens at a time.
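
A minimal sketch of the manifest idea, to make the read path concrete. This is not Stocator's or Spark's actual code: the object store is modelled as a plain Python dict, and the names `commit`, `read_partition`, `collect_garbage`, the JSON manifest layout, and the part key names are all assumptions made for illustration.

```python
import json

# Assumed name: the _SUCCESS object, extended to carry a manifest.
MANIFEST = "_SUCCESS"


def commit(store, prefix, part_keys):
    """Overwrite the manifest with this commit's parts; the last commit wins."""
    manifest = {"parts": sorted(part_keys)}
    store[f"{prefix}/{MANIFEST}"] = json.dumps(manifest).encode("utf-8")


def read_partition(store, prefix):
    """Return the data of only those parts listed in the most recent manifest."""
    manifest = json.loads(store[f"{prefix}/{MANIFEST}"].decode("utf-8"))
    return [store[key] for key in manifest["parts"]]


def collect_garbage(store, prefix):
    """Delete parts under the prefix that the current manifest does not list."""
    manifest_key = f"{prefix}/{MANIFEST}"
    keep = set(json.loads(store[manifest_key].decode("utf-8"))["parts"])
    keep.add(manifest_key)
    stale = [k for k in store if k.startswith(prefix + "/") and k not in keep]
    for key in stale:
        del store[key]
    return stale


# Hypothetical object store: a flat key -> bytes map.
store = {}

# First run writes its parts and commits.
store["out/part-0-attempt-1"] = b"a1"
store["out/part-1-attempt-1"] = b"b1"
commit(store, "out", ["out/part-0-attempt-1", "out/part-1-attempt-1"])

# A second run (e.g., after the application was killed) writes new parts
# and commits again; its manifest replaces the earlier one.
store["out/part-0-attempt-2"] = b"a2"
store["out/part-1-attempt-2"] = b"b2"
commit(store, "out", ["out/part-0-attempt-2", "out/part-1-attempt-2"])

rows = read_partition(store, "out")      # parts of the last commit only
removed = collect_garbage(store, "out")  # parts of the earlier commit
```

In this sketch the reader never lists the directory, so parts left behind by an earlier commit cannot produce duplicate results; they only occupy space until `collect_garbage` removes them.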
