Thanks for the repo, Ryan! I had heard that Netflix had a committer that used the local filesystem as a temporary store, but I wasn't able to find that anywhere until now. I implemented something similar that writes to HDFS and then copies to S3, but it doesn't use the multipart upload API, so I'm sure yours will be faster. I think this is the best thing until S3Guard comes out.
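In case it helps anyone compare, here is a minimal sketch of what the copy step could look like with the multipart path, assuming the AWS SDK v1 TransferManager (which switches to parallel multipart uploads once a file exceeds a configurable threshold). The bucket, key, and staging path are placeholders, not anything from Ryan's repo:

    import java.io.File
    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import com.amazonaws.services.s3.transfer.TransferManagerBuilder

    val s3 = AmazonS3ClientBuilder.defaultClient()
    // TransferManager uploads files above the threshold as parallel
    // multipart uploads instead of one serialized PUT.
    val tm = TransferManagerBuilder.standard()
      .withS3Client(s3)
      .withMultipartUploadThreshold(16L * 1024 * 1024) // 16 MB
      .build()
    try {
      // Placeholder paths: one committed task file staged on local disk.
      val upload = tm.upload("my-bucket", "table/part-00000.parquet",
        new File("/data/staging/part-00000.parquet"))
      upload.waitForCompletion() // blocks until all parts have uploaded
    } finally {
      tm.shutdownNow()
    }

That parallelism in the copy step is why I'd expect the multipart version to win on large files.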
As far as my UUID-tracking approach goes, I was under the impression that a given task would write the same set of files on each attempt. Thus, if a task fails, either the whole job is aborted and the files are removed, or the task is retried and its files are overwritten (a sketch of the naming scheme I had in mind is below). On the other hand, I can see how having partially-written data visible to readers immediately could cause problems, and that is a good reason to avoid my approach.

Steve -- that design document was a very enlightening read. I will be interested in following and possibly contributing to S3Guard in the future.
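To make the overwrite assumption concrete, here is a hypothetical sketch of the deterministic naming I was picturing. The UUID is fixed once per job, so every attempt of a given task computes the same key and a retry simply replaces the earlier attempt's output. The helper name and path layout here are illustrative only:

    import java.util.UUID

    // Illustrative helper, not from any real committer: the job UUID is
    // generated once, so all attempts of a given partition produce the
    // same file name and a retry overwrites the previous attempt.
    class DeterministicTaskPaths(outputDir: String, jobUuid: UUID) {
      def pathFor(partitionId: Int): String =
        f"$outputDir/part-$partitionId%05d-$jobUuid.parquet"
    }

    val paths = new DeterministicTaskPaths("s3a://bucket/table", UUID.randomUUID())
    // Two attempts of partition 3 target exactly the same location:
    assert(paths.pathFor(3) == paths.pathFor(3))

The catch, as noted above, is that readers can see each attempt's partially-written files as soon as the bytes land in S3.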