I'm using Spark 1.5.2 and trying to append a DataFrame to a partitioned Parquet directory in S3. It's well known that the default `ParquetOutputCommitter` performs poorly on S3, because a rename there is implemented as copy-then-delete, but the `DirectParquetOutputCommitter` is not safe to use for append operations in the case of failure.

I'm not very familiar with the intricacies of job/task committing and aborting, but I've written a rough replacement output committer that seems to work. It writes results directly to their final locations and uses the write UUID to determine which files to remove if the job or a task is aborted. It has held up in the simple tests I've tried; a sketch is included at the end of this message.

However, I can't make Spark use this alternate output committer, because the change in SPARK-8578 categorically prohibits any custom output committer from being used for appends, even one that is safe for appending.

I have two questions:

1) Does anyone more familiar with output committing have feedback on my proposed "safe" append strategy?
2) Is there any way to circumvent the restriction on append committers without editing and recompiling Spark?

Discussion of solutions in Spark 2.1 is also welcome.
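For reference, here is a minimal sketch of the idea. The class name is mine, the `spark.sql.sources.writeJobUUID` lookup is how I believe Spark exposes the write UUID to the output format (treat that as an assumption), and per-task-attempt cleanup on task abort is omitted; only the job-abort path is shown:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.{JobContext, JobStatus, TaskAttemptContext}
    import org.apache.parquet.hadoop.ParquetOutputCommitter

    class DirectAppendOutputCommitter(outputPath: Path, context: TaskAttemptContext)
        extends ParquetOutputCommitter(outputPath, context) {

      // UUID that Spark embeds in each part file name for this write job
      // (assumption: it is available under this Hadoop conf key).
      private val writeUuid: String =
        context.getConfiguration.get("spark.sql.sources.writeJobUUID")

      // Tasks write straight to the final output directory, no _temporary.
      override def getWorkPath: Path = outputPath
      override def setupJob(jobContext: JobContext): Unit = {}
      override def setupTask(taskContext: TaskAttemptContext): Unit = {}
      override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
      override def commitTask(taskContext: TaskAttemptContext): Unit = {}

      override def commitJob(jobContext: JobContext): Unit = {
        // Nothing to move; a _SUCCESS marker / summary metadata could be written here.
      }

      override def abortTask(taskContext: TaskAttemptContext): Unit = {
        // Per-attempt cleanup omitted in this sketch.
      }

      override def abortJob(jobContext: JobContext, state: JobStatus.State): Unit =
        deleteFilesForThisWrite(jobContext.getConfiguration)

      // Remove every file under the output path whose name carries this job's
      // write UUID, leaving data from earlier appends untouched.
      private def deleteFilesForThisWrite(conf: Configuration): Unit = {
        val fs = outputPath.getFileSystem(conf)
        val files = fs.listFiles(outputPath, /* recursive = */ true)
        while (files.hasNext) {
          val status = files.next()
          if (status.getPath.getName.contains(writeUuid)) {
            fs.delete(status.getPath, false)
          }
        }
      }
    }

For non-append writes I can, as far as I can tell, point Spark at it with
sc.hadoopConfiguration.set("spark.sql.parquet.output.committer.class", classOf[DirectAppendOutputCommitter].getName), but as described above the append code path ignores that setting.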