I'm using Spark 1.5.2 and trying to append a DataFrame to a partitioned Parquet directory in S3. It's well known that the default `ParquetOutputCommitter` performs poorly on S3, because a rename there is implemented as copy-then-delete, but the `DirectParquetOutputCommitter` is not safe to use for append operations in the case of failure.

I'm not very familiar with the intricacies of job/task committing and aborting, but I've written a rough replacement output committer that seems to work. It writes results directly to their final locations and uses the write UUID to determine which files to remove if the job or a task is aborted. It has held up in the simple tests I've tried; a sketch is included at the end of this message.

However, I can't make Spark use this alternate output committer, because the change in SPARK-8578 categorically prohibits any custom output committer from being used for appends, even one that is safe for appending.

I have two questions:

1) Does anyone more familiar with output committing have feedback on my proposed "safe" append strategy?
2) Is there any way to circumvent the restriction on append committers without editing and recompiling Spark?

Discussion of solutions in Spark 2.1 is also welcome.
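For reference, here is a minimal sketch of the idea. The class name is mine, the `spark.sql.sources.writeJobUUID` lookup is how I believe Spark exposes the write UUID to the output format (treat that as an assumption), and per-task-attempt cleanup on task abort is omitted; only the job-abort path is shown:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.{JobContext, JobStatus, TaskAttemptContext}
    import org.apache.parquet.hadoop.ParquetOutputCommitter

    class DirectAppendOutputCommitter(outputPath: Path, context: TaskAttemptContext)
        extends ParquetOutputCommitter(outputPath, context) {

      // UUID that Spark embeds in each part file name for this write job
      // (assumption: it is available under this Hadoop conf key).
      private val writeUuid: String =
        context.getConfiguration.get("spark.sql.sources.writeJobUUID")

      // Tasks write straight to the final output directory, no _temporary.
      override def getWorkPath: Path = outputPath
      override def setupJob(jobContext: JobContext): Unit = {}
      override def setupTask(taskContext: TaskAttemptContext): Unit = {}
      override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
      override def commitTask(taskContext: TaskAttemptContext): Unit = {}

      override def commitJob(jobContext: JobContext): Unit = {
        // Nothing to move; a _SUCCESS marker / summary metadata could be written here.
      }

      override def abortTask(taskContext: TaskAttemptContext): Unit = {
        // Per-attempt cleanup omitted in this sketch.
      }

      override def abortJob(jobContext: JobContext, state: JobStatus.State): Unit =
        deleteFilesForThisWrite(jobContext.getConfiguration)

      // Remove every file under the output path whose name carries this job's
      // write UUID, leaving data from earlier appends untouched.
      private def deleteFilesForThisWrite(conf: Configuration): Unit = {
        val fs = outputPath.getFileSystem(conf)
        val files = fs.listFiles(outputPath, /* recursive = */ true)
        while (files.hasNext) {
          val status = files.next()
          if (status.getPath.getName.contains(writeUuid)) {
            fs.delete(status.getPath, false)
          }
        }
      }
    }

For non-append writes I can, as far as I can tell, point Spark at it with
sc.hadoopConfiguration.set("spark.sql.parquet.output.committer.class", classOf[DirectAppendOutputCommitter].getName), but as described above the append code path ignores that setting.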