GabeChurch removed a comment on pull request #32518:
URL: https://github.com/apache/spark/pull/32518#issuecomment-1058840240
@dongjoon-hyun thank you! This is golden.
I know you are active on the Apache ORC project, so if you have any additional
wisdom to share, please do.
I've been testing Spark 3.2 and a 3.3 fork on Kubernetes (writes of a couple
TB) against an ORC table for a while now, and I'm seeing significantly better
performance with the magic S3 committer enabled. Probably worth noting that
I'm partitioning, bucketing (one column), and sorting on write; a sketch of
that write pattern follows below. The table is fairly read-heavy, and ORC's
higher compression is also giving better query performance on S3 for this
use case.
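For concreteness, here's a minimal sketch of that write pattern. The table name, column names, bucket count, and S3 path are all placeholders I made up for illustration, not the actual job from this PR:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical write matching the pattern described above: partitioned,
// bucketed on one column, and sorted. Note that bucketBy/sortBy require
// saveAsTable (a metastore-backed table) rather than a plain save().
def writeEvents(df: DataFrame): Unit = {
  df.write
    .format("orc")
    .partitionBy("event_date")   // placeholder partition column
    .bucketBy(64, "user_id")     // placeholder bucket column and count
    .sortBy("user_id")
    .mode("overwrite")
    .option("path", "s3a://my-bucket/warehouse/events") // placeholder path
    .saveAsTable("events")       // placeholder table name
}
```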
One thing for others: make especially sure you avoid small files. Write
performance can quickly tank if you don't optimize for this, and the effect
seems somewhat more pronounced with the magic S3 committer (see the
repartition sketch below).
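One common mitigation, continuing the hypothetical sketch above: repartition on the partition column before writing so each output partition is produced by a bounded number of tasks. The column name and partition count here are assumptions, not tuned values:

```scala
import org.apache.spark.sql.functions.col

// Repartitioning before the write keeps the per-partition file count down.
// This matters even more with the magic committer, since each output file
// becomes a pending multipart upload that is completed at job commit.
val compacted = df.repartition(200, col("event_date")) // placeholder count/column
```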
Property | Value
-- | --
spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled | true
spark.hadoop.fs.s3a.committer.magic.enabled | true
spark.hadoop.fs.s3a.committer.name | magic
spark.hadoop.fs.s3a.experimental.input.fadvise | random
spark.hadoop.fs.s3a.impl | org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.readahead.range | 157810688
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version | 2
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a | org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.sql.hive.metastorePartitionPruning | true
spark.sql.orc.filterPushdown | true
spark.sql.parquet.output.committer.class | org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
spark.sql.sources.commitProtocolClass | org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
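For anyone wiring this up programmatically rather than via spark-submit, a minimal sketch applying the table above through the SparkSession builder (the app name is a placeholder; all keys and values are copied verbatim from the table):

```scala
import org.apache.spark.sql.SparkSession

// Committer settings must be in place before the session (and the S3A
// filesystem) is created, which the builder guarantees.
val spark = SparkSession.builder()
  .appName("s3a-magic-committer-demo") // placeholder name
  .config("spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled", "true")
  .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
  .config("spark.hadoop.fs.s3a.committer.name", "magic")
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("spark.hadoop.fs.s3a.readahead.range", "157810688")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
    "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
  .config("spark.sql.hive.metastorePartitionPruning", "true")
  .config("spark.sql.orc.filterPushdown", "true")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .getOrCreate()
```

Note that the last two classes live in the optional spark-hadoop-cloud module, so that artifact has to be on the classpath.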