GabeChurch removed a comment on pull request #32518:
URL: https://github.com/apache/spark/pull/32518#issuecomment-1058840240


    @dongjoon-hyun thank you! This is golden. 
    
    I know you are active on the Apache ORC project, so if you have any 
additional wisdom to share, please do. 
    I've been testing with the Spark 3.2 release and a 3.3 fork on Kubernetes 
(writes of a couple of TB) against an ORC table for a while now, and I'm seeing 
significantly better performance with the magic S3 committer enabled. Probably 
worth noting that I'm partitioning, bucketing (1 col), and sorting on write. 
The table is fairly read heavy, and ORC's higher compression is also providing 
better query performance on S3 for this use case. 
    
    One thing for others: make especially sure you are avoiding small files. 
Write performance can and will quickly tank if you don't optimize for this, 
and with the magic S3 committer the effect seems to be somewhat more 
pronounced. 
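The small-files advice above comes down to some simple arithmetic. Here is a minimal sketch, assuming a target file size of 256 MiB (the comment does not name one) and a helper called `target_file_count` that is made up purely for illustration:

```python
MIB = 1024 ** 2


def target_file_count(total_bytes: int, target_file_bytes: int = 256 * MIB) -> int:
    """Estimate how many output files a write should aim for.

    The 256 MiB default is an assumption for illustration, not a value
    taken from the original comment.
    """
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    # Integer ceiling division; always aim for at least one file.
    return max(1, (total_bytes + target_file_bytes - 1) // target_file_bytes)
```

For a 1 TiB write, `target_file_count(1024 ** 4)` gives 4096. In PySpark that number could be used as a rough starting point for `DataFrame.repartition(n)` before the write. With partitioning and bucketing in play, the real file count also depends on how rows are spread across partitions and buckets, so treat it as an estimate only.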
   
   Property | Value
   -- | --
   spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled | true
   spark.hadoop.fs.s3a.committer.magic.enabled | true
   spark.hadoop.fs.s3a.committer.name | magic
   spark.hadoop.fs.s3a.experimental.input.fadvise | random
   spark.hadoop.fs.s3a.impl | org.apache.hadoop.fs.s3a.S3AFileSystem
   spark.hadoop.fs.s3a.readahead.range | 157810688
   spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version | 2
   spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a | org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
   spark.sql.hive.metastorePartitionPruning | true
   spark.sql.orc.filterPushdown | true
   spark.sql.parquet.output.committer.class | org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
   spark.sql.sources.commitProtocolClass | org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
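
The properties in the table can be passed to a job as `--conf` flags. Here is a minimal sketch that turns them into a `spark-submit` argument list. The property names and values are copied from the table; the names `MAGIC_COMMITTER_CONF` and `to_spark_submit_args` are made up for illustration:

```python
# Settings copied from the table above; values are kept as strings because
# that is how they are passed on the command line.
MAGIC_COMMITTER_CONF = {
    "spark.hadoop.fs.s3a.bucket.all.committer.magic.enabled": "true",
    "spark.hadoop.fs.s3a.committer.magic.enabled": "true",
    "spark.hadoop.fs.s3a.committer.name": "magic",
    "spark.hadoop.fs.s3a.experimental.input.fadvise": "random",
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "spark.hadoop.fs.s3a.readahead.range": "157810688",
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
    "spark.sql.hive.metastorePartitionPruning": "true",
    "spark.sql.orc.filterPushdown": "true",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
}


def to_spark_submit_args(conf: dict) -> list:
    """Flatten a config mapping into ['--conf', 'key=value', ...]."""
    args = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args
```

`" ".join(to_spark_submit_args(MAGIC_COMMITTER_CONF))` gives a string that can be appended to a `spark-submit` command. The same pairs could instead be applied one at a time with `SparkSession.builder.config(key, value)`. Note that the two `org.apache.spark.internal.io.cloud` classes come from Spark's `spark-hadoop-cloud` module, so this sketch assumes that module is on the classpath.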
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


