On 24 Oct 2016, at 20:32, Cheng Lian <lian.cs....@gmail.com> wrote:
> On 10/22/16 6:18 AM, Steve Loughran wrote:
>> On Sat, Oct 22, 2016 at 3:41 AM, Cheng Lian <lian.cs....@gmail.com> wrote:
>>> What version of Spark are you using, and how many output files does the job write out?
>>>
>>> By default, Spark versions before 1.6 (exclusive) write Parquet summary files when committing the job. This process reads the footers from all Parquet files in the destination directory and merges them together. It can be particularly bad if you are appending a small amount of data to a large existing Parquet dataset. If that's the case, you may disable Parquet summary files by setting the Hadoop configuration "parquet.enable.summary-metadata" to false.
>>
>> Now I'm a bit mixed up. Should that be spark.sql.parquet.enable.summary-metadata=false?
>
> No, "parquet.enable.summary-metadata" is a Hadoop configuration option introduced by Parquet. In Spark 2.0, you can simply set it using spark.conf.set(); Spark will propagate it properly.

OK, chased it down to a feature that ryanb @ netflix made optional, presumably for their S3 work (PARQUET-107).

This is what I'd say makes a good set of options for S3A & Parquet:

spark.sql.parquet.filterPushdown true
spark.sql.parquet.mergeSchema false
spark.hadoop.parquet.enable.summary-metadata false

While for ORC, you want:

spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000
spark.sql.orc.filterPushdown true

And:

spark.sql.hive.metastorePartitionPruning true

along with committing via:

spark.speculation false
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true

For when people get to play with the Hadoop S3A phase II binaries, they'll also be wanting:

spark.hadoop.fs.s3a.readahead.range 157810688
// faster backward seek for ORC and Parquet input
spark.hadoop.fs.s3a.experimental.input.fadvise random
// PUT blocks in separate threads
spark.hadoop.fs.s3a.fast.output.enabled true

The fadvise one is *really* good when working with ORC/Parquet; without it, column filtering and predicate pushdown are somewhat crippled.
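To keep a set like this together in one place, here is a small illustrative Python helper (not part of Spark; the option names are the ones discussed in this thread) that renders such options into the `key value` lines that spark-defaults.conf expects:

```python
# Illustrative sketch: serialize the recommended options into
# spark-defaults.conf format (one "key value" pair per line).
# The option names below come from the thread; verify them against
# the Spark/Hadoop versions you actually deploy.

S3A_PARQUET_OPTS = {
    "spark.sql.parquet.filterPushdown": "true",
    "spark.sql.parquet.mergeSchema": "false",
    "spark.hadoop.parquet.enable.summary-metadata": "false",
    "spark.sql.hive.metastorePartitionPruning": "true",
    "spark.speculation": "false",
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
    "spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped": "true",
}

def to_spark_defaults(opts):
    """Render an options dict as sorted spark-defaults.conf lines."""
    return "\n".join(f"{k} {v}" for k, v in sorted(opts.items()))

print(to_spark_defaults(S3A_PARQUET_OPTS))
```

The same pairs can equally be passed per-session via `SparkSession.builder.config(key, value)` or, for Hadoop-level options in Spark 2.0, via `spark.conf.set()` as Cheng notes above.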