On 24 Oct 2016, at 20:32, Cheng Lian <lian.cs....@gmail.com> wrote:



On 10/22/16 6:18 AM, Steve Loughran wrote:

...
On Sat, Oct 22, 2016 at 3:41 AM, Cheng Lian <lian.cs....@gmail.com> wrote:

What version of Spark are you using, and how many output files does the job
write out?

By default, Spark versions before 1.6 write Parquet summary files when
committing the job. This process reads the footers of all Parquet files in
the destination directory and merges them together. It can be particularly
slow if you are appending a small amount of data to a large existing Parquet
dataset.

If that's the case, you may disable Parquet summary files by setting the
Hadoop configuration "parquet.enable.summary-metadata" to false.


Now I'm a bit mixed up. Should that be
spark.sql.parquet.enable.summary-metadata=false?
No, "parquet.enable.summary-metadata" is a Hadoop configuration option 
introduced by Parquet. In Spark 2.0, you can simply set it using 
spark.conf.set(), Spark will propagate it properly.
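
For what it's worth, a minimal Spark 2.0 sketch of that; the app name and
output path here are made-up placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-no-summaries")   // hypothetical app name
  .getOrCreate()

// Hadoop option set through the session conf; per the above, Spark 2.0
// propagates it to the Hadoop Configuration used by the Parquet writer.
spark.conf.set("parquet.enable.summary-metadata", "false")

spark.range(1000).write.parquet("s3a://my-bucket/out")   // hypothetical path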


OK, chased it down to a feature that ryanb @ netflix made optional, presumably
for their S3 work (PARQUET-107).

Here's what I'd say makes a good set of options for S3A & Parquet:

spark.sql.parquet.filterPushdown true
spark.sql.parquet.mergeSchema false
spark.hadoop.parquet.enable.summary-metadata false
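
And one hedged way to wire those in programmatically, if you build the
session yourself rather than putting them in spark-defaults.conf (assuming
Spark 2.x):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.parquet.filterPushdown", "true")
  .config("spark.sql.parquet.mergeSchema", "false")
  .config("spark.hadoop.parquet.enable.summary-metadata", "false")
  .getOrCreate()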

While for ORC, you want:

spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000
spark.sql.orc.filterPushdown true

And:

spark.sql.hive.metastorePartitionPruning true

along with committing via:

spark.speculation false
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped true
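
FWIW, spark.speculation and the spark.hadoop.* committer options are read
when the context starts, so they belong in spark-defaults.conf or the session
builder rather than being set at runtime. A sketch pulling the ORC, pruning
and commit settings together (again assuming Spark 2.x):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.orc.splits.include.file.footer", "true")
  .config("spark.sql.orc.cache.stripe.details.size", "10000")
  .config("spark.sql.orc.filterPushdown", "true")
  .config("spark.sql.hive.metastorePartitionPruning", "true")
  .config("spark.speculation", "false")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped", "true")
  .getOrCreate()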



For when people get to play with the Hadoop S3A phase II binaries, they'll
also want:

spark.hadoop.fs.s3a.readahead.range 157810688

// faster backward seek for ORC and Parquet input
spark.hadoop.fs.s3a.experimental.input.fadvise random

// PUT blocks in separate threads
spark.hadoop.fs.s3a.fast.output.enabled true
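
If you'd rather flip these on a live session, the same values (minus the
spark.hadoop. prefix) can go straight onto the Hadoop configuration; note
they only take effect for filesystem instances created afterwards:

// Same values as above, without the "spark.hadoop." prefix.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.readahead.range", "157810688")
hc.set("fs.s3a.experimental.input.fadvise", "random")
hc.set("fs.s3a.fast.output.enabled", "true")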


The fadvise one is *really* good when working with ORC/Parquet; without it,
column filtering and predicate pushdown are somewhat crippled.
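
To make that concrete, here's a toy read which exercises exactly that
pattern; the path and column names are made up:

// A projection plus a pushed-down predicate turns the scan into scattered
// range reads of column chunks; with fadvise=random, S3A serves these as
// bounded GETs instead of aborting and re-opening the stream on each
// backward seek.
val df = spark.read.parquet("s3a://my-bucket/events")   // hypothetical dataset
df.select("id", "ts").where("ts > 1477000000").show()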
