Thanks Cheng!
I'm running 1.5. After setting the following, I'm no longer seeing this
issue:
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
Thanks,
-Matt
On Fri, Dec 11, 2015 at 1:58 AM, Cheng Lian wrote:
> This is probably caused by schema
Thanks for the feedback, Matt!
Yes, we've also seen other feedback about slow Parquet summary file
generation, especially when appending a small dataset to an existing
large dataset. Disabling it is a reasonable workaround since the summary
files are no longer important after parquet-mr 1.7.
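For anyone finding this thread later, here is a rough sketch of how the workaround might be combined with an append. It only reuses the setting Matt reported and the DataFrameWriter calls from the snippet in this thread. The helper name appendWithoutSummary and its parameters are made up for illustration, and it assumes a Spark 1.5 application or shell with a SparkContext already available.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper, not part of Spark: the name and parameters are
// assumptions made for this example.
def appendWithoutSummary(
    sc: SparkContext,
    df: DataFrame,
    partitionCols: Seq[String],
    targetPath: String): Unit = {
  // Setting reported in this thread: skip writing the Parquet summary
  // files (_metadata / _common_metadata) when the job commits.
  sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

  // Append the new data to the existing partitioned dataset.
  df.write
    .format("parquet")
    .partitionBy(partitionCols: _*)
    .mode(SaveMode.Append)
    .save(targetPath)
}
```

The Hadoop configuration change applies to every later Parquet write on the same SparkContext, so it probably only needs to be set once at startup.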
This is probably caused by schema merging. Were you using Spark 1.4 or
earlier versions? Could you please try the following snippet to see
whether it helps:
    df.write
      .format("parquet")
      .option("mergeSchema", "false")
      .partitionBy(partitionCols: _*)
      .mode(saveMode)
      .save(targetPath)