Re: memory leak when saving Parquet files in Spark

2015-12-14 Thread Matt K
Thanks Cheng! I'm running 1.5. After setting the following, I'm no longer seeing this issue:

    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

Thanks,
-Matt
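For reference, a minimal sketch of how this workaround fits into a Spark 1.5 job; the application name, source table ("events"), and output path are illustrative assumptions, not values from the thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Spark 1.5-era setup (SQLContext rather than SparkSession).
    val sc = new SparkContext(new SparkConf().setAppName("parquet-summary-off"))
    val sqlContext = new SQLContext(sc)

    // Disable generation of the Parquet summary files (_metadata, _common_metadata),
    // which is the workaround described above.
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

    // Illustrative write; the "events" table and output path are assumptions.
    val df = sqlContext.table("events")
    df.write.format("parquet").mode("append").save("/tmp/events_parquet")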

Re: memory leak when saving Parquet files in Spark

2015-12-14 Thread Cheng Lian
Thanks for the feedback, Matt! Yes, we've seen other reports of slow Parquet summary file generation as well, especially when appending a small dataset to an existing large dataset. Disabling it is a reasonable workaround, because the summary files are no longer important after parquet-mr 1.7.
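One way to confirm the workaround took effect is to check the output directory for the summary files; a minimal sketch, assuming the same sc and the illustrative output path from the sketch above:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // List the output directory and flag any Parquet summary files that remain.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val summaries = fs.listStatus(new Path("/tmp/events_parquet"))
      .map(_.getPath.getName)
      .filter(name => name == "_metadata" || name == "_common_metadata")
    if (summaries.isEmpty) println("no summary files written")
    else summaries.foreach(name => println(s"summary file still present: $name"))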

Re: memory leak when saving Parquet files in Spark

2015-12-10 Thread Cheng Lian
This is probably caused by schema merging. Were you using Spark 1.4 or earlier versions? Could you please try the following snippet to see whether it helps:

    df.write
      .format("parquet")
      .option("mergeSchema", "false")
      .partitionBy(partitionCols: _*)
      .mode(saveMode)
      .save(targetPath)
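For completeness, a self-contained sketch of the suggested call with illustrative stand-ins for the placeholders; partitionCols, saveMode, targetPath, and the source table here are assumptions, not values from the thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{SaveMode, SQLContext}

    val sc = new SparkContext(new SparkConf().setAppName("parquet-no-merge-schema"))
    val sqlContext = new SQLContext(sc)

    // Illustrative stand-ins for the placeholders in the snippet above.
    val partitionCols = Seq("date")        // assumed partition column
    val saveMode = SaveMode.Append         // assumed save mode
    val targetPath = "/tmp/events_parquet" // assumed output path

    val df = sqlContext.table("events")    // assumed source table

    df.write
      .format("parquet")
      .option("mergeSchema", "false")
      .partitionBy(partitionCols: _*)
      .mode(saveMode)
      .save(targetPath)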