Re: Control number of parquet generated from JavaSchemaRDD

Michael Armbrust Tue, 25 Nov 2014 17:57:55 -0800

I believe coalesce(n, true) and repartition(n) are the same.  If the input
files are of similar sizes, then coalesce(n, false) will be cheaper, as it
introduces a narrow dependency
<https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf>,
meaning there won't be a shuffle.  However, if there is a lot of skew in
the input file sizes, then repartition(n) will ensure that the data is
shuffled evenly.
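
To make the skew point concrete, here is a toy model in plain Java. It is not
Spark code: arrays of record counts stand in for partitions, the class and
method names are made up for illustration, and the contiguous grouping is only
an assumption about how a shuffle-free merge behaves, not Spark's actual
partition coalescer.

```java
import java.util.Arrays;

// Toy model only: each array element is the record count of one partition.
class CoalesceSkewSketch {

    // Rough stand-in for coalesce(n, false): every output partition absorbs a
    // contiguous group of input partitions, so no record crosses group borders.
    static long[] mergeWithoutShuffle(long[] inputSizes, int n) {
        long[] out = new long[n];
        for (int i = 0; i < inputSizes.length; i++) {
            // Input partition i is assigned to output partition i * n / length.
            int target = (int) ((long) i * n / inputSizes.length);
            out[target] += inputSizes[i];
        }
        return out;
    }

    // Rough stand-in for repartition(n): a full shuffle spreads the records so
    // that output sizes differ by at most one record in this model.
    static long[] redistributeEvenly(long[] inputSizes, int n) {
        long total = Arrays.stream(inputSizes).sum();
        long[] out = new long[n];
        for (int i = 0; i < n; i++) {
            out[i] = total / n + (i < total % n ? 1 : 0);
        }
        return out;
    }

    public static void main(String[] args) {
        long[] skewed = {1000, 10, 10, 10};
        // Skew survives the shuffle-free merge: prints [1010, 20]
        System.out.println(Arrays.toString(mergeWithoutShuffle(skewed, 2)));
        // The shuffle evens things out: prints [515, 515]
        System.out.println(Arrays.toString(redistributeEvenly(skewed, 2)));
    }
}
```

With evenly sized inputs both approaches give the same sizes in this model, so
the cheaper shuffle-free merge is the better choice there.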


There is currently no way to control the file size other than picking a 'good'
number of partitions.
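
One way to pick such a number is to divide the total data size by the file
size you are aiming for. The sketch below is plain Java, not a Spark API: the
class name, method name, and the 128 MB target are assumptions chosen for
illustration. The input size is also only a rough proxy, since Parquet
encoding and compression change the size of what gets written.

```java
// Hypothetical helper: estimates a partition count from sizes in bytes.
class PartitionCountSketch {

    // Returns ceil(totalInputBytes / targetFileBytes), but never less than 1.
    static int estimatePartitions(long totalInputBytes, long targetFileBytes) {
        if (totalInputBytes < 0 || targetFileBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative / positive");
        }
        // Ceiling division written to avoid overflow on very large inputs.
        long n = totalInputBytes / targetFileBytes
                + (totalInputBytes % targetFileBytes == 0 ? 0 : 1);
        return (int) Math.max(1L, Math.min(n, (long) Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        long tenGb = 10L * 1024 * 1024 * 1024;
        long target = 128L * 1024 * 1024; // assumed target file size
        // 10 GB at 128 MB per file: prints 80
        System.out.println(estimatePartitions(tenGb, target));
    }
}
```

The result would then be passed as n to coalesce or repartition before
writing the Parquet output.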

On Tue, Nov 25, 2014 at 11:30 AM, tridib <tridib.sama...@live.com> wrote:

> Thanks Michael,
> It worked like a charm! I have a few more queries:
> 1. Is there a way to control the size of the parquet files?
> 2. Which method do you recommend: coalesce(n, true), coalesce(n, false), or
> repartition(n)?
>
> Thanks & Regards
> Tridib
