I believe coalesce(n, true) and repartition(n) are the same: repartition simply calls coalesce with shuffle set to true. If the input files are of similar sizes, then coalesce(n, false) will be cheaper, as it introduces a narrow dependency <https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf>, meaning there won't be a shuffle. However, if there is a lot of skew in the input file sizes, then a repartition will shuffle the data so that it is spread evenly across the output partitions.
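To make the trade-off concrete, here is a rough sketch in plain Java (no Spark needed) of what happens to partition sizes in each case. It is only a toy model: the class and method names are made up for illustration, and Spark's real coalesce groups partitions by locality rather than strictly by position. In actual code you would just call coalesce(n, false) or repartition(n) on the RDD.

```java
import java.util.Arrays;

// Toy model only: hypothetical names, not Spark's implementation.
public class PartitionSketch {

    // Models coalesce(n, false): each output partition absorbs a contiguous
    // group of input partitions, so records never move between groups.
    // Assumes n <= sizes.length (without a shuffle you cannot gain partitions).
    static long[] coalesceNoShuffle(long[] sizes, int n) {
        long[] out = new long[n];
        for (int i = 0; i < sizes.length; i++) {
            int target = (int) ((long) i * n / sizes.length);
            out[target] += sizes[i];
        }
        return out;
    }

    // Models repartition(n): all records are shuffled and spread evenly.
    static long[] repartition(long[] sizes, int n) {
        long total = Arrays.stream(sizes).sum();
        long[] out = new long[n];
        Arrays.fill(out, total / n);
        // Hand out the remainder one record at a time.
        for (int i = 0; i < total % n; i++) {
            out[i]++;
        }
        return out;
    }

    public static void main(String[] args) {
        // One large input partition and five small ones (record counts).
        long[] skewed = {1000, 10, 10, 10, 10, 10};
        System.out.println(Arrays.toString(coalesceNoShuffle(skewed, 2))); // [1020, 30]
        System.out.println(Arrays.toString(repartition(skewed, 2)));       // [525, 525]
    }
}
```

With skewed input the no-shuffle version keeps the skew (one output ends up far larger than the other), while the shuffled version evens it out at the cost of moving every record.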
There is currently no way to control the file size other than picking a 'good' number of partitions.

On Tue, Nov 25, 2014 at 11:30 AM, tridib <tridib.sama...@live.com> wrote:

> Thanks Michael,
> It worked like a charm! I have few more queries:
> 1. Is there a way to control the size of parquet file?
> 2. Which method do you recommend coalesce(n, true), coalesce(n, false) or
> repartition(n)?
>
> Thanks & Regards
> Tridib
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Control-number-of-parquet-generated-from-JavaSchemaRDD-tp19717p19789.html
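
On the file-size question, one way to pick a 'good' number of partitions is to divide an estimate of the total output size by the file size you are aiming for and round up. The helper below is a hypothetical sketch of that arithmetic, not a Spark API, and the 10 GiB and 256 MiB figures are made-up examples. Since Parquet encodes and compresses the data, the size estimate is only approximate, so treat the result as a starting point.

```java
// Hypothetical helper, not part of Spark: estimates a partition count
// from a total size estimate and a desired size per output file.
public class PartitionCount {

    static int partitionsFor(long totalBytes, long targetFileBytes) {
        if (totalBytes <= 0 || targetFileBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        // Ceiling division, so a leftover chunk still gets its own file.
        long n = (totalBytes + targetFileBytes - 1) / targetFileBytes;
        return (int) Math.max(1, n);
    }

    public static void main(String[] args) {
        long total = 10L * 1024 * 1024 * 1024; // assumed ~10 GiB of output
        long target = 256L * 1024 * 1024;      // assumed ~256 MiB per file
        System.out.println(partitionsFor(total, target)); // 40
    }
}
```

The resulting number is what you would pass as n to coalesce or repartition before writing the parquet files.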