Hi,

Coalescing does not happen automatically; you need to control the number of partitions yourself. By default, the number of partitions respects `spark.default.parallelism`, which defaults to the number of cores on your machine.
http://spark.apache.org/docs/latest/configuration.html#execution-behavior
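For the example in the quoted mail below, that means calling `coalesce` (or `repartition` when you need a shuffle) by hand after the filter. A minimal sketch, assuming the same spark-shell session with its implicit SparkContext `sc`:

```scala
import scala.util.Random

Random.setSeed(4242)
// 1 million Gaussian samples, as in the quoted example
val ngauss = (1 to 1e6.toInt).map(_ => Random.nextGaussian)
val ngauss_rdd = sc.parallelize(ngauss)

// filter keeps the parent's partition count, so shrink it explicitly;
// coalesce(1) avoids a shuffle by merging existing partitions
val ngauss_rdd2 = ngauss_rdd.filter(_ > 4.0).coalesce(1)
ngauss_rdd2.partitions.size // now 1, regardless of the parent's 4
```

`coalesce` only narrows the partition count; to increase it (or rebalance skewed data) use `repartition`, which performs a full shuffle.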
// maropu

On Tue, Jan 17, 2017 at 11:58 AM, Suzen, Mehmet <su...@acm.org> wrote:
> Hello List,
>
> I was wondering what is the design principle by which the partition
> count of an RDD is inherited from its parent. See one simple example
> below [*]. 'ngauss_rdd2' has significantly less data; intuitively, in
> such cases, shouldn't Spark invoke coalesce automatically for
> performance? What would be the configuration option for this, if there
> is any?
>
> Best,
> -m
>
> [*]
> // Generate 1 million Gaussian random numbers
> import util.Random
> Random.setSeed(4242)
> val ngauss = (1 to 1e6.toInt).map(x => Random.nextGaussian)
> val ngauss_rdd = sc.parallelize(ngauss)
> ngauss_rdd.count // 1 million
> ngauss_rdd.partitions.size // 4
> val ngauss_rdd2 = ngauss_rdd.filter(x => x > 4.0)
> ngauss_rdd2.count // 35
> ngauss_rdd2.partitions.size // 4
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

--
---
Takeshi Yamamuro