Re: partition size inherited from parent: auto coalesce
Hi,

The coalesce does not happen automatically now; you need to control the number of partitions yourself. Basically, the number of partitions respects `spark.default.parallelism`, which by default is the number of cores on your machine.
http://spark.apache.org/docs/latest/configuration.html#execution-behavior

// maropu

On Tue, Jan 17, 2017 at 11:58 AM, Suzen, Mehmet wrote:
> Hello List,
>
> I was wondering what the design principle is such that the number of
> partitions of an RDD is inherited from its parent. See the simple
> example below [*]. 'ngauss_rdd2' holds significantly less data;
> intuitively, shouldn't Spark invoke coalesce automatically in such
> cases for performance? Is there a configuration option for this?
>
> Best,
> -m
>
> [*]
> // Generate 1 million Gaussian random numbers
> import util.Random
> Random.setSeed(4242)
> val ngauss = (1 to 1e6.toInt).map(x => Random.nextGaussian)
> val ngauss_rdd = sc.parallelize(ngauss)
> ngauss_rdd.count // 1 million
> ngauss_rdd.partitions.size // 4
> val ngauss_rdd2 = ngauss_rdd.filter(x => x > 4.0)
> ngauss_rdd2.count // 35
> ngauss_rdd2.partitions.size // 4
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

--
---
Takeshi Yamamuro
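A minimal sketch of the manual approach described above, assuming a running Spark shell with a SparkContext named `sc` (it will not run standalone without a Spark runtime):

```scala
// Assumes a Spark shell, where `sc` is the SparkContext.
import scala.util.Random

Random.setSeed(4242)
val ngauss = (1 to 1e6.toInt).map(_ => Random.nextGaussian)
val ngauss_rdd = sc.parallelize(ngauss)

// A highly selective filter: the child RDD keeps the parent's
// partition count, so most of its partitions are nearly empty.
val ngauss_rdd2 = ngauss_rdd.filter(_ > 4.0)

// Shrink the partition count explicitly. coalesce merges existing
// partitions without a full shuffle (pass shuffle = true to force one).
val compacted = ngauss_rdd2.coalesce(1)
compacted.partitions.size // 1
```

Choosing the target count is up to the caller; for tiny post-filter results a single partition is usually fine, while larger results might use something proportional to the surviving data size.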
partition size inherited from parent: auto coalesce
Hello List,

I was wondering what the design principle is such that the number of
partitions of an RDD is inherited from its parent. See the simple
example below [*]. 'ngauss_rdd2' holds significantly less data;
intuitively, shouldn't Spark invoke coalesce automatically in such
cases for performance? Is there a configuration option for this?

Best,
-m

[*]
// Generate 1 million Gaussian random numbers
import util.Random
Random.setSeed(4242)
val ngauss = (1 to 1e6.toInt).map(x => Random.nextGaussian)
val ngauss_rdd = sc.parallelize(ngauss)
ngauss_rdd.count // 1 million
ngauss_rdd.partitions.size // 4
val ngauss_rdd2 = ngauss_rdd.filter(x => x > 4.0)
ngauss_rdd2.count // 35
ngauss_rdd2.partitions.size // 4