Hi,

Coalescing does not happen automatically; you need to control the number of partitions yourself. By default, the number of partitions respects `spark.default.parallelism`, which defaults to the number of cores on your machine.
http://spark.apache.org/docs/latest/configuration.html#execution-behavior
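For the example in the quoted mail below, that means calling `coalesce` (or `repartition` when you need a shuffle) by hand after the filter. A minimal sketch, assuming the same spark-shell session with its implicit SparkContext `sc`:

```scala
import scala.util.Random

Random.setSeed(4242)
// 1 million Gaussian samples, as in the quoted example
val ngauss = (1 to 1e6.toInt).map(_ => Random.nextGaussian)
val ngauss_rdd = sc.parallelize(ngauss)

// filter keeps the parent's partition count, so shrink it explicitly;
// coalesce(1) avoids a shuffle by merging existing partitions
val ngauss_rdd2 = ngauss_rdd.filter(_ > 4.0).coalesce(1)
ngauss_rdd2.partitions.size // now 1, regardless of the parent's 4
```

`coalesce` only narrows the partition count; to increase it (or rebalance skewed data) use `repartition`, which performs a full shuffle.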
// maropu

On Tue, Jan 17, 2017 at 11:58 AM, Suzen, Mehmet <su...@acm.org> wrote:
> Hello List,
>
> I was wondering what is the design principle by which the partition
> count of an RDD is inherited from its parent. See one simple example
> below [*]. 'ngauss_rdd2' has significantly less data; intuitively, in
> such cases, shouldn't Spark invoke coalesce automatically for
> performance? What would be the configuration option for this, if there
> is any?
>
> Best,
> -m
>
> [*]
> // Generate 1 million Gaussian random numbers
> import util.Random
> Random.setSeed(4242)
> val ngauss = (1 to 1e6.toInt).map(x => Random.nextGaussian)
> val ngauss_rdd = sc.parallelize(ngauss)
> ngauss_rdd.count // 1 million
> ngauss_rdd.partitions.size // 4
> val ngauss_rdd2 = ngauss_rdd.filter(x => x > 4.0)
> ngauss_rdd2.count // 35
> ngauss_rdd2.partitions.size // 4
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

--
---
Takeshi Yamamuro