Re: partition size inherited from parent: auto coalesce

2017-01-16 Thread Takeshi Yamamuro
Hi,

The coalesce does not automatically happen now and you need to control the
number for yourself.
Basically, #partitions respect a `spark.default.parallelism`  number, by
default, #cores for your computer.
http://spark.apache.org/docs/latest/configuration.html#execution-behavior

// maropu

On Tue, Jan 17, 2017 at 11:58 AM, Suzen, Mehmet  wrote:

> Hello List,
>
>  I was wondering what is the design principle that partition size of
> an RDD is inherited from the parent.  See one simple example below
> [*]. 'ngauss_rdd2' has significantly less data, intuitively in such
> cases, shouldn't spark invoke coalesce automatically for performance?
> What would be the configuration option for this if there is any?
>
> Best,
> -m
>
> [*]
> // Generate 1 million Gaussian random numbers
> import util.Random
> Random.setSeed(4242)
> val ngauss = (1 to 1e6.toInt).map(x=>Random.nextGaussian)
> val ngauss_rdd = sc.parallelize(ngauss)
> ngauss_rdd.count // 1 million
> ngauss_rdd.partitions.size // 4
> val ngauss_rdd2 = ngauss_rdd.filter(x=>x > 4.0)
> ngauss_rdd2.count // 35
> ngauss_rdd2.partitions.size // 4
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
---
Takeshi Yamamuro


partition size inherited from parent: auto coalesce

2017-01-16 Thread Suzen, Mehmet
Hello List,

 I was wondering what is the design principle that partition size of
an RDD is inherited from the parent.  See one simple example below
[*]. 'ngauss_rdd2' has significantly less data, intuitively in such
cases, shouldn't spark invoke coalesce automatically for performance?
What would be the configuration option for this if there is any?

Best,
-m

[*]
// Generate 1 million Gaussian random numbers
import util.Random
Random.setSeed(4242)
val ngauss = (1 to 1e6.toInt).map(x=>Random.nextGaussian)
val ngauss_rdd = sc.parallelize(ngauss)
ngauss_rdd.count // 1 million
ngauss_rdd.partitions.size // 4
val ngauss_rdd2 = ngauss_rdd.filter(x=>x > 4.0)
ngauss_rdd2.count // 35
ngauss_rdd2.partitions.size // 4

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



partition size inherited from parent: auto coalesce

2017-01-16 Thread Suzen, Mehmet
Hello List,

 I was wondering what is the design principle that partition size of
an RDD is inherited from the parent.  See one simple example below
[*]. 'ngauss_rdd2' has significantly less data, intuitively in such
cases, shouldn't spark invoke coalesce automatically for performance?
What would be the configuration option for this if there is any?

Best,
-m

[*]
// Generate 1 million Gaussian random numbers
import util.Random
Random.setSeed(4242)
val ngauss = (1 to 1e6.toInt).map(x=>Random.nextGaussian)
val ngauss_rdd = sc.parallelize(ngauss)
ngauss_rdd.count // 1 million
ngauss_rdd.partitions.size // 4
val ngauss_rdd2 = ngauss_rdd.filter(x=>x > 4.0)
ngauss_rdd2.count // 35
ngauss_rdd2.partitions.size // 4

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org