Repartition and Worker Instances

2015-02-23 Thread Deep Pradhan
Hi,
If I repartition my data by a factor equal to the number of worker
instances, will the performance be better or worse?
As far as I understand, the performance should be better, but in my case it
becomes worse.
I have a single-node standalone cluster; is it because of this?
Am I guaranteed better performance if I do the same thing on a
multi-node cluster?

Thank You


Re: Repartition and Worker Instances

2015-02-23 Thread Sameer Farooqui
In Standalone mode, a Worker JVM starts an Executor. Inside the Executor there
are slots for task threads. The slot count is configured by the num_cores
setting. Generally, oversubscribe this. So if you have 10 free CPU cores,
set num_cores to 20.
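
For example (just a sketch, and assuming the num_cores above corresponds to
SPARK_WORKER_CORES in conf/spark-env.sh on each Worker machine), a 10-core
box could be oversubscribed like this:

    # conf/spark-env.sh
    # Advertise 2x the physical cores so Executors can be given
    # more task slots than there are physical CPU cores.
    export SPARK_WORKER_CORES=20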

On Monday, February 23, 2015, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 How is a task slot different from the # of Workers?


  so don't read into any performance metrics you've collected to
 extrapolate what may happen at scale.
 I did not quite get what you mean here.

 Thank You


Re: Repartition and Worker Instances

2015-02-23 Thread Sameer Farooqui
In general you should first figure out how many task slots are in the
cluster and then repartition the RDD to maybe 2x that #. So if you have
100 slots, then RDDs with a partition count of 100-300 would be normal.

But the size of each partition also matters. You want a task to operate on a
partition for at least 200ms, but no longer than around 20 seconds.

Even if you have 100 slots, it could be okay to have an RDD with 10,000
partitions if you've read in a large file.

So don't repartition your RDD to match the # of Worker JVMs, but rather
align it to the total # of task slots in the Executors.

If you're running on a single node, shuffle operations become almost free
(because there's no network movement), so don't read into any performance
metrics you've collected to extrapolate what may happen at scale.
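
As a rough sketch of that rule of thumb in spark-shell (the input path is just
a placeholder; in Standalone mode sc.defaultParallelism reports the total
Executor core count, i.e. the total number of task slots, unless
spark.default.parallelism has been set explicitly):

    import org.apache.spark.rdd.RDD

    // Total task slots ~= total Executor cores across the cluster.
    val slots = sc.defaultParallelism

    // Repartition to roughly 2x the slot count, not to the # of Worker JVMs.
    val input: RDD[String] = sc.textFile("hdfs:///path/to/input")
    val tuned = input.repartition(slots * 2)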


On Monday, February 23, 2015, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Hi,
 If I repartition my data by a factor equal to the number of worker
 instances, will the performance be better or worse?
 As far as I understand, the performance should be better, but in my case
 it becomes worse.
 I have a single-node standalone cluster; is it because of this?
 Am I guaranteed better performance if I do the same thing on a
 multi-node cluster?

 Thank You



Re: Repartition and Worker Instances

2015-02-23 Thread Deep Pradhan
How is a task slot different from the # of Workers?


 so don't read into any performance metrics you've collected to
extrapolate what may happen at scale.
I did not quite get what you mean here.

Thank You

On Mon, Feb 23, 2015 at 10:52 PM, Sameer Farooqui same...@databricks.com
wrote:

 In general you should first figure out how many task slots are in the
 cluster and then repartition the RDD to maybe 2x that #. So if you have
 100 slots, then RDDs with a partition count of 100-300 would be normal.

 But the size of each partition also matters. You want a task to operate on
 a partition for at least 200ms, but no longer than around 20 seconds.

 Even if you have 100 slots, it could be okay to have an RDD with 10,000
 partitions if you've read in a large file.

 So don't repartition your RDD to match the # of Worker JVMs, but rather
 align it to the total # of task slots in the Executors.

 If you're running on a single node, shuffle operations become almost free
 (because there's no network movement), so don't read into any performance
 metrics you've collected to extrapolate what may happen at scale.



Re: Repartition and Worker Instances

2015-02-23 Thread Deep Pradhan
You mean SPARK_WORKER_CORES in /conf/spark-env.sh?

On Mon, Feb 23, 2015 at 11:06 PM, Sameer Farooqui same...@databricks.com
wrote:

 In Standalone mode, a Worker JVM starts an Executor. Inside the Executor there
 are slots for task threads. The slot count is configured by the num_cores
 setting. Generally, oversubscribe this. So if you have 10 free CPU cores,
 set num_cores to 20.

