Repartition and Worker Instances
Hi,

If I repartition my data by a factor equal to the number of worker instances, will the performance be better or worse? As far as I understand, the performance should be better, but in my case it is becoming worse. I have a single-node standalone cluster; is it because of this? Am I guaranteed better performance if I do the same thing in a multi-node cluster?

Thank you
Re: Repartition and Worker Instances
In Standalone mode, a Worker JVM starts an Executor. Inside the Executor there are slots for task threads. The slot count is configured by the num_cores setting. Generally you should oversubscribe this: if you have 10 free CPU cores, set num_cores to 20.
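The oversubscription rule of thumb above can be sketched as a trivial calculation (plain Python, illustrative only; the function name and the 2x factor are taken from this reply, not from any Spark default):

```python
def slots_for_worker(free_cores: int, oversubscribe: int = 2) -> int:
    """Rule of thumb from the reply above: advertise roughly 2x the
    free CPU cores as task slots, so that threads blocked on I/O
    don't leave a core idle."""
    return free_cores * oversubscribe

# 10 free cores -> configure 20 task slots
print(slots_for_worker(10))  # 20
```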
Re: Repartition and Worker Instances
In general you should first figure out how many task slots are in the cluster and then repartition the RDD to maybe 2x that number. So if you have 100 slots, RDDs with a partition count of 100-300 would be normal.

But the size of each partition also matters. You want a task to operate on a partition for at least 200 ms, but no longer than around 20 seconds. Even if you have 100 slots, it can be fine to have an RDD with 10,000 partitions if you've read in a large file.

So don't repartition your RDD to match the number of Worker JVMs; rather, align it to the total number of task slots across the Executors.

If you're running on a single node, shuffle operations become almost free (because there's no network movement), so don't read too much into any performance metrics you've collected to extrapolate what may happen at scale.
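The sizing heuristics above can be captured in a small sketch (plain Python; the function names and the 1-3x bounds simply restate the advice in this reply and are not anything Spark itself enforces):

```python
def partition_range(total_task_slots: int) -> tuple:
    """Roughly 1-3x the cluster's total task slots is a normal
    partition count, per the advice above."""
    return total_task_slots, 3 * total_task_slots

def task_duration_ok(per_task_ms: float,
                     min_ms: float = 200,
                     max_ms: float = 20_000) -> bool:
    """Each task should run for at least ~200 ms but no longer than
    ~20 s; outside that range, consider changing the partition count."""
    return min_ms <= per_task_ms <= max_ms

print(partition_range(100))   # (100, 300): a 100-slot cluster
print(task_duration_ok(50))   # False: tasks this short are dominated by overhead
```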
Re: Repartition and Worker Instances
How is a task slot different from the number of Workers?

Also, I did not understand this part: "so don't read into any performance metrics you've collected to extrapolate what may happen at scale."

Thank you
Re: Repartition and Worker Instances
You mean SPARK_WORKER_CORES in /conf/spark-env.sh?
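If that is indeed the setting in question, a minimal spark-env.sh fragment might look like this (the value 20 assumes a 10-core box and the 2x oversubscription rule of thumb from earlier in the thread):

```shell
# conf/spark-env.sh (standalone mode)
# SPARK_WORKER_CORES sets how many task slots this Worker's Executors
# may use in total; it does not have to equal the physical core count.
export SPARK_WORKER_CORES=20   # e.g. 2x the 10 physical cores
```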