subject:"Doing RDD.count in parallel , at at least parallelize it as much as possible\?"

Re: Doing RDD.count in parallel , at at least parallelize it as much as possible?

2014-10-31 Thread Sean Owen

cache() won't speed up a single operation on an RDD, since it is
computed the same way before it is persisted.

On Thu, Oct 30, 2014 at 7:15 PM, Sameer Farooqui same...@databricks.com wrote:
 By the way, in case you haven't done so, do try to .cache() the RDD before
 running a .count() on it as that could make a big speed improvement.



 On Thu, Oct 30, 2014 at 11:12 AM, Sameer Farooqui same...@databricks.com
 wrote:

 Hi Shahab,

 Are you running Spark in Local, Standalone, YARN or Mesos mode?

 If you're running in Standalone/YARN/Mesos, then the .count() action is
 indeed automatically parallelized across multiple Executors.

 When you run a .count() on an RDD, it is actually distributing tasks to
 different executors to each do a local count on a local partition and then
 all the tasks send their sub-counts back to the driver for final
 aggregation. This sounds like the kind of behavior you're looking for.

 However, in Local mode, everything runs in a single JVM (the driver +
 executor), so there's no parallelization across Executors.



 On Thu, Oct 30, 2014 at 10:25 AM, shahab shahab.mok...@gmail.com wrote:

 Hi,

 I noticed that the count (of RDD)  in many of my queries is the most
 time consuming one as it runs in the driver process rather then done by
 parallel worker nodes,

 Is there any way to perform count in parallel , at at least parallelize
 it as much as possible?

 best,
 /Shahab




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Doing RDD.count in parallel , at at least parallelize it as much as possible?

2014-10-30 Thread shahab

Hi,

I noticed that the count (of RDD)  in many of my queries is the most time
consuming one as it runs in the driver process rather then done by
parallel worker nodes,

Is there any way to perform count in parallel , at at least parallelize
 it as much as possible?

best,
/Shahab

Re: Doing RDD.count in parallel , at at least parallelize it as much as possible?

2014-10-30 Thread Sameer Farooqui

Hi Shahab,

Are you running Spark in Local, Standalone, YARN or Mesos mode?

If you're running in Standalone/YARN/Mesos, then the .count() action is
indeed automatically parallelized across multiple Executors.

When you run a .count() on an RDD, it is actually distributing tasks to
different executors to each do a local count on a local partition and then
all the tasks send their sub-counts back to the driver for final
aggregation. This sounds like the kind of behavior you're looking for.

However, in Local mode, everything runs in a single JVM (the driver +
executor), so there's no parallelization across Executors.



On Thu, Oct 30, 2014 at 10:25 AM, shahab shahab.mok...@gmail.com wrote:

 Hi,

 I noticed that the count (of RDD)  in many of my queries is the most
 time consuming one as it runs in the driver process rather then done by
 parallel worker nodes,

 Is there any way to perform count in parallel , at at least parallelize
  it as much as possible?

 best,
 /Shahab

Re: Doing RDD.count in parallel , at at least parallelize it as much as possible?

2014-10-30 Thread Sameer Farooqui

By the way, in case you haven't done so, do try to .cache() the RDD before
running a .count() on it as that could make a big speed improvement.



On Thu, Oct 30, 2014 at 11:12 AM, Sameer Farooqui same...@databricks.com
wrote:

 Hi Shahab,

 Are you running Spark in Local, Standalone, YARN or Mesos mode?

 If you're running in Standalone/YARN/Mesos, then the .count() action is
 indeed automatically parallelized across multiple Executors.

 When you run a .count() on an RDD, it is actually distributing tasks to
 different executors to each do a local count on a local partition and then
 all the tasks send their sub-counts back to the driver for final
 aggregation. This sounds like the kind of behavior you're looking for.

 However, in Local mode, everything runs in a single JVM (the driver +
 executor), so there's no parallelization across Executors.



 On Thu, Oct 30, 2014 at 10:25 AM, shahab shahab.mok...@gmail.com wrote:

 Hi,

 I noticed that the count (of RDD)  in many of my queries is the most
 time consuming one as it runs in the driver process rather then done by
 parallel worker nodes,

 Is there any way to perform count in parallel , at at least parallelize
  it as much as possible?

 best,
 /Shahab

Re: Doing RDD.count in parallel , at at least parallelize it as much as possible?

2014-10-30 Thread Sonal Goyal

Hey Sameer,

Wouldnt local[x] run count parallelly in each of the x threads?

Best Regards,
Sonal
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal



On Thu, Oct 30, 2014 at 11:42 PM, Sameer Farooqui same...@databricks.com
wrote:

 Hi Shahab,

 Are you running Spark in Local, Standalone, YARN or Mesos mode?

 If you're running in Standalone/YARN/Mesos, then the .count() action is
 indeed automatically parallelized across multiple Executors.

 When you run a .count() on an RDD, it is actually distributing tasks to
 different executors to each do a local count on a local partition and then
 all the tasks send their sub-counts back to the driver for final
 aggregation. This sounds like the kind of behavior you're looking for.

 However, in Local mode, everything runs in a single JVM (the driver +
 executor), so there's no parallelization across Executors.



 On Thu, Oct 30, 2014 at 10:25 AM, shahab shahab.mok...@gmail.com wrote:

 Hi,

 I noticed that the count (of RDD)  in many of my queries is the most
 time consuming one as it runs in the driver process rather then done by
 parallel worker nodes,

 Is there any way to perform count in parallel , at at least parallelize
  it as much as possible?

 best,
 /Shahab

Re: Doing RDD.count in parallel , at at least parallelize it as much as possible?

Doing RDD.count in parallel , at at least parallelize it as much as possible?

Re: Doing RDD.count in parallel , at at least parallelize it as much as possible?

Re: Doing RDD.count in parallel , at at least parallelize it as much as possible?

Re: Doing RDD.count in parallel , at at least parallelize it as much as possible?

5 matches

Site Navigation

Mail list logo

Footer information