Re: Question about Parallel Stages in Spark

2017-06-27 Thread satish lalam
Thanks Bryan. This is one Spark application with one job. The job has 3
stages: the first 2 are basic reads from Cassandra tables and the 3rd is a
join between the two. I was expecting the first 2 stages to run in
parallel; however, they run serially. The job has enough resources.
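
For reference, here is a minimal sketch of forcing the two reads to run
concurrently by submitting them from separate threads. It assumes the
DataStax spark-cassandra-connector DataFrame source and uses placeholder
keyspace/table/join-key names; each count() is its own job, so the driver
can work on both at once:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Placeholder keyspace/table names; "spark" is the SparkSession in scope.
def readTable(table: String) =
  spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "ks", "table" -> table))
    .load()
    .cache()

// count() materializes each cached read as its own job; running the two
// futures in parallel lets the scheduler work on both jobs at the same time.
val fa = Future { val a = readTable("table_a"); a.count(); a }
val fb = Future { val b = readTable("table_b"); b.count(); b }

val dfA = Await.result(fa, Duration.Inf)
val dfB = Await.result(fb, Duration.Inf)
val joined = dfA.join(dfB, Seq("id"))  // "id" is a placeholder join key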


Re: Question about Parallel Stages in Spark

2017-06-27 Thread Bryan Jeffrey
Satish,

Is this two separate applications submitted to the Yarn scheduler? If so,
then you would expect to see the original case run in parallel.

However, if this is one application, your submission to Yarn guarantees that
this application will contend fairly for resources with other applications.
The internal output operations within your application (jobs) will be
scheduled by the driver (running on a single AM). This means that whatever
driver options and code you've set will impact the application, but the Yarn
scheduler will have no impact beyond allocating cores, memory, etc. between
applications.
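
A minimal sketch of those driver-side settings, i.e. FAIR scheduler mode plus
a per-thread pool assignment (the pool name and allocation-file path are
placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("fair-scheduling-example")
  .config("spark.scheduler.mode", "FAIR")
  // Optional: named pools (weight, minShare) defined in an XML file.
  .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
  .getOrCreate()

val sc = spark.sparkContext

// Jobs submitted from this thread go to "pool_a"; another thread can use a
// different pool, and the FAIR scheduler shares task slots between them.
sc.setLocalProperty("spark.scheduler.pool", "pool_a")

None of this is visible to Yarn; it only affects how jobs inside the one
application share the executors that the application already holds.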

Re: Question about Parallel Stages in Spark

2017-06-27 Thread satish lalam
Thanks All. To reiterate - stages inside a job can run in parallel as long
as (a) there is no sequential dependency and (b) the job has sufficient
resources.
However, my code was launching 2 jobs, and they are sequential as you
rightly pointed out.
The issue I was trying to highlight with that piece of pseudocode, however,
was that I am observing a job with 2 stages which don't depend on each other
(they both read data from 2 separate tables in the db); both are scheduled
and both stages get resources, but the 2nd stage does not really pick up
until the 1st stage is complete. It might be due to the db driver - I will
post it to the right forum. Thanks.
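
For context, here is a minimal sketch of the shape described above (one job
whose two parent stages have no dependency on each other):

// The join shuffles both inputs, so count() submits one job with three
// stages: a shuffle-map stage per input plus the final join stage. The two
// map stages don't depend on each other; whether their tasks actually
// overlap depends on free task slots (and here, possibly, on the db driver).
val left  = sc.parallelize(1 to 100000).map(i => (i % 100, i))
val right = sc.parallelize(1 to 100000).map(i => (i % 100, i * 2))

left.join(right).count()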



Re: Question about Parallel Stages in Spark

2017-06-26 Thread Pralabh Kumar
I think my words were also misunderstood. My point is that they will not be
submitted together, since they are part of one thread.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("practice")
  .config("spark.scheduler.mode", "FAIR")
  .enableHiveSupport()
  .getOrCreate()
val sc = spark.sparkContext

// Two actions issued from the same (driver main) thread: each collect() is a
// separate job, and the second one is only submitted after the first finishes.
sc.parallelize(List(1.to(1000))).map(s => Thread.sleep(1)).collect()
sc.parallelize(List(1.to(1000))).map(s => Thread.sleep(1)).collect()
Thread.sleep(1000)


I ran this, and the submit times of the two jobs are different.

Please let me know if I am wrong.
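
For comparison, a small variant (a sketch using Scala futures on the default
global execution context) that submits the same two jobs from separate
threads, which is what gives the FAIR scheduler a chance to overlap them:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Each collect() is still its own job, but now the two jobs are submitted
// from different threads instead of one after the other.
val f1 = Future { sc.parallelize(List(1.to(1000))).map(s => Thread.sleep(1)).collect() }
val f2 = Future { sc.parallelize(List(1.to(1000))).map(s => Thread.sleep(1)).collect() }

Await.result(f1, Duration.Inf)
Await.result(f2, Duration.Inf)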



Re: Question about Parallel Stages in Spark

2017-06-26 Thread 萝卜丝炒饭
My words caused a misunderstanding.
Step 1: A is submitted to Spark.
Step 2: B is submitted to Spark.

Spark gets two independent jobs. The FAIR scheduler is used to schedule A and B.

Jeffrey's code did not cause two submits.

Re: Question about Parallel Stages in Spark

2017-06-26 Thread Pralabh Kumar
Hi

I don't think Spark will receive two submits; it will execute one submit and
then move to the next one. If the application is multithreaded, and two
threads are calling submit at the same time, then they will run in parallel
provided the scheduler is FAIR and task slots are available.

But in one thread, one submit will complete and then the other one will
start. If there are independent stages in one job, then those will run in
parallel.
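
A minimal sketch of that multithreaded case, using two plain threads so the
driver has two jobs in flight at the same time:

// Each thread triggers an action, i.e. submits a job; with FAIR scheduling
// and spare task slots the two jobs can make progress together.
val t1 = new Thread(new Runnable {
  override def run(): Unit = println(sc.parallelize(1 to 1000000).sum())
})
val t2 = new Thread(new Runnable {
  override def run(): Unit = println(sc.parallelize(1 to 1000000).sum())
})

t1.start(); t2.start()
t1.join(); t2.join()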

I agree with Bryan Jeffrey.


Regards
Pralabh Kumar



Re: Question about Parallel Stages in Spark

2017-06-26 Thread 萝卜丝炒饭
I think the Spark cluster receives two submits, A and B.
The FAIR scheduler is used to schedule A and B.
I am not sure about this.

Re: Question about Parallel Stages in Spark

2017-06-26 Thread Bryan Jeffrey
Hello.

The driver is running the individual operations in series, but each
operation is parallelized internally. If you want them to run in parallel,
you need to provide the driver with a mechanism to thread the job scheduling
out:

import scala.collection.parallel.mutable.ParArray
import org.apache.spark.rdd.RDD

val rdd1 = sc.parallelize(1 to 10)
val rdd2 = sc.parallelize(1 to 20)

// A parallel collection: foreach runs its body on multiple threads, so the
// sum() jobs below are submitted to the driver concurrently.
var thingsToDo: ParArray[(RDD[Int], Int)] = Array(rdd1, rdd2).zipWithIndex.par

thingsToDo.foreach { case (rdd, index) =>
  for (i <- (1 to 1))
    logger.info(s"Index ${index} - ${rdd.sum()}")  // logger: any logging framework logger in scope
}


This will run both operations in parallel.


On Mon, Jun 26, 2017 at 8:10 PM, satishl <satish.la...@gmail.com> wrote:

> For the below code, since rdd1 and rdd2 don't depend on each other, I was
> expecting that the "first" and "second" printlns would be interwoven.
> However, the Spark job runs all "first" statements first and then all
> "second" statements next, in serial fashion. I have set
> spark.scheduler.mode = FAIR. Obviously my understanding of parallel stages
> is wrong. What am I missing?
>
> val rdd1 = sc.parallelize(1 to 100)
> val rdd2 = sc.parallelize(1 to 100)
>
> for (i <- (1 to 100))
>   println("first: " + rdd1.sum())
> for (i <- (1 to 100))
>   println("second: " + rdd2.sum())
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Question-about-Parallel-Stages-in-Spark-tp28793.html