Re: Question about Parallel Stages in Spark
Thanks, Bryan. This is one Spark application with one job. The job has 3 stages: the first 2 are basic reads from Cassandra tables, and the 3rd is a join between the two. I was expecting the first 2 stages to run in parallel; however, they run serially. The job has enough resources.

On Tue, Jun 27, 2017 at 4:03 AM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote:

> Satish,
>
> Are these two separate applications submitted to the YARN scheduler? If so,
> you would expect the original case to run in parallel.
>
> However, if this is one application, your submission to YARN only guarantees
> that the application will fairly contend for resources with other
> applications. The internal output operations within your application (jobs)
> are scheduled by the driver (running on a single AM). This means that
> whatever driver options and code you've set will affect the application, but
> the YARN scheduler will not (beyond allocating cores, memory, etc. between
> applications).
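A minimal sketch of the job shape described in this message, assuming a local master and using two in-memory RDDs as hypothetical stand-ins for the Cassandra table reads (the names JoinStagesSketch, joinedCount, left, and right are illustrative and do not come from the thread). A single count() on the joined RDD yields one job with two shuffle map stages and one result stage; toDebugString prints the lineage so the stage boundaries can be compared with what the Spark UI shows.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object JoinStagesSketch {
  // Builds one job whose final stage joins the outputs of two independent map stages.
  def joinedCount(sc: SparkContext): Long = {
    // Hypothetical stand-ins for the two Cassandra table reads.
    val left  = sc.parallelize(1 to 1000, 4).map(i => (i % 10, i))
    val right = sc.parallelize(1 to 1000, 4).map(i => (i % 10, i.toString))

    // The join needs a shuffle of both inputs, so each input becomes its own stage.
    val joined = left.join(right)

    // Prints the lineage of the joined RDD, including its shuffle dependencies.
    println(joined.toDebugString)

    // A single action, so a single job.
    joined.count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("join-stages-sketch")
      .master("local[4]") // assumption: local run, for illustration only
      .getOrCreate()
    try println(s"rows: ${joinedCount(spark.sparkContext)}")
    finally spark.stop()
  }
}
```

Neither map stage depends on the other, so the scheduler is free to submit both at once; whether their tasks actually overlap still depends on free task slots. One thing that may be worth checking, as an assumption rather than something confirmed in this thread: within a single job, Spark's FIFO ordering offers slots to the lower-numbered stage first, so a first stage with enough tasks to fill every core can make the second stage look serialized.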
Re: Question about Parallel Stages in Spark
Satish,

Are these two separate applications submitted to the YARN scheduler? If so, you would expect the original case to run in parallel.

However, if this is one application, your submission to YARN only guarantees that the application will fairly contend for resources with other applications. The internal output operations within your application (jobs) are scheduled by the driver (running on a single AM). This means that whatever driver options and code you've set will affect the application, but the YARN scheduler will not (beyond allocating cores, memory, etc. between applications).

On Tue, Jun 27, 2017 at 2:33 AM -0400, "satish lalam" <satish.la...@gmail.com> wrote:

> Thanks, all. To reiterate: stages inside a job can run in parallel as long
> as (a) there is no sequential dependency between them and (b) the job has
> sufficient resources.
>
> However, my code was launching 2 jobs, and they are sequential, as you
> rightly pointed out.
>
> The issue I was trying to highlight with that piece of pseudocode was this:
> I am observing a job with 2 stages that don't depend on each other (they
> both read data from 2 separate tables in a db). Both stages are scheduled
> and both get resources, but the 2nd stage does not really pick up until the
> 1st stage is complete. It might be due to the db driver; I will post it to
> the right forum. Thanks.
Re: Question about Parallel Stages in Spark
Thanks, all. To reiterate: stages inside a job can run in parallel as long as (a) there is no sequential dependency between them and (b) the job has sufficient resources.

However, my code was launching 2 jobs, and they are sequential, as you rightly pointed out.

The issue I was trying to highlight with that piece of pseudocode was this: I am observing a job with 2 stages that don't depend on each other (they both read data from 2 separate tables in a db). Both stages are scheduled and both get resources, but the 2nd stage does not really pick up until the 1st stage is complete. It might be due to the db driver; I will post it to the right forum. Thanks.

On Mon, Jun 26, 2017 at 9:12 PM, Pralabh Kumar <pralabhku...@gmail.com> wrote:

> I think my words were also misunderstood. My point is that the two jobs
> will not be submitted together, since they are part of one thread.
>
>     import org.apache.spark.sql.SparkSession
>
>     val spark = SparkSession.builder()
>       .appName("practice")
>       .config("spark.scheduler.mode", "FAIR")
>       .enableHiveSupport()
>       .getOrCreate()
>     val sc = spark.sparkContext
>
>     // Two actions called back to back from the same thread:
>     // the second job is submitted only after the first collect() returns.
>     sc.parallelize(1 to 1000).map(s => Thread.sleep(1)).collect()
>     sc.parallelize(1 to 1000).map(s => Thread.sleep(1)).collect()
>     Thread.sleep(1000)
>
> I ran this, and the submission times of the two jobs are different.
>
> Please let me know if I am wrong.
Re: Question about Parallel Stages in Spark
I think my words were also misunderstood. My point is that the two jobs will not be submitted together, since they are part of one thread.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("practice")
      .config("spark.scheduler.mode", "FAIR")
      .enableHiveSupport()
      .getOrCreate()
    val sc = spark.sparkContext

    // Two actions called back to back from the same thread:
    // the second job is submitted only after the first collect() returns.
    sc.parallelize(1 to 1000).map(s => Thread.sleep(1)).collect()
    sc.parallelize(1 to 1000).map(s => Thread.sleep(1)).collect()
    Thread.sleep(1000)

I ran this, and the submission times of the two jobs are different.

Please let me know if I am wrong.

On Tue, Jun 27, 2017 at 9:17 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:

> My words caused a misunderstanding.
> Step 1: A is submitted to Spark.
> Step 2: B is submitted to Spark.
>
> Spark gets two independent jobs. FAIR scheduling is used to schedule A and B.
>
> Jeffrey's code did not cause two submits.
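One way to check the claim about submission times, sketched under the assumption of a local master: register a SparkListener and print when each job starts and ends. The object and method names (SequentialJobsSketch, runTwoJobs) are hypothetical; SparkListener, SparkListenerJobStart, and SparkListenerJobEnd are part of Spark's scheduler API. Listener events are delivered asynchronously, so the printed lines may appear slightly after the actions return.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}
import org.apache.spark.sql.SparkSession

object SequentialJobsSketch {
  // Runs two actions back to back on the calling thread and returns the result sizes.
  def runTwoJobs(sc: SparkContext): (Int, Int) = {
    // Print the timestamps Spark records for each job's start and end.
    sc.addSparkListener(new SparkListener {
      override def onJobStart(e: SparkListenerJobStart): Unit =
        println(s"job ${e.jobId} started at ${e.time}")
      override def onJobEnd(e: SparkListenerJobEnd): Unit =
        println(s"job ${e.jobId} ended at ${e.time}")
    })

    // collect() blocks, so the second job cannot be submitted until the first returns.
    val a = sc.parallelize(1 to 1000).map { i => Thread.sleep(1); i }.collect()
    val b = sc.parallelize(1 to 1000).map { i => Thread.sleep(1); i }.collect()
    (a.length, b.length)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sequential-jobs-sketch")
      .master("local[4]") // assumption: local run, for illustration only
      .config("spark.scheduler.mode", "FAIR")
      .getOrCreate()
    try println(runTwoJobs(spark.sparkContext))
    finally spark.stop()
  }
}
```

If the two actions really are submitted one after the other, the second job's start time should not precede the first job's end time, regardless of the scheduler mode.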
Re: Question about Parallel Stages in Spark
My words caused a misunderstanding.
Step 1: A is submitted to Spark.
Step 2: B is submitted to Spark.

Spark gets two independent jobs. FAIR scheduling is used to schedule A and B.

Jeffrey's code did not cause two submits.

---Original---
From: "Pralabh Kumar" <pralabhku...@gmail.com>
Date: 2017/6/27 12:09:27
To: "萝卜丝炒饭" <1427357...@qq.com>
Cc: "user" <user@spark.apache.org>; "satishl" <satish.la...@gmail.com>; "Bryan Jeffrey" <bryan.jeff...@gmail.com>
Subject: Re: Question about Parallel Stages in Spark

Hi,

I don't think Spark will receive two submits. It will execute one submit and then move on to the next one. If the application is multithreaded and two threads submit jobs at the same time, then they will run in parallel, provided the scheduler is FAIR and task slots are available.

But within one thread, one submit will complete and then the other will start. If there are independent stages in one job, those will run in parallel.

I agree with Bryan Jeffrey.

Regards,
Pralabh Kumar
Re: Question about Parallel Stages in Spark
Hi,

I don't think Spark will receive two submits. It will execute one submit and then move on to the next one. If the application is multithreaded and two threads submit jobs at the same time, then they will run in parallel, provided the scheduler is FAIR and task slots are available.

But within one thread, one submit will complete and then the other will start. If there are independent stages in one job, those will run in parallel.

I agree with Bryan Jeffrey.

Regards,
Pralabh Kumar

On Tue, Jun 27, 2017 at 9:03 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:

> I think the Spark cluster receives two submits, A and B.
> FAIR scheduling is used to schedule A and B.
> I am not sure about this.
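The multithreaded case described in this message can be sketched as follows, assuming a local master and FAIR mode. Each thread tags its job with a scheduler pool through setLocalProperty, which is a per-thread setting. The pool names poolA and poolB and the names FairPoolsSketch and sumsFromTwoThreads are hypothetical; pools that are not declared in an allocation file fall back to default settings.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object FairPoolsSketch {
  // Submits one job per thread, each in its own (hypothetical) fair-scheduler pool.
  def sumsFromTwoThreads(sc: SparkContext): Seq[Double] = {
    val threads = Executors.newFixedThreadPool(2)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(threads)
    try {
      val jobs = Seq("poolA", "poolB").map { pool =>
        Future {
          // Local properties are per thread, so this only affects jobs from this thread.
          sc.setLocalProperty("spark.scheduler.pool", pool)
          sc.parallelize(1 to 100).sum()
        }
      }
      // Block the calling thread until both jobs have finished.
      Await.result(Future.sequence(jobs), 10.minutes)
    } finally threads.shutdown()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fair-pools-sketch")
      .master("local[4]") // assumption: local run, for illustration only
      .config("spark.scheduler.mode", "FAIR")
      .getOrCreate()
    try println(sumsFromTwoThreads(spark.sparkContext))
    finally spark.stop()
  }
}
```

Because both actions are issued from different threads, neither has to wait for the other to return before its job is submitted; how the tasks then share the available slots is up to the scheduler mode and pool settings.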
Re: Question about Parallel Stages in Spark
I think the Spark cluster receives two submits, A and B.
FAIR scheduling is used to schedule A and B.
I am not sure about this.

---Original---
From: "Bryan Jeffrey" <bryan.jeff...@gmail.com>
Date: 2017/6/27 08:55:42
To: "satishl" <satish.la...@gmail.com>
Cc: "user" <user@spark.apache.org>
Subject: Re: Question about Parallel Stages in Spark

Hello.

The driver runs the individual operations in series, but each operation is parallelized internally. If you want them to run in parallel, you need to give the driver a mechanism to thread the job scheduling out:

    import org.apache.spark.rdd.RDD
    import scala.collection.parallel.mutable.ParArray

    val rdd1 = sc.parallelize(1 to 10)
    val rdd2 = sc.parallelize(1 to 20)

    // .par turns the array into a parallel collection, so its elements can be
    // processed on separate driver threads and the jobs submitted concurrently.
    val thingsToDo: ParArray[(RDD[Int], Int)] = Array(rdd1, rdd2).zipWithIndex.par

    thingsToDo.foreach { case (rdd, index) =>
      for (i <- 1 to 1)
        logger.info(s"Index ${index} - ${rdd.sum()}") // logger: any logger already in scope
    }

This will run both operations in parallel.
Re: Question about Parallel Stages in Spark
Hello.

The driver runs the individual operations in series, but each operation is parallelized internally. If you want them to run in parallel, you need to give the driver a mechanism to thread the job scheduling out:

    import org.apache.spark.rdd.RDD
    import scala.collection.parallel.mutable.ParArray

    val rdd1 = sc.parallelize(1 to 10)
    val rdd2 = sc.parallelize(1 to 20)

    // .par turns the array into a parallel collection, so its elements can be
    // processed on separate driver threads and the jobs submitted concurrently.
    val thingsToDo: ParArray[(RDD[Int], Int)] = Array(rdd1, rdd2).zipWithIndex.par

    thingsToDo.foreach { case (rdd, index) =>
      for (i <- 1 to 1)
        logger.info(s"Index ${index} - ${rdd.sum()}") // logger: any logger already in scope
    }

This will run both operations in parallel.

On Mon, Jun 26, 2017 at 8:10 PM, satishl <satish.la...@gmail.com> wrote:

> For the code below, since rdd1 and rdd2 don't depend on each other, I was
> expecting the first and second printlns to be interleaved. However, the
> Spark job runs all the "first" statements first and then all the "second"
> statements, in serial fashion. I have set spark.scheduler.mode = FAIR.
> Obviously my understanding of parallel stages is wrong. What am I missing?
>
>     val rdd1 = sc.parallelize(1 to 100)
>     val rdd2 = sc.parallelize(1 to 100)
>
>     for (i <- (1 to 100))
>       println("first: " + rdd1.sum())
>     for (i <- (1 to 100))
>       println("second" + rdd2.sum())
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Question-about-Parallel-Stages-in-Spark-tp28793.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
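The same idea can also be sketched with plain Scala Futures instead of a parallel collection, which may be convenient on Scala 2.13, where .par lives in the separate scala-parallel-collections module. This is only an illustration: the helper runConcurrently and the object ConcurrentActions are hypothetical names, and the sleeping closures stand in for blocking Spark actions such as rdd1.sum().

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ConcurrentActions {
  // Runs each blocking action on its own thread and returns the results in input order.
  def runConcurrently[A](actions: Seq[() => A], timeout: Duration = 10.minutes): Seq[A] = {
    // One thread per action, so no action waits for another to return.
    val pool = Executors.newFixedThreadPool(math.max(1, actions.size))
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try Await.result(Future.sequence(actions.map(a => Future(a()))), timeout)
    finally pool.shutdown()
  }

  def main(args: Array[String]): Unit = {
    // Stand-ins for blocking Spark actions; each sleeps to mimic a running job.
    val results = runConcurrently(Seq(
      () => { Thread.sleep(200); (1 to 10).sum },
      () => { Thread.sleep(200); (1 to 20).sum }
    ))
    println(results)
  }
}
```

With a SparkContext in scope, the call would look like runConcurrently(Seq(() => rdd1.sum(), () => rdd2.sum())). Each action is then issued from its own driver thread, which is what allows Spark to see both jobs at the same time.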