Thanks all. To reiterate: stages inside a job can run in parallel as long as (a) there is no sequential dependency between them, and (b) the job has sufficient resources. However, my code was launching 2 jobs, and they run sequentially, as you rightly pointed out. The issue I was trying to highlight with that piece of pseudocode is this: I am observing a job with 2 stages that don't depend on each other (each reads data from a separate table in the db); both stages are scheduled and both get resources, yet the 2nd stage does not actually start until the 1st stage is complete. It might be due to the db driver; I will post it to the right forum. A sketch of the job shape is below. Thanks.
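For concreteness, here is a minimal sketch of the shape of the job I am describing (the JDBC url and table names are made-up details, and it assumes the two tables share a compatible schema; only the structure matters). A single action over the union of two independent reads produces one job whose two scan stages have no dependency on each other:

// Hypothetical JDBC url and table names; only the job/stage structure matters.
val df1 = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "table_a")
  .load()
val df2 = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "table_b")
  .load()

// One action over the union => one job with two independent read stages.
// In my runs, the second scan still does not start until the first finishes.
println(df1.union(df2).count())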
On Mon, Jun 26, 2017 at 9:12 PM, Pralabh Kumar <pralabhku...@gmail.com> wrote:

> I think my words were also misunderstood. My point is that they will not
> be submitted together, since they are part of one thread.
>
> val spark = SparkSession.builder()
>   .appName("practice")
>   .config("spark.scheduler.mode", "FAIR")
>   .enableHiveSupport().getOrCreate()
> val sc = spark.sparkContext
> sc.parallelize(List(1.to(10000000))).map(s => Thread.sleep(10000)).collect()
> sc.parallelize(List(1.to(10000000))).map(s => Thread.sleep(10000)).collect()
> Thread.sleep(10000000)
>
> I ran this, and the submit times of the two jobs are different.
>
> Please let me know if I am wrong.
>
> On Tue, Jun 27, 2017 at 9:17 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
>
>> My words caused a misunderstanding.
>> Step 1: A is submitted to Spark.
>> Step 2: B is submitted to Spark.
>>
>> Spark gets two independent jobs. FAIR is used to schedule A and B.
>>
>> Jeffrey's code did not cause two submits.
>>
>> ---Original---
>> *From:* "Pralabh Kumar" <pralabhku...@gmail.com>
>> *Date:* 2017/6/27 12:09:27
>> *To:* "萝卜丝炒饭" <1427357...@qq.com>;
>> *Cc:* "user" <user@spark.apache.org>; "satishl" <satish.la...@gmail.com>; "Bryan Jeffrey" <bryan.jeff...@gmail.com>;
>> *Subject:* Re: Question about Parallel Stages in Spark
>>
>> Hi
>>
>> I don't think Spark will receive two submits. It will execute one submit
>> and then move to the next one. If the application is multithreaded and
>> two threads submit jobs at the same time, then the jobs will run in
>> parallel, provided the scheduler is FAIR and task slots are available.
>>
>> But in one thread, one submit will complete and then the next one will
>> start. If there are independent stages in one job, those will run in
>> parallel.
>>
>> I agree with Bryan Jeffrey.
>>
>> Regards
>> Pralabh Kumar
>>
>> On Tue, Jun 27, 2017 at 9:03 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
>>
>>> I think the Spark cluster receives two submits, A and B.
>>> FAIR is used to schedule A and B.
>>> I am not sure about this.
>>>
>>> ---Original---
>>> *From:* "Bryan Jeffrey" <bryan.jeff...@gmail.com>
>>> *Date:* 2017/6/27 08:55:42
>>> *To:* "satishl" <satish.la...@gmail.com>;
>>> *Cc:* "user" <user@spark.apache.org>;
>>> *Subject:* Re: Question about Parallel Stages in Spark
>>>
>>> Hello.
>>>
>>> The driver is running the individual operations in series, but each
>>> operation is parallelized internally. If you want them to run in
>>> parallel, you need to give the driver a mechanism to thread out the job
>>> scheduling:
>>>
>>> import org.apache.spark.rdd.RDD
>>> import scala.collection.parallel.mutable.ParArray
>>>
>>> val rdd1 = sc.parallelize(1 to 100000)
>>> val rdd2 = sc.parallelize(1 to 200000)
>>>
>>> val thingsToDo: ParArray[(RDD[Int], Int)] = Array(rdd1, rdd2).zipWithIndex.par
>>>
>>> thingsToDo.foreach { case (rdd, index) =>
>>>   for (i <- 1 to 10000)
>>>     logger.info(s"Index ${index} - ${rdd.sum()}")
>>> }
>>>
>>> This will run both operations in parallel.
>>>
>>> On Mon, Jun 26, 2017 at 8:10 PM, satishl <satish.la...@gmail.com> wrote:
>>>
>>>> For the below code, since rdd1 and rdd2 don't depend on each other, I
>>>> was expecting the "first" and "second" printlns to be interwoven.
>>>> However, the Spark job runs all "first" statements first and then all
>>>> "second" statements, in serial fashion. I have set
>>>> spark.scheduler.mode = FAIR. Obviously my understanding of parallel
>>>> stages is wrong. What am I missing?
>>>>
>>>> val rdd1 = sc.parallelize(1 to 1000000)
>>>> val rdd2 = sc.parallelize(1 to 1000000)
>>>>
>>>> for (i <- 1 to 100)
>>>>   println("first: " + rdd1.sum())
>>>> for (i <- 1 to 100)
>>>>   println("second: " + rdd2.sum())
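PS: For anyone finding this thread later, an alternative to the ParArray approach quoted above is to submit each action from its own thread with Futures. A minimal sketch, assuming FAIR scheduling is enabled and enough task slots are free:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val rdd1 = sc.parallelize(1 to 1000000)
val rdd2 = sc.parallelize(1 to 1000000)

// Each Future calls its actions from its own thread, so the driver can
// submit the "first" and "second" jobs concurrently.
val first = Future { for (i <- 1 to 100) println("first: " + rdd1.sum()) }
val second = Future { for (i <- 1 to 100) println("second: " + rdd2.sum()) }

// Block until both sequences of jobs have finished.
Await.result(Future.sequence(Seq(first, second)), Duration.Inf)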