Thanks Bryan. This is one Spark application with one job. The job has 3 stages: the first 2 are basic reads from Cassandra tables, and the 3rd is a join between the two. I was expecting the first 2 stages to run in parallel; however, they run serially. The job has enough resources.
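For reference, one way to force the two reads to overlap is to trigger each read as its own job from its own thread, materialize it, and only then join. Below is a minimal sketch: it assumes `spark` is the active SparkSession and that the spark-cassandra-connector DataFrame source is on the classpath; the keyspace ("ks"), table names, and join key ("id") are placeholders.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Placeholder keyspace/table names; assumes the spark-cassandra-connector
// DataFrame source ("org.apache.spark.sql.cassandra") is available and that
// `spark` is the existing SparkSession.
def readTable(table: String) = Future {
  val df = spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "ks", "table" -> table))
    .load()
    .cache()
  df.count()  // action on this thread: the scan runs as its own job
  df
}

// Both scans are submitted from separate threads, so their jobs can be
// scheduled concurrently if executor slots are free.
val Seq(dfA, dfB) = Await.result(
  Future.sequence(Seq(readTable("table_a"), readTable("table_b"))),
  Duration.Inf)

val joined = dfA.join(dfB, Seq("id"))  // "id" is a placeholder join key

Whether the two scan jobs actually overlap still depends on free executor slots; the join afterwards reads from the cached data.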
On Tue, Jun 27, 2017 at 4:03 AM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote:

> Satish,
>
> Is this two separate applications submitted to the Yarn scheduler? If so,
> then you would expect to see the original case run in parallel.
>
> However, if this is one application, your submission to Yarn only
> guarantees that this application will fairly contend for resources with
> other applications. The internal operations within your application
> (jobs) are scheduled by the driver (running on a single AM). This means
> that whatever driver options and code you've set will affect the
> application, but the Yarn scheduler will not (beyond allocating cores,
> memory, etc. between applications).
>
> On Tue, Jun 27, 2017 at 2:33 AM -0400, "satish lalam" <satish.la...@gmail.com> wrote:
>
> Thanks All. To reiterate - stages inside a job can run in parallel as
> long as (a) there is no sequential dependency, and (b) the job has
> sufficient resources.
> However, my code was launching 2 jobs, and they are sequential, as you
> rightly pointed out.
> The issue I was trying to highlight with that piece of pseudocode was
> that I am observing a job with 2 stages which don't depend on each other
> (they both read data from 2 separate tables in the db); both are
> scheduled and both stages get resources - but the 2nd stage does not
> pick up until the 1st stage is complete. It might be due to the db
> driver - I will post it to the right forum. Thanks.
>
> On Mon, Jun 26, 2017 at 9:12 PM, Pralabh Kumar <pralabhku...@gmail.com> wrote:
>
> I think my words were also misunderstood. My point is that they will not
> be submitted together, since they are part of one thread.
>
> val spark = SparkSession.builder()
>   .appName("practice")
>   .config("spark.scheduler.mode", "FAIR")
>   .enableHiveSupport().getOrCreate()
> val sc = spark.sparkContext
> sc.parallelize(List(1.to(10000000))).map(s => Thread.sleep(10000)).collect()
> sc.parallelize(List(1.to(10000000))).map(s => Thread.sleep(10000)).collect()
> Thread.sleep(10000000)
>
> I ran this, and the submit times are different for the two jobs.
>
> Please let me know if I am wrong.
>
> On Tue, Jun 27, 2017 at 9:17 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
>
> My words caused a misunderstanding.
> Step 1: A is submitted to Spark.
> Step 2: B is submitted to Spark.
>
> Spark gets two independent jobs. FAIR is used to schedule A and B.
>
> Jeffrey's code does not cause two submits.
>
> ---Original---
> From: "Pralabh Kumar" <pralabhku...@gmail.com>
> Date: 2017/6/27 12:09:27
> To: "萝卜丝炒饭" <1427357...@qq.com>
> Cc: "user" <user@spark.apache.org>; "satishl" <satish.la...@gmail.com>; "Bryan Jeffrey" <bryan.jeff...@gmail.com>
> Subject: Re: Question about Parallel Stages in Spark
>
> Hi,
>
> I don't think spark-submit will receive two submits. It will execute one
> submit and then move on to the next one. If the application is
> multithreaded, and two threads are submitting jobs at the same time, then
> they will run in parallel, provided the scheduler is FAIR and task slots
> are available.
>
> But in one thread, one submit will complete and then the next one will
> start. If there are independent stages in one job, then those will run in
> parallel.
>
> I agree with Bryan Jeffrey.
> Regards,
> Pralabh Kumar
>
> On Tue, Jun 27, 2017 at 9:03 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
>
> I think the Spark cluster receives two submits, A and B.
> FAIR is used to schedule A and B.
> I am not sure about this.
>
> ---Original---
> From: "Bryan Jeffrey" <bryan.jeff...@gmail.com>
> Date: 2017/6/27 08:55:42
> To: "satishl" <satish.la...@gmail.com>
> Cc: "user" <user@spark.apache.org>
> Subject: Re: Question about Parallel Stages in Spark
>
> Hello.
>
> The driver is running the individual operations in series, but each
> operation is parallelized internally. If you want them run in parallel,
> you need to give the driver a mechanism to thread the job scheduling out:
>
> import org.apache.spark.rdd.RDD
> import scala.collection.parallel.mutable.ParArray
>
> val rdd1 = sc.parallelize(1 to 100000)
> val rdd2 = sc.parallelize(1 to 200000)
>
> val thingsToDo: ParArray[(RDD[Int], Int)] = Array(rdd1, rdd2).zipWithIndex.par
>
> thingsToDo.foreach { case (rdd, index) =>
>   for (i <- 1 to 10000)
>     logger.info(s"Index ${index} - ${rdd.sum()}")
> }
>
> This will run both operations in parallel.
>
> On Mon, Jun 26, 2017 at 8:10 PM, satishl <satish.la...@gmail.com> wrote:
>
> For the below code, since rdd1 and rdd2 don't depend on each other, I was
> expecting the "first" and "second" printlns to be interwoven. However,
> the Spark job runs all "first" statements first and then all "second"
> statements, in serial fashion. I have set spark.scheduler.mode = FAIR.
> Obviously my understanding of parallel stages is wrong. What am I
> missing?
>
> val rdd1 = sc.parallelize(1 to 1000000)
> val rdd2 = sc.parallelize(1 to 1000000)
>
> for (i <- 1 to 100)
>   println("first: " + rdd1.sum())
> for (i <- 1 to 100)
>   println("second: " + rdd2.sum())
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Question-about-Parallel-Stages-in-Spark-tp28793.html
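For completeness, a minimal sketch of the multithreaded variant discussed above, applied to this example: each loop is driven from its own thread and assigned to its own FAIR pool via sc.setLocalProperty. The pool names are arbitrary, and this assumes spark.scheduler.mode=FAIR is set on the context (pools not declared in a fairscheduler.xml simply get default settings).

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val rdd1 = sc.parallelize(1 to 1000000)
val rdd2 = sc.parallelize(1 to 1000000)

// setLocalProperty is per-thread, so every job submitted inside a Future
// lands in that thread's pool and the two streams of sum() jobs can interleave.
val first = Future {
  sc.setLocalProperty("spark.scheduler.pool", "poolA")  // arbitrary pool name
  for (i <- 1 to 100) println("first: " + rdd1.sum())
}
val second = Future {
  sc.setLocalProperty("spark.scheduler.pool", "poolB")  // arbitrary pool name
  for (i <- 1 to 100) println("second: " + rdd2.sum())
}

Await.result(Future.sequence(Seq(first, second)), Duration.Inf)

Without the separate threads, FAIR mode alone does not change the behavior: the driver still submits the jobs one after another from a single thread, which matches the serial output described in the original post.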