Thanks Bryan. This is one Spark application with one job. The job has 3 stages: the first 2 are basic reads from Cassandra tables, and the 3rd is a join between the two. I was expecting the first 2 stages to run in parallel; however, they run serially. The job has enough resources.
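For reference, one way to force the two reads to overlap is to trigger each read as its own job from its own thread, materialize it, and only then join. Below is a minimal sketch: it assumes `spark` is the active SparkSession and that the spark-cassandra-connector DataFrame source is on the classpath; the keyspace ("ks"), table names, and join key ("id") are placeholders.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Placeholder keyspace/table names; assumes the spark-cassandra-connector
// DataFrame source ("org.apache.spark.sql.cassandra") is available and that
// `spark` is the existing SparkSession.
def readTable(table: String) = Future {
  val df = spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "ks", "table" -> table))
    .load()
    .cache()
  df.count()  // action on this thread: the scan runs as its own job
  df
}

// Both scans are submitted from separate threads, so their jobs can be
// scheduled concurrently if executor slots are free.
val Seq(dfA, dfB) = Await.result(
  Future.sequence(Seq(readTable("table_a"), readTable("table_b"))),
  Duration.Inf)

val joined = dfA.join(dfB, Seq("id"))  // "id" is a placeholder join key

Whether the two scan jobs actually overlap still depends on free executor slots; the join afterwards reads from the cached data.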
On Tue, Jun 27, 2017 at 4:03 AM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote:

> Satish,
>
> Is this two separate applications submitted to the Yarn scheduler? If so,
> then you would expect to see the original case run in parallel.
>
> However, if this is one application, your submission to Yarn only
> guarantees that this application will fairly contend for resources with
> other applications. The internal operations within your application
> (jobs) are scheduled by the driver (running on a single AM). This means
> that whatever driver options and code you've set will affect the
> application, but the Yarn scheduler will not (beyond allocating cores,
> memory, etc. between applications).
>
> On Tue, Jun 27, 2017 at 2:33 AM -0400, "satish lalam" <satish.la...@gmail.com> wrote:
>
> Thanks All. To reiterate - stages inside a job can run in parallel as
> long as (a) there is no sequential dependency, and (b) the job has
> sufficient resources.
> However, my code was launching 2 jobs, and they are sequential, as you
> rightly pointed out.
> The issue I was trying to highlight with that piece of pseudocode was
> that I am observing a job with 2 stages which don't depend on each other
> (they both read data from 2 separate tables in the db); both are
> scheduled and both stages get resources - but the 2nd stage does not
> pick up until the 1st stage is complete. It might be due to the db
> driver - I will post it to the right forum. Thanks.
>
> On Mon, Jun 26, 2017 at 9:12 PM, Pralabh Kumar <pralabhku...@gmail.com> wrote:
>
> I think my words were also misunderstood. My point is that they will not
> be submitted together, since they are part of one thread.
>
> val spark = SparkSession.builder()
>   .appName("practice")
>   .config("spark.scheduler.mode", "FAIR")
>   .enableHiveSupport().getOrCreate()
> val sc = spark.sparkContext
> sc.parallelize(List(1.to(10000000))).map(s => Thread.sleep(10000)).collect()
> sc.parallelize(List(1.to(10000000))).map(s => Thread.sleep(10000)).collect()
> Thread.sleep(10000000)
>
> I ran this, and the submit times are different for the two jobs.
>
> Please let me know if I am wrong.
>
> On Tue, Jun 27, 2017 at 9:17 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
>
> My words caused a misunderstanding.
> Step 1: A is submitted to Spark.
> Step 2: B is submitted to Spark.
>
> Spark gets two independent jobs. FAIR is used to schedule A and B.
>
> Jeffrey's code does not cause two submits.
>
> ---Original---
> From: "Pralabh Kumar" <pralabhku...@gmail.com>
> Date: 2017/6/27 12:09:27
> To: "萝卜丝炒饭" <1427357...@qq.com>
> Cc: "user" <user@spark.apache.org>; "satishl" <satish.la...@gmail.com>; "Bryan Jeffrey" <bryan.jeff...@gmail.com>
> Subject: Re: Question about Parallel Stages in Spark
>
> Hi,
>
> I don't think spark-submit will receive two submits. It will execute one
> submit and then move on to the next one. If the application is
> multithreaded, and two threads are submitting jobs at the same time, then
> they will run in parallel, provided the scheduler is FAIR and task slots
> are available.
>
> But in one thread, one submit will complete and then the next one will
> start. If there are independent stages in one job, then those will run in
> parallel.
>
> I agree with Bryan Jeffrey.
> Regards,
> Pralabh Kumar
>
> On Tue, Jun 27, 2017 at 9:03 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
>
> I think the Spark cluster receives two submits, A and B.
> FAIR is used to schedule A and B.
> I am not sure about this.
>
> ---Original---
> From: "Bryan Jeffrey" <bryan.jeff...@gmail.com>
> Date: 2017/6/27 08:55:42
> To: "satishl" <satish.la...@gmail.com>
> Cc: "user" <user@spark.apache.org>
> Subject: Re: Question about Parallel Stages in Spark
>
> Hello.
>
> The driver is running the individual operations in series, but each
> operation is parallelized internally. If you want them run in parallel,
> you need to give the driver a mechanism to thread the job scheduling out:
>
> import org.apache.spark.rdd.RDD
> import scala.collection.parallel.mutable.ParArray
>
> val rdd1 = sc.parallelize(1 to 100000)
> val rdd2 = sc.parallelize(1 to 200000)
>
> val thingsToDo: ParArray[(RDD[Int], Int)] = Array(rdd1, rdd2).zipWithIndex.par
>
> thingsToDo.foreach { case (rdd, index) =>
>   for (i <- 1 to 10000)
>     logger.info(s"Index ${index} - ${rdd.sum()}")
> }
>
> This will run both operations in parallel.
>
> On Mon, Jun 26, 2017 at 8:10 PM, satishl <satish.la...@gmail.com> wrote:
>
> For the below code, since rdd1 and rdd2 don't depend on each other, I was
> expecting the "first" and "second" printlns to be interwoven. However,
> the Spark job runs all "first" statements first and then all "second"
> statements, in serial fashion. I have set spark.scheduler.mode = FAIR.
> Obviously my understanding of parallel stages is wrong. What am I
> missing?
>
> val rdd1 = sc.parallelize(1 to 1000000)
> val rdd2 = sc.parallelize(1 to 1000000)
>
> for (i <- 1 to 100)
>   println("first: " + rdd1.sum())
> for (i <- 1 to 100)
>   println("second: " + rdd2.sum())
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Question-about-Parallel-Stages-in-Spark-tp28793.html
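For completeness, a minimal sketch of the multithreaded variant discussed above, applied to this example: each loop is driven from its own thread and assigned to its own FAIR pool via sc.setLocalProperty. The pool names are arbitrary, and this assumes spark.scheduler.mode=FAIR is set on the context (pools not declared in a fairscheduler.xml simply get default settings).

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val rdd1 = sc.parallelize(1 to 1000000)
val rdd2 = sc.parallelize(1 to 1000000)

// setLocalProperty is per-thread, so every job submitted inside a Future
// lands in that thread's pool and the two streams of sum() jobs can interleave.
val first = Future {
  sc.setLocalProperty("spark.scheduler.pool", "poolA")  // arbitrary pool name
  for (i <- 1 to 100) println("first: " + rdd1.sum())
}
val second = Future {
  sc.setLocalProperty("spark.scheduler.pool", "poolB")  // arbitrary pool name
  for (i <- 1 to 100) println("second: " + rdd2.sum())
}

Await.result(Future.sequence(Seq(first, second)), Duration.Inf)

Without the separate threads, FAIR mode alone does not change the behavior: the driver still submits the jobs one after another from a single thread, which matches the serial output described in the original post.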