Thanks all. To reiterate: stages inside a job can run in parallel as long as (a) there is no sequential dependency between them, and (b) the job has sufficient resources. However, my code was launching 2 jobs, and they run sequentially, as you rightly pointed out. The issue I was trying to highlight with that piece of pseudocode is this: I am observing a job with 2 stages that don't depend on each other (each reads data from a separate table in the db); both stages are scheduled and both get resources, yet the 2nd stage does not actually start until the 1st stage is complete. It might be due to the db driver; I will post it to the right forum. A sketch of the job shape is below. Thanks.
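For concreteness, here is a minimal sketch of the shape of the job I am describing (the JDBC url and table names are made-up details, and it assumes the two tables share a compatible schema; only the structure matters). A single action over the union of two independent reads produces one job whose two scan stages have no dependency on each other:

// Hypothetical JDBC url and table names; only the job/stage structure matters.
val df1 = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "table_a")
  .load()
val df2 = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "table_b")
  .load()

// One action over the union => one job with two independent read stages.
// In my runs, the second scan still does not start until the first finishes.
println(df1.union(df2).count())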
On Mon, Jun 26, 2017 at 9:12 PM, Pralabh Kumar <pralabhku...@gmail.com> wrote:

> I think my words were also misunderstood. My point is that they will not
> be submitted together, since they are part of one thread.
>
> val spark = SparkSession.builder()
>   .appName("practice")
>   .config("spark.scheduler.mode", "FAIR")
>   .enableHiveSupport().getOrCreate()
> val sc = spark.sparkContext
> sc.parallelize(List(1.to(10000000))).map(s => Thread.sleep(10000)).collect()
> sc.parallelize(List(1.to(10000000))).map(s => Thread.sleep(10000)).collect()
> Thread.sleep(10000000)
>
> I ran this, and the submit times of the two jobs are different.
>
> Please let me know if I am wrong.
>
> On Tue, Jun 27, 2017 at 9:17 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
>
>> My words caused a misunderstanding.
>> Step 1: A is submitted to Spark.
>> Step 2: B is submitted to Spark.
>>
>> Spark gets two independent jobs. FAIR is used to schedule A and B.
>>
>> Jeffrey's code did not cause two submits.
>>
>> ---Original---
>> *From:* "Pralabh Kumar" <pralabhku...@gmail.com>
>> *Date:* 2017/6/27 12:09:27
>> *To:* "萝卜丝炒饭" <1427357...@qq.com>;
>> *Cc:* "user" <user@spark.apache.org>; "satishl" <satish.la...@gmail.com>; "Bryan Jeffrey" <bryan.jeff...@gmail.com>;
>> *Subject:* Re: Question about Parallel Stages in Spark
>>
>> Hi
>>
>> I don't think Spark will receive two submits. It will execute one submit
>> and then move to the next one. If the application is multithreaded and
>> two threads submit jobs at the same time, then the jobs will run in
>> parallel, provided the scheduler is FAIR and task slots are available.
>>
>> But in one thread, one submit will complete and then the next one will
>> start. If there are independent stages in one job, those will run in
>> parallel.
>>
>> I agree with Bryan Jeffrey.
>>
>> Regards
>> Pralabh Kumar
>>
>> On Tue, Jun 27, 2017 at 9:03 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
>>
>>> I think the Spark cluster receives two submits, A and B.
>>> FAIR is used to schedule A and B.
>>> I am not sure about this.
>>>
>>> ---Original---
>>> *From:* "Bryan Jeffrey" <bryan.jeff...@gmail.com>
>>> *Date:* 2017/6/27 08:55:42
>>> *To:* "satishl" <satish.la...@gmail.com>;
>>> *Cc:* "user" <user@spark.apache.org>;
>>> *Subject:* Re: Question about Parallel Stages in Spark
>>>
>>> Hello.
>>>
>>> The driver is running the individual operations in series, but each
>>> operation is parallelized internally. If you want them to run in
>>> parallel, you need to give the driver a mechanism to thread out the job
>>> scheduling:
>>>
>>> import org.apache.spark.rdd.RDD
>>> import scala.collection.parallel.mutable.ParArray
>>>
>>> val rdd1 = sc.parallelize(1 to 100000)
>>> val rdd2 = sc.parallelize(1 to 200000)
>>>
>>> val thingsToDo: ParArray[(RDD[Int], Int)] = Array(rdd1, rdd2).zipWithIndex.par
>>>
>>> thingsToDo.foreach { case (rdd, index) =>
>>>   for (i <- 1 to 10000)
>>>     logger.info(s"Index ${index} - ${rdd.sum()}")
>>> }
>>>
>>> This will run both operations in parallel.
>>>
>>> On Mon, Jun 26, 2017 at 8:10 PM, satishl <satish.la...@gmail.com> wrote:
>>>
>>>> For the below code, since rdd1 and rdd2 don't depend on each other, I
>>>> was expecting the "first" and "second" printlns to be interwoven.
>>>> However, the Spark job runs all "first" statements first and then all
>>>> "second" statements, in serial fashion. I have set
>>>> spark.scheduler.mode = FAIR. Obviously my understanding of parallel
>>>> stages is wrong. What am I missing?
>>>>
>>>> val rdd1 = sc.parallelize(1 to 1000000)
>>>> val rdd2 = sc.parallelize(1 to 1000000)
>>>>
>>>> for (i <- 1 to 100)
>>>>   println("first: " + rdd1.sum())
>>>> for (i <- 1 to 100)
>>>>   println("second: " + rdd2.sum())
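PS: For anyone finding this thread later, an alternative to the ParArray approach quoted above is to submit each action from its own thread with Futures. A minimal sketch, assuming FAIR scheduling is enabled and enough task slots are free:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val rdd1 = sc.parallelize(1 to 1000000)
val rdd2 = sc.parallelize(1 to 1000000)

// Each Future calls its actions from its own thread, so the driver can
// submit the "first" and "second" jobs concurrently.
val first = Future { for (i <- 1 to 100) println("first: " + rdd1.sum()) }
val second = Future { for (i <- 1 to 100) println("second: " + rdd2.sum()) }

// Block until both sequences of jobs have finished.
Await.result(Future.sequence(Seq(first, second)), Duration.Inf)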