Satish, 



Are these two separate applications submitted to the Yarn scheduler? If so, you would expect the original case to run in parallel.




However, if this is one application, your submission to Yarn only guarantees that the application will contend fairly with other applications for resources. The individual output operations within your application (jobs) will be scheduled by the driver (running on a single AM). This means that whatever driver options and code you've set will determine how the application's jobs run; the Yarn scheduler has no impact beyond allocating cores, memory, etc. between applications.
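
For illustration only - a minimal sketch (not from this thread) of submitting two independent jobs from separate driver threads inside a single application. It assumes a SparkSession named spark already exists, and the RDD contents are placeholders:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Each Future runs on its own driver thread, so both jobs are submitted to
// the scheduler at the same time and can overlap if task slots are free.
val rddA = spark.sparkContext.parallelize(1 to 1000000)
val rddB = spark.sparkContext.parallelize(1 to 1000000)

val jobA = Future { rddA.sum() }
val jobB = Future { rddB.sum() }

println(Await.result(Future.sequence(Seq(jobA, jobB)), Duration.Inf))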








On Tue, Jun 27, 2017 at 2:33 AM -0400, "satish lalam" <satish.la...@gmail.com> 
wrote:










Thanks, all. To reiterate - stages inside a job can run in parallel as long as (a) there is no sequential dependency between them and (b) the job has sufficient resources. However, my code was launching 2 jobs, and they are sequential, as you rightly pointed out. The issue I was trying to highlight with that piece of pseudocode was that I am observing a job with 2 stages which don't depend on each other (they are reading data from 2 separate tables in a DB); both stages are scheduled and both get resources, but the 2nd stage does not really pick up until the 1st stage is complete. It might be due to the DB driver - I will post it to the right forum. Thanks.
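
For reference, a minimal sketch (not the original code, with parallelize standing in for the actual DB reads) of a single job whose two parent stages do not depend on each other. Each reduceByKey introduces a shuffle, so the job's DAG has two independent shuffle-map stages feeding a final join stage, and those two stages can be scheduled concurrently when task slots are free. It assumes a SparkContext named sc:

// Illustrative only: one action -> one job with two independent parent stages.
val left  = sc.parallelize(1 to 1000000).map(i => (i % 100, 1L)).reduceByKey(_ + _)
val right = sc.parallelize(1 to 1000000).map(i => (i % 100, 2L)).reduceByKey(_ + _)
left.join(right).count()  // single job; the two shuffle-map stages have no dependency on each other
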
On Mon, Jun 26, 2017 at 9:12 PM, Pralabh Kumar <pralabhku...@gmail.com> wrote:
I think my words were also misunderstood. My point is that they will not be submitted together, since they are part of one thread.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("practice")
  .config("spark.scheduler.mode", "FAIR")
  .enableHiveSupport()
  .getOrCreate()
val sc = spark.sparkContext

// Both collect() calls run on the same driver thread, so the second job
// is only submitted after the first one finishes.
sc.parallelize(List(1.to(10000000))).map(s => Thread.sleep(10000)).collect()
sc.parallelize(List(1.to(10000000))).map(s => Thread.sleep(10000)).collect()
Thread.sleep(10000000)
I ran this, and the submit times for the two jobs are different.
Please let me know if I am wrong.
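
As an illustrative variant of the snippet above (not part of the original message), Spark's async actions submit a job without blocking the calling thread, so both jobs reach the scheduler together - they still only overlap if free task slots are available. It assumes the same sc as above:

import scala.concurrent.Await
import scala.concurrent.duration.Duration

// countAsync() returns a FutureAction and submits the job immediately,
// so the second job is submitted before the first one finishes.
val job1 = sc.parallelize(1 to 8, 8).map { _ => Thread.sleep(10000); 1 }.countAsync()
val job2 = sc.parallelize(1 to 8, 8).map { _ => Thread.sleep(10000); 1 }.countAsync()
Await.result(job1, Duration.Inf)
Await.result(job2, Duration.Inf)
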
On Tue, Jun 27, 2017 at 9:17 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
My words caused a misunderstanding. Step 1: A is submitted to Spark. Step 2: B is submitted to Spark.
Spark gets two independent jobs. The FAIR scheduler is used to schedule A and B.
Jeffrey's code did not cause two submits.


---Original---
From: "Pralabh Kumar" <pralabhku...@gmail.com>
Date: 2017/6/27 12:09:27
To: "萝卜丝炒饭" <1427357...@qq.com>
Cc: "user" <user@spark.apache.org>; "satishl" <satish.la...@gmail.com>; "Bryan Jeffrey" <bryan.jeff...@gmail.com>
Subject: Re: Question about Parallel Stages in Spark
Hi
I don't think Spark will receive two submits. It will execute one submit and then move on to the next one. If the application is multithreaded and two threads call submit at the same time, then they will run in parallel, provided the scheduler is FAIR and task slots are available.
But in one thread, one submit will complete and then the other one will start. If there are independent stages in one job, then those will run in parallel.
I agree with Bryan Jeffrey.

Regards,
Pralabh Kumar
On Tue, Jun 27, 2017 at 9:03 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
I think the Spark cluster receives two submits, A and B. The FAIR scheduler is used to schedule A and B. I am not sure about this.
---Original---
From: "Bryan Jeffrey" <bryan.jeff...@gmail.com>
Date: 2017/6/27 08:55:42
To: "satishl" <satish.la...@gmail.com>
Cc: "user" <user@spark.apache.org>
Subject: Re: Question about Parallel Stages in Spark
Hello.
The driver is running the individual operations in series, but each operation is parallelized internally. If you want them to run in parallel, you need to give the driver a mechanism to thread out the job scheduling:
import org.apache.spark.rdd.RDD
import scala.collection.parallel.mutable.ParArray

val rdd1 = sc.parallelize(1 to 100000)
val rdd2 = sc.parallelize(1 to 200000)

// .par turns the array into a parallel collection, so foreach runs its body
// on multiple driver threads and the jobs are submitted concurrently.
val thingsToDo: ParArray[(RDD[Int], Int)] = Array(rdd1, rdd2).zipWithIndex.par

thingsToDo.foreach { case (rdd, index) =>
  for (i <- 1 to 10000)
    println(s"Index ${index} - ${rdd.sum()}")
}
This will run both operations in parallel.

On Mon, Jun 26, 2017 at 8:10 PM, satishl <satish.la...@gmail.com> wrote:
For the below code, since rdd1 and rdd2 don't depend on each other, I was expecting the "first" and "second" printlns to be interwoven. However, the Spark job runs all "first" statements first and then all "second" statements, in serial fashion. I have set spark.scheduler.mode = FAIR. Obviously my understanding of parallel stages is wrong. What am I missing?



    val rdd1 = sc.parallelize(1 to 1000000)
    val rdd2 = sc.parallelize(1 to 1000000)

    for (i <- (1 to 100))
      println("first: " + rdd1.sum())
    for (i <- (1 to 100))
      println("second: " + rdd2.sum())






