There are a few things going on here.

1. Spark is lazy, so nothing happens until a result is collected back to the 
driver or data is written to a sink. The one line you see in the UI is most 
likely just that trigger. Once it fires, all of the work required to produce 
that final result runs. If the final collect depends on 100 joins, for 
example, that one line on the final result would trigger 100 stages (see the 
first sketch after this list).

2. A stage is not a logical boundary; it's a physical boundary between 
shuffle exchanges. In practice, a new stage starts any time an operation 
needs a portion of the data from every task of the previous operation. For 
example, a global sort may require data from every task of the previous 
operation before it can complete. A job can have a theoretically unlimited 
number of stages, although there are some practical limits (see the second 
sketch below).
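
To make point 1 concrete, here is a minimal sketch (a toy local session; the 
dataset and names are made up, not from your job) of how a chain of lazy 
transformations builds up until a single count() executes all of it:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("lazy-count-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // None of these lines runs anything; they only build a logical plan.
    val base   = spark.range(0L, 1000000L).toDF("id")
    val other  = base.withColumnRenamed("id", "id2")
    val joined = base.join(other, $"id" === $"id2")   // will need a shuffle
    val agg    = joined.groupBy($"id" % 10).count()   // will need another

    // This single line is the action the UI attributes the job to.
    // Executing it runs the scan, the join, and the aggregation above,
    // so one line of code shows up on every resulting stage.
    agg.count()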
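
And for point 2, continuing in the same session: each wide transformation 
adds a shuffle exchange to the physical plan, and each exchange is a stage 
boundary. You can see the boundaries without running anything by looking for 
Exchange nodes in explain():

    // Two wide transformations, so two shuffle exchanges.
    val wide = spark.range(0L, 1000000L).toDF("id")
      .groupBy($"id" % 100).count()   // exchange 1: hash-partition by key
      .orderBy($"count".desc)         // exchange 2: range-partition for sort

    // Prints the physical plan; each Exchange node marks a stage
    // boundary, so the count below runs at least three stages: the
    // initial scan plus one per exchange.
    wide.explain()
    wide.count()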

> On Apr 21, 2022, at 9:09 AM, Joe <j...@net2020.org> wrote:
> 
> Hi,
> When looking at the application UI (in Amazon EMR) I'm seeing one job for
> my particular line of code, for example:
> 64 Running count at MySparkJob.scala:540
> 
> When I click into the job and go to stages I can see over 100 stages
> running the same line of code (stages are active, pending, or
> completed):
> 190 Pending count at MySparkJob.scala:540
> ...
> 162 Active count at MySparkJob.scala:540
> ...
> 108 Completed count at MySparkJob.scala:540
> ...
> 
> I'm not sure what that means. I thought that a stage was a logical
> operation boundary, that you could have only one stage in a job (unless
> you executed the same dataset+action many times on purpose), and that
> tasks were the ones replicated across partitions. But here I'm seeing
> many stages running, each with the same line of code?
> 
> I don't have a situation where my code is re-processing the same set of
> data many times; all intermediate sets are persisted.
> I'm not sure if the EMR UI display is wrong or if Spark stages are not
> what I thought they were?
> Thanks,
> 
> Joe
> 

