There are a few things going on here.

1. Spark is lazy: nothing happens until a result is collected back to the driver or data is written to a sink. The one line you see in the UI is most likely just that trigger. Once triggered, all of the work required to produce that final result runs. If the final collect depends on 100 joins, for example, that one line on the final result would trigger 100 stages, and the UI attributes all of them to it.
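As a rough sketch of how a single action line can own many stages (this assumes a Spark runtime and an existing SparkSession named `spark`; the column names are made up for illustration):

```scala
import org.apache.spark.sql.functions._

// Build a chain of transformations. None of these lines does any work yet.
val df      = spark.range(1000000L).withColumn("key", col("id") % 100)
val grouped = df.groupBy("key").count()          // needs a shuffle -> stage boundary
val sorted  = grouped.orderBy(desc("count"))     // global sort -> another shuffle -> another stage

// Only this line appears in the UI as the job, but executing it runs
// every stage above it.
sorted.count()
```

The UI would report the job (and all its stages) against the `sorted.count()` line, even though the shuffles come from the earlier `groupBy` and `orderBy`.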
2. A stage is not a logical boundary; it's a physical boundary between shuffle exchanges. In practice, a new stage starts any time an operation requires a portion of the data from every partition of the previous operation. For example, a global sort may need data from every task of the previous operation before it can complete. A job may therefore have a theoretically unlimited number of stages, although there are some practical limits.

> On Apr 21, 2022, at 9:09 AM, Joe <j...@net2020.org> wrote:
>
> Hi,
> When looking at application UI (in Amazon EMR) I'm seeing one job for
> my particular line of code, for example:
> 64 Running count at MySparkJob.scala:540
>
> When I click into the job and go to stages I can see over a 100 stages
> running the same line of code (stages are active, pending or
> completed):
> 190 Pending count at MySparkJob.scala:540
> ...
> 162 Active count at MySparkJob.scala:540
> ...
> 108 Completed count at MySparkJob.scala:540
> ...
>
> I'm not sure what that means, I thought that stage was a logical
> operation boundary and you could have only one stage in the job (unless
> you executed the same dataset+action many times on purpose) and tasks
> were the ones that were replicated across partitions. But here I'm
> seeing many stages running, each with the same line of code?
>
> I don't have a situation where my code is re-processing the same set of
> data many times, all intermediate sets are persisted.
> I'm not sure if EMR UI display is wrong or if spark stages are not what
> I thought they were?
> Thanks,
>
> Joe
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org