Github user gaborgsomogyi commented on the issue:

    https://github.com/apache/spark/pull/20888
  
    `SparkStatusTracker` states the following:
    ```
     * These APIs intentionally provide very weak consistency semantics; consumers of these APIs should
     * be prepared to handle empty / missing information.  For example, a job's stage ids may be known
     * but the status API may not have any information about the details of those stages, so
     * `getStageInfo` could potentially return `None` for a valid stage id.
    ```
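    The quoted contract means a caller must treat the `Option` returned by `getStageInfo` as possibly `None` even for a stage id it just observed. A minimal standalone sketch of that defensive handling (the `StatusTracker` and `StageInfo` types here are simplified stand-ins for Spark's, not the real API):

    ```scala
    // Simplified model of the weak-consistency contract: a lookup may return
    // None even for an id the caller believes is valid.
    final case class StageInfo(stageId: Int, numTasks: Int, numCompletedTasks: Int)

    class StatusTracker(known: Map[Int, StageInfo]) {
      def getStageInfo(stageId: Int): Option[StageInfo] = known.get(stageId)
    }

    def describe(tracker: StatusTracker, stageId: Int): String =
      tracker.getStageInfo(stageId) match {
        case Some(i) => s"stage $stageId: ${i.numCompletedTasks}/${i.numTasks} tasks done"
        case None    => s"stage $stageId: no information yet" // valid id, data not yet published
      }
    ```

    A test asserting on such a lookup therefore races against when the status API publishes its data, which is exactly the kind of timing the log below exposes.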
    This is reflected in the logs below. I wrapped the `cancelStage`,
    `DataFrameRangeSuite.stageToKill = DataFrameRangeSuite.INVALID_STAGE_ID` and
    `assert(sparkContext.statusTracker.getExecutorInfos.map(_.numRunningTasks()).sum == 0)`
    statements with additional log entries and reproduced the issue (the number
    after the colon is `stageToKill`).
    
    The log shows the following flow:
    
    1. Stage 88 was successfully cancelled
    ```
    07:17:50.862 Executor task launch worker for task 699 INFO Executor: Running task 0.0 in stage 88.0 (TID 699)
    07:17:50.865 spark-listener-group-shared INFO DataFrameRangeSuite: BEFORE CANCELLED: 88
    07:17:50.866 spark-listener-group-shared INFO DataFrameRangeSuite: AFTER CANCELLED: 88
    ```
    2. Waiting for the task count on the executors to drop to zero
    ```
    07:17:50.869 ScalaTest-main-running-DataFrameRangeSuite INFO DataFrameRangeSuite: BEFORE NO TASKS
    07:17:50.869 ScalaTest-main-running-DataFrameRangeSuite INFO DataFrameRangeSuite: AFTER NO TASKS
    ```
    3. Resetting the `stageToKill` to -1
    ```
    07:17:50.869 ScalaTest-main-running-DataFrameRangeSuite INFO DataFrameRangeSuite: BEFORE RESET: 88
    07:17:50.869 ScalaTest-main-running-DataFrameRangeSuite INFO DataFrameRangeSuite: AFTER RESET: -1
    ```
    4. Only after that was the task on the executor thread killed
    ```
    07:17:50.870 Executor task launch worker for task 699 INFO Executor: Executor killed task 0.0 in stage 88.0 (TID 699), reason: Stage cancelled
    07:17:50.870 task-result-getter-0 WARN TaskSetManager: Lost task 0.0 in stage 88.0 (TID 699, localhost, executor driver): TaskKilled (Stage cancelled)
    07:17:50.870 task-result-getter-0 INFO TaskSchedulerImpl: Removed TaskSet 88.0, whose tasks have all completed, from pool
    ```
    In this situation the old task thread could overwrite the `stageToKill`
    variable after the reset. The new task thread was not yet running, so it had
    not updated `stageToKill` to 90, and the `onJobStart` hook saw the old value 88.
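
    The interleaving can be replayed as a standalone model (this is not the suite's actual code; `stageToKill` here just mirrors the shared variable, and the three writes model the two threads' actions in the order the log shows):

    ```scala
    object StageToKillRace {
      @volatile var stageToKill: Int = -1

      def replayLoggedInterleaving(): Int = {
        stageToKill = 88 // task thread of the first job publishes its stage id
        stageToKill = -1 // test thread resets after cancellation (step 3)
        stageToKill = 88 // late write from the first job's thread lands after the reset
        // onJobStart for the next job (stage 90) now reads a stale id:
        stageToKill
      }
    }
    ```

    `StageToKillRace.replayLoggedInterleaving()` returns 88 rather than -1: last write wins, and `@volatile` alone cannot order the reset against the late write.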


