Github user gaborgsomogyi commented on the issue:
https://github.com/apache/spark/pull/20888
`SparkStatusTracker` states the following:
```
* These APIs intentionally provide very weak consistency semantics; consumers of these APIs should
* be prepared to handle empty / missing information. For example, a job's stage ids may be known
* but the status API may not have any information about the details of those stages, so
* `getStageInfo` could potentially return `None` for a valid stage id.
```
This is reflected in the additional logs. I've wrapped the `cancelStage`,
`DataFrameRangeSuite.stageToKill = DataFrameRangeSuite.INVALID_STAGE_ID`, and
`assert(sparkContext.statusTracker.getExecutorInfos.map(_.numRunningTasks()).sum
== 0)` statements with additional log entries and reproduced the issue. (The
number after the colon is the value of `stageToKill`.)
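The bracketing with log entries can be sketched with a small helper. This is illustrative only: `logged` and the names used below are hypothetical, not code from the actual patch, which simply added log lines before and after each statement.

```scala
// Hypothetical instrumentation helper: logs before and after a
// statement, printing the resulting value after the colon.
object LoggedWrapper {
  def logged[T](label: String)(body: => T): T = {
    println(s"BEFORE $label")          // entry logged before the statement
    val result = body                  // run the wrapped statement
    println(s"AFTER $label: $result")  // entry logged after, with the value
    result
  }

  def main(args: Array[String]): Unit = {
    // e.g. wrapping the reset of a stage-id variable, as in the suite
    var stageToKill = 88
    logged("RESET") { stageToKill = -1; stageToKill }
  }
}
```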
The log shows the following flow:
1. Stage 88 was successfully cancelled
```
07:17:50.862 Executor task launch worker for task 699 INFO Executor:
Running task 0.0 in stage 88.0 (TID 699)
07:17:50.865 spark-listener-group-shared INFO DataFrameRangeSuite: BEFORE
CANCELLED: 88
07:17:50.866 spark-listener-group-shared INFO DataFrameRangeSuite: AFTER
CANCELLED: 88
```
2. Waiting for the running task count on the executors to drop to zero
```
07:17:50.869 ScalaTest-main-running-DataFrameRangeSuite INFO
DataFrameRangeSuite: BEFORE NO TASKS
07:17:50.869 ScalaTest-main-running-DataFrameRangeSuite INFO
DataFrameRangeSuite: AFTER NO TASKS
```
3. Resetting the `stageToKill` to -1
```
07:17:50.869 ScalaTest-main-running-DataFrameRangeSuite INFO
DataFrameRangeSuite: BEFORE RESET: 88
07:17:50.869 ScalaTest-main-running-DataFrameRangeSuite INFO
DataFrameRangeSuite: AFTER RESET: -1
```
4. Only after that was the task killed by the executor thread
```
07:17:50.870 Executor task launch worker for task 699 INFO Executor:
Executor killed task 0.0 in stage 88.0 (TID 699), reason: Stage cancelled
07:17:50.870 task-result-getter-0 WARN TaskSetManager: Lost task 0.0 in
stage 88.0 (TID 699, localhost, executor driver): TaskKilled (Stage cancelled)
07:17:50.870 task-result-getter-0 INFO TaskSchedulerImpl: Removed TaskSet
88.0, whose tasks have all completed, from pool
```
In this situation the old task thread had a window to overwrite the
`stageToKill` variable after the reset. The new task thread had not yet run to
update `stageToKill` to 90, so the `onJobStart` hook saw the stale 88 value.
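The race described above can be sketched in isolation. This is a minimal, hypothetical reconstruction (the names mirror the suite, but the latch is only there to force the problematic interleaving deterministically): a stale task thread re-publishes its stage id after the main thread has reset the shared variable.

```scala
import java.util.concurrent.CountDownLatch

// Minimal sketch of the race: a @volatile variable shared between the
// main (test) thread and a not-yet-killed task thread.
object RaceSketch {
  final val INVALID_STAGE_ID = -1
  @volatile var stageToKill: Int = INVALID_STAGE_ID

  // Returns the value the onJobStart hook would observe for the next stage.
  def run(): Int = {
    stageToKill = 88                 // previous iteration set the stage to cancel
    val reset = new CountDownLatch(1)

    // Simulates the old executor task thread, which is killed only later:
    // it writes its stage id again, after the main thread's reset.
    val staleTask = new Thread(() => {
      reset.await()                  // forced to run after the reset below
      stageToKill = 88               // stale write from the cancelled stage
    })
    staleTask.start()

    // Main thread: stage cancelled, no running tasks observed, so reset.
    stageToKill = INVALID_STAGE_ID
    reset.countDown()
    staleTask.join()

    // The next job (stage 90) has not updated the variable yet, so the
    // hook reads the stale 88 instead of -1.
    stageToKill
  }

  def main(args: Array[String]): Unit =
    println(run())  // prints 88
}
```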