Steven Tran created SPARK-54217:
-----------------------------------
Summary: PythonRunner does not synchronize Python worker
kill/release decisions
Key: SPARK-54217
URL: https://issues.apache.org/jira/browse/SPARK-54217
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 4.2.0
Reporter: Steven Tran
In daemon mode, `PythonWorkerFactory` facilitates worker reuse where possible:
as long as a worker successfully completed its last-assigned task, it is
released (via `releasePythonWorker`) into the idle queue, to be picked up by
the next `createPythonWorker` call.
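For context, the reuse path looks roughly like the following minimal sketch
(`FactorySketch` and `WorkerStub` are illustrative stand-ins, not the real
`PythonWorkerFactory` API):
{code:scala}
import scala.collection.mutable

// Simplified stand-in for a daemon-mode Python worker.
class WorkerStub {
  @volatile var alive = true
  def kill(): Unit = { alive = false } // stands in for destroying the process
}

// Simplified stand-in for the daemon-mode reuse logic.
class FactorySketch {
  private val idleWorkers = mutable.Queue.empty[WorkerStub]

  // createPythonWorker: prefer a previously released (idle) worker.
  def create(): WorkerStub = synchronized {
    if (idleWorkers.nonEmpty) idleWorkers.dequeue() else new WorkerStub
  }

  // releasePythonWorker: the worker finished its task cleanly; park it for reuse.
  def release(worker: WorkerStub): Unit = synchronized {
    idleWorkers.enqueue(worker)
  }
}
{code}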
However, there is a race condition that can result in a released worker in
the `PythonWorkerFactory` idle queue getting killed. That is, `PythonRunner`
lacks synchronization between:
* the main task thread's decision to release its associated Python worker
(when work is complete), and
* the `MonitorThread`'s decision to kill the associated Python worker (when
requested by the executor, e.g. under speculative execution when another
attempt succeeds); the two uncoordinated paths are sketched after this list.
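To make the race concrete, here is a self-contained sketch of the two
uncoordinated paths (all names are hypothetical stand-ins, not the real
`PythonRunner` internals):
{code:scala}
import scala.collection.mutable

object RaceSketch {
  final class WorkerStub {
    @volatile var killed = false
    def kill(): Unit = { killed = true }
  }

  private val idleWorkers = mutable.Queue.empty[WorkerStub]

  def main(args: Array[String]): Unit = {
    val worker = new WorkerStub

    // Main task thread: saw END_OF_STREAM, hands the worker back for reuse.
    val taskThread = new Thread(() => {
      idleWorkers.synchronized { idleWorkers.enqueue(worker) }
    })

    // MonitorThread: the executor requested a task kill, so kill the worker.
    // Nothing here consults whether the worker was already released.
    val monitorThread = new Thread(() => worker.kill())

    taskThread.start(); monitorThread.start()
    taskThread.join(); monitorThread.join()

    // Depending on the interleaving, a killed worker can now sit in the
    // idle queue, waiting to be handed to the next createPythonWorker call.
    println(s"idle workers: ${idleWorkers.size}, worker killed: ${worker.killed}")
  }
}
{code}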
So, the following sequence of events is possible:
# `PythonRunner` is running
# The Python worker finishes its work and writes `END_OF_STREAM` to signal
back to `PythonRunner`'s main task thread that it is done
# `PythonRunner`'s main task thread receives this instruction and releases the
worker for reuse
# Executor decides to kill this task (e.g. speculative execution)
# `PythonRunner`'s `MonitorThread` receives this instruction and kills the
already-relinquished `PythonWorker`, which by then may already be sitting in
the factory's idle queue or handed out to another task; one possible guard is
sketched after this sequence.
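One possible shape of a guard (purely illustrative; not necessarily the fix
that should land for this ticket) is to funnel both decisions through a
single atomic flag so that exactly one of release/kill ever runs:
{code:scala}
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical guard: releaseFn/killFn stand in for the runner's actual
// release and kill paths.
final class GuardedDecision(releaseFn: () => Unit, killFn: () => Unit) {
  // Flips to true exactly once, for whichever decision arrives first.
  private val decided = new AtomicBoolean(false)

  // Called by the main task thread after END_OF_STREAM (step 3 above).
  def releaseWorker(): Unit =
    if (decided.compareAndSet(false, true)) releaseFn()

  // Called by the MonitorThread on a kill request (step 5 above);
  // becomes a no-op if the worker was already handed back for reuse.
  def killWorker(): Unit =
    if (decided.compareAndSet(false, true)) killFn()
}
{code}
With such a guard, the `MonitorThread`'s kill in step 5 becomes a no-op once
the main task thread has released the worker in step 3, so a released worker
can no longer be killed out from under the idle queue.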