Neil Ramaswamy created SPARK-48997:
--------------------------------------
Summary: Maintenance thread pool error should not cause the entire
executor to crash
Key: SPARK-48997
URL: https://issues.apache.org/jira/browse/SPARK-48997
Project: Spark
Issue Type: Improvement
Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Neil Ramaswamy
Today, it's possible for an exception within a thread in the maintenance pool
to cause the entire executor to crash. Here's how:
# An error occurs in a maintenance pool thread
# It gets passed to the maintenance task thread, which `throw`s it
# That gets caught by `onError`, which `.stop()`s the maintenance thread pool
# If any of the maintenance pool threads are waiting on a lock, they will
receive an `InterruptedException` (this happens if they are verifying whether
their state store instance is active)
# This `InterruptedException` is not caught, because it is not `NonFatal`
# This uncaught exception bubbles all the way to the
`SparkUncaughtExceptionHandler`, causing the executor to exit
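
The key step in the sequence above is that a `NonFatal` handler does not match `InterruptedException`. The following is a minimal sketch of that behavior only, not Spark's actual maintenance code; `NonFatalDemo` and `runGuarded` are hypothetical names introduced for illustration.

```scala
import scala.util.control.NonFatal

object NonFatalDemo {
  // Hypothetical helper: mimics a task body guarded only by a NonFatal handler.
  def runGuarded(body: => Unit): String =
    try {
      body
      "completed"
    } catch {
      // NonFatal deliberately excludes InterruptedException (among others),
      // so an interrupt is not handled here and propagates to the caller.
      case NonFatal(e) => s"handled: ${e.getClass.getSimpleName}"
    }

  def main(args: Array[String]): Unit = {
    // An ordinary exception is matched by NonFatal and handled.
    println(runGuarded(throw new IllegalStateException("boom")))

    // An InterruptedException is not matched and escapes runGuarded.
    val escaped =
      try { runGuarded(throw new InterruptedException("interrupted")); false }
      catch { case _: InterruptedException => true }
    println(s"InterruptedException escaped: $escaped")
  }
}
```

In a real executor there is no outer `catch` like the one in `main`, so the escaped exception would reach the thread's uncaught exception handler.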
A better fix is to modify the maintenance thread pool to `unload` only the
providers that experience errors, rather than stopping the entire thread pool.
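
One way to picture the proposed behavior is the rough sketch below. It assumes a simplified model in which each provider's maintenance runs as its own task and a failure removes only that provider from the loaded set. `IsolatedMaintenance`, `Provider`, and `runMaintenance` are hypothetical names, not Spark's state store API, and the eventual fix may be structured differently.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.concurrent.TrieMap
import scala.util.control.NonFatal

object IsolatedMaintenance {
  // Hypothetical stand-in for a state store provider; not Spark's real interface.
  final case class Provider(id: String, failing: Boolean) {
    def doMaintenance(): Unit =
      if (failing) throw new RuntimeException(s"maintenance failed for $id")
  }

  // Runs maintenance for every provider and returns the ids still loaded afterwards.
  def runMaintenance(providers: Seq[Provider]): Set[String] = {
    val loaded = TrieMap(providers.map(p => p.id -> p): _*)
    val pool = Executors.newFixedThreadPool(2)
    try {
      val futures = providers.map { p =>
        pool.submit(new Runnable {
          def run(): Unit =
            try p.doMaintenance()
            catch {
              // Unload only the provider that failed; the pool keeps running
              // and no other maintenance thread is interrupted.
              case NonFatal(_) => loaded.remove(p.id)
            }
        })
      }
      // Wait for all maintenance tasks to finish.
      futures.foreach(_.get(10, TimeUnit.SECONDS))
    } finally {
      pool.shutdown()
    }
    loaded.keySet.toSet
  }

  def main(args: Array[String]): Unit = {
    val remaining = runMaintenance(
      Seq(Provider("a", failing = false), Provider("b", failing = true), Provider("c", failing = false)))
    println(remaining.toSeq.sorted.mkString(","))
  }
}
```

Because the error is handled inside each task, the pool is never stopped in response to a failure, so threads waiting on a lock are not interrupted.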