Neil Ramaswamy created SPARK-48997:
--------------------------------------

             Summary: Maintenance thread pool error should not cause the entire 
executor to crash
                 Key: SPARK-48997
                 URL: https://issues.apache.org/jira/browse/SPARK-48997
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 4.0.0
            Reporter: Neil Ramaswamy


Today, it's possible for an exception within a thread in the maintenance pool 
to cause the entire executor to crash. Here's how:
 # An error occurs in a maintenance pool thread
 # It gets passed to the maintenance task thread, which `throw`s it
 # That gets caught by `onError`, which `.stop()`s the maintenance thread pool
 # If any of the maintenance pool threads are waiting on a lock, they will 
receive an `InterruptedException` (this happens while they are verifying that 
their state store instance is active)
 # This `InterruptedException` is not `NonFatal`, so it is not caught
 # This uncaught exception bubbles all the way to the 
`SparkUncaughtExceptionHandler`, causing the executor to exit
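
To illustrate the `NonFatal` step above, here is a minimal standalone sketch 
(not Spark code). The object and method names are made up for illustration; 
only `scala.util.control.NonFatal` is a real library extractor. It shows that 
a `case NonFatal(e)` handler does not match an `InterruptedException`:

```scala
import scala.util.control.NonFatal

// Hypothetical demo object; not part of Spark.
object NonFatalDemo {
  // Returns true when a `case NonFatal(e)` handler would match `t`.
  def caughtByNonFatal(t: Throwable): Boolean = t match {
    case NonFatal(_) => true
    case _           => false
  }

  def main(args: Array[String]): Unit = {
    // An ordinary runtime error is matched by NonFatal.
    println(caughtByNonFatal(new RuntimeException("boom")))        // true
    // InterruptedException is deliberately excluded by NonFatal.
    println(caughtByNonFatal(new InterruptedException("stopped"))) // false
  }
}
```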

A better fix is to modify the maintenance thread pool so that it only `unload`s 
providers that experience errors, rather than stopping the entire thread pool.
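
A rough sketch of that idea, under stated assumptions: `Provider`, `loaded`, 
`unload` and `runMaintenance` below are hypothetical stand-ins, not the actual 
Spark classes or the eventual patch. Each maintenance task handles its own 
failure by unloading only the provider that failed, and the shared pool is 
left running:

```scala
import java.util.concurrent.{ExecutorService, Executors}
import scala.collection.concurrent.TrieMap
import scala.util.control.NonFatal

// Hypothetical sketch; names do not correspond to real Spark APIs.
object MaintenanceSketch {
  // Stand-in for a state store provider whose maintenance may fail.
  final case class Provider(id: String, failing: Boolean) {
    def doMaintenance(): Unit =
      if (failing) throw new IllegalStateException(s"maintenance failed for $id")
  }

  // Currently loaded providers, keyed by id.
  val loaded: TrieMap[String, Provider] = TrieMap.empty

  // Unload a single provider; the pool itself is untouched.
  def unload(id: String): Unit = loaded.remove(id)

  // Run one maintenance round on `pool` without ever stopping the pool.
  def runMaintenance(pool: ExecutorService): Unit = {
    val futures = loaded.values.toList.map { p =>
      pool.submit(new Runnable {
        override def run(): Unit =
          try p.doMaintenance()
          catch { case NonFatal(_) => unload(p.id) } // isolate the failure
      })
    }
    futures.foreach(_.get()) // wait for this round to finish
  }

  def main(args: Array[String]): Unit = {
    loaded.put("a", Provider("a", failing = false))
    loaded.put("b", Provider("b", failing = true))
    val pool = Executors.newFixedThreadPool(2)
    try runMaintenance(pool)
    finally pool.shutdown() // demo cleanup only
    println(loaded.keys.toList) // only the healthy provider remains
  }
}
```

Because no task is interrupted in this arrangement, the other maintenance 
threads would not see an `InterruptedException` when one provider fails.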



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
