neilramaswamy opened a new pull request, #47475: URL: https://github.com/apache/spark/pull/47475
### What changes were proposed in this pull request?

Currently, an exception in the maintenance thread pool can cause the entire executor to exit. This PR changes the maintenance thread pool exception logic to _only_ unload the providers that throw exceptions; the entire thread pool is not stopped.

Historically, exceptions were bubbled up to the maintenance task (which manages the pool) as follows:

- A thread in the maintenance thread pool sees an exception and sets `threadPoolException` to a non-null value
- The next time the maintenance task (which is scheduled periodically) calls `doMaintenance()`, it checks whether that exception is non-null
- If it is non-null, it throws that exception
- It then _catches_ that exception and stops the thread pool

Since we no longer need to stop the entire maintenance thread pool and unload _all_ of the providers, a thread that encounters an exception can simply unload its own provider (see the sketch at the end of this description). The `onError` callback can then be removed, since it is no longer needed.

### Why are the changes needed?

Please see the JIRA for a full description of the error conditions. We don't want executors to exit because of maintenance thread pool thread failures.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- All existing UTs must pass. One test was removed, though (see below).
- Added a new UT that uses a fake state store provider which throws exceptions for partitions 0 and 1. The UT asserts that those two providers are unloaded, while all others remain loaded (see the test sketch at the end of this description).

There was an existing test, `SPARK-44438: maintenance task should be shutdown on error`, which asserts that the `SparkUncaughtExceptionHandler` should not be invoked even when an exception occurs. However, that test loads only one provider, so the only loaded provider is the one that experiences the exception. The root cause of this issue is that if _another_ provider is waiting on a lock (i.e. on an RPC in `verifyStateStoreInstanceActive`), that provider receives an `InterruptedException`, which leads to the `SparkUncaughtExceptionHandler` firing.

### Was this patch authored or co-authored using generative AI tooling?

No.
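
For illustration, here is a minimal, self-contained sketch of the change in failure handling. All names (`MaintenanceSketch`, `providers`, `doMaintenanceFor`, `scheduleMaintenance`) are hypothetical stand-ins for the real `StateStore` internals, not the actual Spark code:

```scala
import java.util.concurrent.{ConcurrentHashMap, Executors}

// Minimal sketch of the new failure handling. Each maintenance task guards its
// own provider: a failure unloads only that provider, and the pool keeps running.
object MaintenanceSketch {
  // Hypothetical registry of loaded providers, keyed by provider id.
  private val providers = new ConcurrentHashMap[String, AnyRef]()
  private val pool = Executors.newFixedThreadPool(4)

  // Stand-in for the real per-provider maintenance work (snapshots, cleanup).
  private def doMaintenanceFor(providerId: String): Unit = ()

  def scheduleMaintenance(providerId: String): Unit = {
    pool.submit(new Runnable {
      override def run(): Unit = {
        try {
          doMaintenanceFor(providerId)
        } catch {
          case _: Throwable =>
            // Before this PR: the thread set a shared `threadPoolException`,
            // which `doMaintenance()` later rethrew, stopping the whole pool
            // and unloading every provider.
            // After this PR: only the failing provider is unloaded.
            providers.remove(providerId)
        }
      }
    })
  }
}
```

Because each task now handles its own failure, the shared `threadPoolException` plumbing and the `onError` callback are no longer needed.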
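
And a self-contained sketch of the shape of the new unit test (again with hypothetical names; the real test exercises the actual maintenance pool rather than a plain loop):

```scala
import scala.collection.mutable

// Hypothetical fake provider whose maintenance fails for partitions 0 and 1.
class FailingMaintenanceProvider(val partitionId: Int) {
  def doMaintenance(): Unit = {
    if (partitionId == 0 || partitionId == 1) {
      throw new RuntimeException(s"simulated maintenance failure for partition $partitionId")
    }
  }
}

object MaintenanceTestSketch extends App {
  val loaded = mutable.Map((0 to 3).map(p => p -> new FailingMaintenanceProvider(p)): _*)

  // One maintenance pass: unload only the providers whose maintenance threw.
  loaded.toList.foreach { case (p, provider) =>
    try provider.doMaintenance()
    catch { case _: RuntimeException => loaded.remove(p) }
  }

  // Providers 0 and 1 are unloaded; providers 2 and 3 remain loaded.
  assert(!loaded.contains(0) && !loaded.contains(1))
  assert(loaded.contains(2) && loaded.contains(3))
}
```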
