neilramaswamy opened a new pull request, #47475:
URL: https://github.com/apache/spark/pull/47475

   ### What changes were proposed in this pull request?
   
   Currently, an exception in the maintenance thread pool can cause the entire executor to exit. This PR changes the maintenance thread pool exception handling to _only_ unload the providers that throw exceptions; the entire thread pool is no longer stopped.
   
   Historically, exceptions were bubbled up to the maintenance task (which manages the pool) in the following way (sketched after this list):
   
   - A thread in the maintenance thread pool encounters an exception and sets `threadPoolException` to a non-null value
   - The next time `doMaintenance()` is called by the maintenance task (which is scheduled periodically), it checks whether that exception is non-null
   - If it is non-null, it throws that exception
   - The maintenance task then _catches_ that exception and stops the entire thread pool
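   For reference, the old flow looked roughly like the sketch below. Apart from `doMaintenance()` and `threadPoolException`, the names are simplified stand-ins, not the actual Spark code:

```scala
import java.util.concurrent.atomic.AtomicReference

// Rough sketch of the old flow. Only doMaintenance() and
// threadPoolException correspond to names in the PR description;
// everything else is a simplified stand-in.
class MaintenanceFlowSketch {
  // Set by any maintenance pool thread that fails.
  private val threadPoolException = new AtomicReference[Throwable](null)

  // Runs inside a maintenance pool thread.
  private def maintainOneProvider(): Unit = {
    try {
      // ... perform maintenance for one provider ...
    } catch {
      case t: Throwable => threadPoolException.set(t)
    }
  }

  // Called periodically by the maintenance task that manages the pool.
  def doMaintenance(): Unit = {
    val error = threadPoolException.get()
    if (error != null) {
      // The caller caught this and stopped the entire thread pool,
      // which is the behavior this PR removes.
      throw error
    }
    // ... otherwise, schedule maintenance work onto the pool ...
  }
}
```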
   
   Now that we no longer need to stop the entire maintenance thread pool and unload _all_ of the providers, a maintenance pool thread that encounters an exception can simply unload its own provider (see the sketch below). We can then remove the `onError` callback, because it is no longer needed.
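   A minimal sketch of the new per-thread handling; `unloadProvider` and the surrounding names are hypothetical stand-ins for the real unload logic:

```scala
// Sketch of the new flow: a maintenance pool thread that hits an
// exception unloads only its own provider; the pool keeps running
// and no onError callback is needed. unloadProvider is a hypothetical
// stand-in, not the actual Spark method.
object PerProviderErrorHandling {
  private def unloadProvider(providerId: String): Unit = {
    // ... remove this provider from the loaded-providers map ...
  }

  def maintainProvider(providerId: String): Unit = {
    try {
      // ... perform maintenance (snapshots, cleanup) for this provider ...
    } catch {
      case _: Throwable =>
        // Only this provider is unloaded; other providers and the
        // thread pool itself are unaffected.
        unloadProvider(providerId)
    }
  }
}
```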
   
   ### Why are the changes needed?
   Please see the JIRA for a full description of the error conditions. We don't want executors to exit because a maintenance pool thread fails.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   - All existing UTs must pass. One test was removed, though (see below).
   - Added a new UT that uses a fake state store provider which throws exceptions for partitions 0 and 1. The UT asserts that the providers for these two partitions become unloaded, while all other providers remain loaded (see the sketch after this list).
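   Illustratively, the new test has roughly the following shape (hypothetical names, not the actual test code):

```scala
// Illustrative shape of the new UT, with hypothetical names.
// The fake provider fails maintenance only for partitions 0 and 1.
class FakeStateStoreProvider(partitionId: Int) {
  def doMaintenance(): Unit = {
    if (partitionId == 0 || partitionId == 1) {
      throw new RuntimeException(s"maintenance failed for partition $partitionId")
    }
    // partitions >= 2 complete maintenance normally
  }
}
// The test then asserts that the providers for partitions 0 and 1
// were unloaded, while all other providers remain loaded.
```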
   
   There was an existing test, `SPARK-44438: maintenance task should be shutdown on error`, which asserts that the `SparkUncaughtExceptionHandler` should not be invoked even if there is an exception. However, that test loads only one provider, so the only loaded provider is the one that experiences the exception. The root cause of this issue is that if _another_ provider is waiting on a lock (i.e. on an RPC in `verifyStateStoreInstanceActive`) when the pool is stopped, that provider receives an `InterruptedException`, which leads to the `SparkUncaughtExceptionHandler` firing.
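   To see why, the following self-contained sketch (plain JDK behavior, not Spark code) shows how stopping a thread pool interrupts a task that is blocked waiting, surfacing an `InterruptedException` in that thread:

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}

// Stopping a thread pool with shutdownNow() interrupts tasks that are
// blocked waiting, which is how a second provider in the scenario
// above ends up with an InterruptedException.
object InterruptOnShutdownDemo extends App {
  val pool = Executors.newFixedThreadPool(2)
  val latch = new CountDownLatch(1) // stands in for the lock/RPC wait

  pool.execute { () =>
    try {
      latch.await() // blocked, like a provider waiting on the RPC
    } catch {
      case _: InterruptedException =>
        println("blocked task was interrupted by the pool shutdown")
    }
  }

  Thread.sleep(100)  // give the task time to block
  pool.shutdownNow() // interrupts all running tasks
  pool.awaitTermination(1, TimeUnit.SECONDS)
}
```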
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   

