anishshri-db opened a new pull request, #44542:
URL: https://github.com/apache/spark/pull/44542

   ### What changes were proposed in this pull request?
   Fix deadlock between maintenance thread and streaming aggregation operator
   
   ### Why are the changes needed?
   This change fixes a race condition that can deadlock the task thread and 
the maintenance thread. The deadlock is primarily possible with the streaming 
aggregation operator, which uses two physical operators: 
`StateStoreRestoreExec` and `StateStoreSaveExec`. The first opens the store in 
read-only mode, and the second performs the actual commit.
   
   However, the following sequence of events leads to a deadlock:
   1. The task thread runs `StateStoreRestoreExec`, gets the store instance, 
and thereby acquires the DB instance lock
   2. The maintenance thread fails with an error
   3. The maintenance thread takes the `loadedProviders` lock and tries to call 
`close` on all the loaded providers, which needs the DB instance lock held by 
the task thread
   4. The task thread tries to execute the StateStoreRDD for the 
`StateStoreSaveExec` operator and tries to acquire the `loadedProviders` lock, 
which is held by the maintenance thread
   
   In short, if the maintenance thread is interleaved between the 
`restore/save` operations, the two threads deadlock: each holds one of the 
`loadedProviders` lock and the DB instance lock while waiting for the other.
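
The interleaving can be modeled outside Spark with two plain locks. The sketch below is only an illustration under stated assumptions: it is not Spark code, the function and variable names are hypothetical, and acquisitions use a timeout so the program reports the stuck state instead of hanging.

```python
import threading


def simulate_interleaving(timeout=0.2):
    """Replay the four steps with two plain locks and report who got stuck."""
    loaded_providers_lock = threading.Lock()  # stand-in for the `loadedProviders` lock
    db_instance_lock = threading.Lock()       # stand-in for the DB instance lock
    restore_done = threading.Event()
    maintenance_holds_providers = threading.Event()
    task_attempt_done = threading.Event()
    maintenance_attempt_done = threading.Event()
    result = {}

    def task_thread():
        # Step 1: the restore phase takes the DB instance lock and keeps it.
        db_instance_lock.acquire()
        restore_done.set()
        maintenance_holds_providers.wait()
        # Step 4: the save phase needs the providers lock. A timeout replaces
        # the indefinite wait so the sketch terminates.
        got = loaded_providers_lock.acquire(timeout=timeout)
        result["task_got_providers_lock"] = got
        if got:
            loaded_providers_lock.release()
        task_attempt_done.set()
        # Keep the DB lock until the other attempt has finished, so the
        # outcome does not depend on which timeout fires first.
        maintenance_attempt_done.wait()
        db_instance_lock.release()

    def maintenance_thread():
        restore_done.wait()
        # Steps 2-3: after a failure, take the providers lock and try to close,
        # which in this model requires the DB instance lock.
        with loaded_providers_lock:
            maintenance_holds_providers.set()
            got = db_instance_lock.acquire(timeout=timeout)
            result["maintenance_got_db_lock"] = got
            if got:
                db_instance_lock.release()
            maintenance_attempt_done.set()
            task_attempt_done.wait()

    threads = [threading.Thread(target=task_thread),
               threading.Thread(target=maintenance_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

Calling `simulate_interleaving()` returns `False` for both entries: neither thread obtains the lock the other one holds, which is the circular wait described above.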
   
   The fix is to release the resources at the end of the 
`StateStoreRestoreExec` operator. Note that `abort` is likely a misnomer for 
`ReadStateStore`, but we choose to follow the already provided API in this 
case.
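
As a rough model of why releasing at the end of the restore phase helps, the sketch below uses two plain locks in place of the `loadedProviders` lock and the DB instance lock. It is an assumption-laden illustration, not the actual patch: the names are hypothetical, and the `try`/`finally` release only stands in for calling `abort` on the read-only store.

```python
import threading


def simulate_with_release(timeout=5.0):
    """Same interleaving, but the restore phase releases its lock when done."""
    loaded_providers_lock = threading.Lock()  # stand-in for the `loadedProviders` lock
    db_instance_lock = threading.Lock()       # stand-in for the DB instance lock
    restore_done = threading.Event()
    maintenance_holds_providers = threading.Event()
    result = {}

    def task_thread():
        # Restore phase: read under the DB instance lock, then always release
        # it (the analog of releasing resources at the end of the operator).
        db_instance_lock.acquire()
        try:
            pass  # read-only work would happen here
        finally:
            db_instance_lock.release()
        restore_done.set()
        maintenance_holds_providers.wait()
        # Save phase: take the providers lock, then the DB instance lock.
        got_providers = loaded_providers_lock.acquire(timeout=timeout)
        got_db = got_providers and db_instance_lock.acquire(timeout=timeout)
        result["task_committed"] = bool(got_providers and got_db)
        if got_db:
            db_instance_lock.release()
        if got_providers:
            loaded_providers_lock.release()

    def maintenance_thread():
        restore_done.wait()
        # Maintenance interleaves between restore and save, as in the report.
        with loaded_providers_lock:
            maintenance_holds_providers.set()
            got = db_instance_lock.acquire(timeout=timeout)
            result["maintenance_closed"] = got
            if got:
                db_instance_lock.release()

    threads = [threading.Thread(target=task_thread),
               threading.Thread(target=maintenance_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

In this model the maintenance thread finds the DB instance lock free, finishes, and drops the providers lock, after which the save phase proceeds. Both entries of the returned dictionary are `True`.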
   
   Relevant Logs:
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Existing unit tests
   ```
   [info] Run completed in 6 minutes, 20 seconds.
   [info] Total number of tests run: 80
   [info] Suites: completed 1, aborted 0
   [info] Tests: succeeded 80, failed 0, canceled 0, ignored 0, pending 0
   [info] All tests passed.
   ```
   
   ### Was this patch authored or co-authored using generative AI tooling?
   Yes
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

