hvanhovell opened a new pull request #35991:
URL: https://github.com/apache/spark/pull/35991


   ### What changes were proposed in this pull request?
   This PR fixes a race in `BlockInfoManager` between `unlock` and 
`releaseAllLocksForTask` that can drive the reader count for a block negative 
(which trips an assert). The race occurs when the following events take place:
   
   1. [THREAD 1] calls `releaseAllLocksForTask`. This starts by collecting all 
the blocks to be unlocked for this task.
   2. [THREAD 2] calls `unlock` for a read lock for the same task (this means 
the block is also in the list collected in step 1). It then proceeds to unlock 
the block by decrementing the reader count.
   3. [THREAD 1] now starts to release the collected locks. It does this by 
decrementing the reader count of each block by the number of read locks the 
task acquired on it. The problem is that step 2 has already lowered the reader 
count, so the collected lock counts are stale and we decrement by one (or a 
few) too many. This triggers the negative reader count assert.
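
The interleaving above can be replayed deterministically in a single thread. The following is a minimal sketch using a toy model, not Spark's actual `BlockInfoManager`: `ToyLockManager`, `lockForReading`, `replay`, and the block id `"b"` are hypothetical names, and it assumes that the collection in step 1 detaches the task's per-block lock counts from the manager.

```scala
import scala.collection.mutable

// Toy model, not Spark's BlockInfoManager: every name here is hypothetical.
object UnlockRaceSketch {
  final class ToyLockManager {
    // Number of readers currently holding each block.
    val readerCount: mutable.Map[String, Int] = mutable.Map.empty
    // Per-task bookkeeping: task id -> (block id -> read locks held by that task).
    val readLocksByTask: mutable.Map[Long, mutable.Map[String, Int]] = mutable.Map.empty

    def lockForReading(task: Long, block: String): Unit = {
      readerCount(block) = readerCount.getOrElse(block, 0) + 1
      val held = readLocksByTask.getOrElseUpdate(task, mutable.Map.empty[String, Int])
      held(block) = held.getOrElse(block, 0) + 1
    }

    // Unguarded unlock: always decrements the block's reader count.
    def unlock(task: Long, block: String): Unit = {
      readerCount(block) = readerCount(block) - 1
      readLocksByTask.get(task).foreach { held =>
        held(block) = held.getOrElse(block, 0) - 1
      }
    }
  }

  // Replays the three steps in the problematic order and returns the final reader count.
  def replay(): Int = {
    val m = new ToyLockManager
    m.lockForReading(task = 1L, block = "b")

    // Step 1 [THREAD 1]: collect the task's read locks (snapshot: b -> 1).
    val collected =
      m.readLocksByTask.remove(1L).getOrElse(mutable.Map.empty[String, Int]).toMap

    // Step 2 [THREAD 2]: unlock the same read lock; the reader count goes 1 -> 0.
    m.unlock(task = 1L, block = "b")

    // Step 3 [THREAD 1]: release using the stale snapshot; the reader count goes 0 -> -1.
    collected.foreach { case (block, n) =>
      m.readerCount(block) = m.readerCount(block) - n
    }

    m.readerCount("b")
  }

  def main(args: Array[String]): Unit =
    println(replay()) // prints -1
}
```

In the real code the two calls run on different threads, so the bad ordering only shows up intermittently; the replay simply fixes the order to make the double decrement visible.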
   
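   We fix this by adding a check to `unlock` that makes sure 
`releaseAllLocksForTask` is not already releasing the locks of the task. We do 
this by checking whether there is still a multiset associated with the task 
that contains its read locks; if there is none, `unlock` leaves the reader 
count alone.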
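
A minimal sketch of such a guard, written against a toy model and not taken from the patch: `ToyLockManager`, `lockForReading`, `collectLocksForTask`, `releaseCollected`, and `replay` are hypothetical names, plain maps stand in for the multiset, and the sketch assumes that collecting a task's locks removes its entry from the per-task map. The real implementation also has to synchronize these steps, which is omitted here.

```scala
import scala.collection.mutable

// Toy model of the guard, not the actual patch: every name here is hypothetical.
object GuardedUnlockSketch {
  final class ToyLockManager {
    // Number of readers currently holding each block.
    val readerCount: mutable.Map[String, Int] = mutable.Map.empty
    // Per-task bookkeeping: task id -> (block id -> read locks held by that task).
    val readLocksByTask: mutable.Map[Long, mutable.Map[String, Int]] = mutable.Map.empty

    def lockForReading(task: Long, block: String): Unit = {
      readerCount(block) = readerCount.getOrElse(block, 0) + 1
      val held = readLocksByTask.getOrElseUpdate(task, mutable.Map.empty[String, Int])
      held(block) = held.getOrElse(block, 0) + 1
    }

    // Guarded unlock: only touch the counts while the task's entry still exists.
    def unlock(task: Long, block: String): Unit =
      readLocksByTask.get(task) match {
        case Some(held) =>
          readerCount(block) = readerCount(block) - 1
          held(block) = held.getOrElse(block, 0) - 1
        case None =>
          // The bulk release has already collected this task's locks and will
          // decrement the reader count itself, so there is nothing to do here.
      }

    // First half of the bulk release: detach and return the task's lock counts.
    def collectLocksForTask(task: Long): Map[String, Int] =
      readLocksByTask.remove(task).map(_.toMap).getOrElse(Map.empty)

    // Second half of the bulk release: decrement by the collected counts.
    def releaseCollected(collected: Map[String, Int]): Unit =
      collected.foreach { case (block, n) =>
        readerCount(block) = readerCount(block) - n
      }
  }

  // Replays collect -> unlock -> release and returns the final reader count.
  def replay(): Int = {
    val m = new ToyLockManager
    m.lockForReading(task = 1L, block = "b")
    val collected = m.collectLocksForTask(1L) // bulk release starts
    m.unlock(task = 1L, block = "b")          // guarded: becomes a no-op
    m.releaseCollected(collected)             // reader count goes 1 -> 0
    m.readerCount("b")
  }

  def main(args: Array[String]): Unit =
    println(replay()) // prints 0
}
```

With the guard in place, the late `unlock` is skipped and the bulk release performs the only decrement, so the reader count ends at zero.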
   
   
   ### Why are the changes needed?
   It is a bug. Not fixing this can cause negative reader counts for blocks, 
and this causes task failures.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Added a regression test in `BlockInfoManagerSuite`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


