hvanhovell opened a new pull request #34632: URL: https://github.com/apache/spark/pull/34632
### What changes were proposed in this pull request?

This PR moves the `BlockInfoManager` from a single mutex per instance to much finer-grained locking at the block level. Concretely, this PR makes the following changes:

- The use of `this.synchronized` for guarding against concurrent creation of a block has been replaced with a striped lock. This effectively replaces a single coarse-grained lock with a lock per block.
- The use of `this.synchronized` for `wait()` and `notifyAll()` has been replaced with per-block `Condition`s.
- Common logic has been extracted from `lockForWriting` and `lockForReading` into an `acquireLock` helper method. This deduplication is important given the size of the changes this PR needed to make to the shared code.
- Optimization: `currentTaskAttemptId` is now called only once per method, with its result stored in a local variable.

Hedged sketches of the striped-lock scheme and the `acquireLock` deduplication follow at the end of this description.

This PR is the first in a series. The next one will add group-based locking and removal so we can remove broadcasts and cached RDDs in a constant-time operation.

### Why are the changes needed?

The main motivation for this change is to increase and stabilize the throughput of clusters that run a significant number of concurrent queries. In the current situation these queries fight over the `BlockInfoManager` lock when they create broadcasts. The problem is worst after a full GC, which triggers clean-up of unused broadcasts. Under load, the number of broadcasts in the system can be significant (>> 10K), and we need to acquire the lock once per broadcast block. We end up with all the block manager worker threads, the `DAGScheduler` thread, and the query threads fighting for a single lock. This causes throughput to drop to close to 0 during these clean-up periods.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Functionally this is covered by existing tests. The performance has been checked by running 32 concurrent streams and hammering a cluster with queries. That shows the throughput drops are mostly gone now. I am not sure how well we can capture this in a non-invasive benchmark.
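---

Below is a minimal, hypothetical sketch of the striped-lock and per-block-`Condition` idea described above. This is not the PR's actual code: the names `StripedBlockLocks`, `numStripes`, and `withBlockLock` are illustrative, and block ids are simplified to strings.

```scala
import java.util.concurrent.locks.{Condition, ReentrantLock}
import scala.collection.concurrent.TrieMap

// Sketch: block ids hash onto a fixed array of lock stripes, and each block
// gets its own Condition for wait/notify, so threads touching different
// blocks no longer contend on one manager-wide mutex.
class StripedBlockLocks(numStripes: Int = 1024) {
  private val stripes = Array.fill(numStripes)(new ReentrantLock())
  // Per-block Conditions, created lazily under the owning stripe's lock.
  private val conditions = TrieMap.empty[String, Condition]

  private def stripeFor(blockId: String): ReentrantLock =
    stripes((blockId.hashCode & Integer.MAX_VALUE) % numStripes)

  // Run `body` while holding only the stripe that guards `blockId`.
  def withBlockLock[T](blockId: String)(body: Condition => T): T = {
    val lock = stripeFor(blockId)
    lock.lock()
    try {
      val cond = conditions.getOrElseUpdate(blockId, lock.newCondition())
      body(cond)
    } finally {
      lock.unlock()
    }
  }
}
```

In this scheme, a thread that finds a block unavailable calls `cond.await()` instead of `wait()`, and a thread releasing a lock calls `cond.signalAll()` instead of `notifyAll()`. Since a block always hashes to the same stripe, its `Condition` is always used under the lock that owns it.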

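And a hedged sketch of the `acquireLock` deduplication mentioned above, built on the `StripedBlockLocks` class from the previous sketch. `BlockInfo` here is a simplified stand-in for Spark's internal class, and its fields, the method signatures, and the lock-compatibility rules are assumptions for illustration only.

```scala
import scala.collection.concurrent.TrieMap

// Simplified stand-in for Spark's internal per-block bookkeeping.
class BlockInfo {
  var writerTask: Long = -1L // task attempt holding the write lock, or -1
  var readerCount: Int = 0   // number of outstanding read locks
}

class BlockLockManager {
  private val locks = new StripedBlockLocks()
  private val blocks = TrieMap.empty[String, BlockInfo]

  // Shared wait/acquire loop: `f` decides, under the block's lock, whether
  // the caller may take the lock, and performs the bookkeeping if so.
  private def acquireLock(blockId: String, blocking: Boolean)
                         (f: BlockInfo => Boolean): Option[BlockInfo] =
    locks.withBlockLock(blockId) { cond =>
      val info = blocks.getOrElseUpdate(blockId, new BlockInfo)
      var acquired = f(info)
      while (!acquired && blocking) {
        cond.await() // replaces wait() on the manager-wide mutex
        acquired = f(info)
      }
      if (acquired) Some(info) else None
    }

  // The two public lock methods now differ only in their compatibility test.
  def lockForReading(blockId: String, blocking: Boolean = true): Option[BlockInfo] =
    acquireLock(blockId, blocking) { info =>
      if (info.writerTask == -1L) { info.readerCount += 1; true } else false
    }

  def lockForWriting(blockId: String, taskId: Long,
                     blocking: Boolean = true): Option[BlockInfo] =
    acquireLock(blockId, blocking) { info =>
      if (info.writerTask == -1L && info.readerCount == 0) {
        info.writerTask = taskId
        true
      } else false
    }

  // Releasing a lock signals only the waiters for this particular block.
  def unlock(blockId: String, taskId: Long): Unit =
    locks.withBlockLock(blockId) { cond =>
      blocks.get(blockId).foreach { info =>
        if (info.writerTask == taskId) info.writerTask = -1L
        else if (info.readerCount > 0) info.readerCount -= 1
      }
      cond.signalAll() // replaces notifyAll() on the manager-wide mutex
    }
}
```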