Andrew Kyle Purtell created HBASE-29645:
-------------------------------------------

             Summary: AsyncBufferedMutatorImpl concurrency improvement
                 Key: HBASE-29645
                 URL: https://issues.apache.org/jira/browse/HBASE-29645
             Project: HBase
          Issue Type: Improvement
          Components: Client
    Affects Versions: 2.5.12, 2.6.3, 3.0.0-beta-1
            Reporter: Andrew Kyle Purtell
            Assignee: Andrew Kyle Purtell
             Fix For: 4.0.0-alpha-1, 3.0.0-beta-2, 2.6.4, 2.5.13


This patch modifies the AsyncBufferedMutatorImpl class to improve its performance 
under concurrent usage.

While AsyncTable#batch() is largely asynchronous in nature, it can exhibit 
blocking behavior during its preparation phase, for instance, while looking up 
region locations. The original implementation of AsyncBufferedMutatorImpl relied 
on coarse-grained synchronized blocks for thread safety, and calls to 
AsyncTable#batch() occurred inside one of them, potentially causing severe 
contention and stalling other threads trying to buffer their mutations. When one 
thread triggered a buffer flush (either because the buffer was full or a periodic 
timer fired), all other threads attempting to add mutations via the mutate method 
would be blocked until the table.batch() call completed, which could take a 
surprisingly long time.

The new implementation replaces the broad synchronized blocks with a 
ReentrantLock. This lock is acquired only for the brief period needed to safely 
copy the current batch of mutations and futures into local variables and swap 
in a new internal buffer. Immediately after this quick operation, the lock is 
released. The batch() call is then executed outside of the locked section. This 
allows other threads to continue adding new mutations concurrently while the 
flushing of the previous batch proceeds independently. The client has already 
opted in to asynchronous and potentially interleaved commit of the mutations 
submitted to AsyncBufferedMutator, by definition. Minimizing the scope of the 
critical section reduces thread contention and significantly boosts throughput 
under load. Other related profiler-driven efficiency changes are also included, 
such as elimination of Stream API and array-resizing hotspots.
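
The shape of the change can be sketched roughly as follows. This is a simplified 
illustration of the swap-under-lock pattern described above, not the actual patch: 
the class and member names are invented for the example, flushing is triggered 
only by a mutation count, and the periodic flush timer, write buffer size 
accounting, close() handling, and error handling of the real 
AsyncBufferedMutatorImpl are omitted.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.locks.ReentrantLock;

import org.apache.hadoop.hbase.client.AsyncTable;
import org.apache.hadoop.hbase.client.Mutation;

// Illustrative sketch only; not the actual AsyncBufferedMutatorImpl code.
public class NarrowLockBufferSketch {

  private final ReentrantLock lock = new ReentrantLock();
  private final AsyncTable<?> table;
  private final int maxMutations;

  private List<Mutation> mutations = new ArrayList<>();
  private List<CompletableFuture<Void>> futures = new ArrayList<>();

  public NarrowLockBufferSketch(AsyncTable<?> table, int maxMutations) {
    this.table = table;
    this.maxMutations = maxMutations;
  }

  public CompletableFuture<Void> mutate(Mutation mutation) {
    CompletableFuture<Void> future = new CompletableFuture<>();
    List<Mutation> toSend = null;
    List<CompletableFuture<Void>> toComplete = null;
    lock.lock();
    try {
      // The critical section covers only the buffer bookkeeping.
      mutations.add(mutation);
      futures.add(future);
      if (mutations.size() >= maxMutations) {
        // Swap in fresh buffers; the full batch is flushed outside the lock.
        toSend = mutations;
        toComplete = futures;
        mutations = new ArrayList<>();
        futures = new ArrayList<>();
      }
    } finally {
      lock.unlock();
    }
    if (toSend != null) {
      // batch() may block briefly (e.g. on a region location lookup), but
      // other threads can keep buffering because the lock is already released.
      List<CompletableFuture<Void>> results = table.batch(toSend);
      for (int i = 0; i < results.size(); i++) {
        CompletableFuture<Void> dst = toComplete.get(i);
        results.get(i).whenComplete((r, e) -> {
          if (e != null) {
            dst.completeExceptionally(e);
          } else {
            dst.complete(null);
          }
        });
      }
    }
    return future;
  }
}
{code}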

To validate the performance improvement of these changes, a JMH benchmark, 
AsyncBufferedMutatorBenchmark (attached to this issue), was created to measure 
the performance of the mutate method under various conditions. It focuses 
specifically on the overhead and concurrency management of 
AsyncBufferedMutatorImpl itself, not the underlying network communication. To 
achieve this, it uses the Mockito framework to create a mock AsyncTable that 
instantly returns completed futures, isolating the mutator's buffering logic 
for measurement. It runs tests with 1, 10, and 100 threads to simulate no, 
medium, and high levels of concurrency. It uses a low value (100) for 
maxMutations to force frequent flushes based on the number of mutations, and a 
very high value (100,000) to ensure flushes are rare in that measurement case. 
The benchmark measures the average time per operation in microseconds, where a 
lower score indicates better performance and higher throughput.
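
To give a rough idea of the harness shape, the following JMH sketch is not the 
attached AsyncBufferedMutatorBenchmark (which exercises AsyncBufferedMutatorImpl 
itself and also injects an occasional synthetic 1 ms sleep into the mocked 
batch() call). To stay self-contained it drives the simplified 
NarrowLockBufferSketch class from the previous sketch, and only shows how a 
Mockito-mocked AsyncTable returning already completed futures isolates the 
buffering and locking overhead from any real I/O.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.hbase.client.AsyncTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Row;
import org.apache.hadoop.hbase.util.Bytes;
import org.mockito.Mockito;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class MutateBenchmarkSketch {

  // A low value forces frequent flushes; a very high value makes them rare.
  @Param({ "100", "100000" })
  public int maxMutations;

  public NarrowLockBufferSketch mutator;

  @Setup(Level.Trial)
  public void setUp() {
    // Mock AsyncTable whose batch() instantly returns completed futures, so
    // the measurement isolates buffering and locking overhead, not network IO.
    AsyncTable<?> mockTable = Mockito.mock(AsyncTable.class);
    Mockito.when(mockTable.batch(Mockito.anyList())).thenAnswer(invocation -> {
      List<? extends Row> actions = invocation.getArgument(0);
      List<CompletableFuture<Void>> results = new ArrayList<>(actions.size());
      for (int i = 0; i < actions.size(); i++) {
        results.add(CompletableFuture.completedFuture(null));
      }
      return results;
    });
    mutator = new NarrowLockBufferSketch(mockTable, maxMutations);
  }

  @Benchmark
  @Threads(100) // also run with 1 and 10 threads to vary contention
  public CompletableFuture<Void> mutate() {
    return mutator.mutate(new Put(Bytes.toBytes("row"))
      .addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v")));
  }
}
{code}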

With a single thread and no contention the performance of both implementations 
is nearly identical. The minor variations are negligible and show that the new 
locking mechanism does not introduce any performance regression in the 
non-concurrent case. For example, with a 10MB buffer and high maxMutations, the 
NEW implementation scored 0.167 µs/op while the OLD scored 0.169 µs/op, a 
statistically insignificant difference. When the test is run with 10 threads, a 
noticeable gap appears. In the scenario designed to cause frequent flushes 
(maxMutations = 100), the NEW implementation is approximately 12 times faster 
than the OLD one (14.250 µs/op for NEW vs. 172.463 µs/op for OLD). This is 
because the OLD implementation forces threads to wait while flushes occur, and 
flushes incur a synthetic thread sleep of 1ms to simulate occasional unexpected 
blocking behavior in AsyncTable#batch(), whereas the NEW implementation allows 
them to proceed without contention. The most significant results come from the 
100-thread tests, which simulate high contention. In the frequent flush 
scenario (maxMutations = 100) the NEW implementation is 114 times faster in the 
synthetic benchmark scenario (16.123 µs/op for NEW vs. 1847.567 µs/op for OLD). 
Note that blocking I/O observed in a real client, e.g. for region location 
lookups, can produce a much more significant impact. With the OLD code, 100 
threads are constantly competing for a lock that is held for a long duration, 
leading to a contention storm. The NEW code's reduced locking scope almost 
entirely eliminates this bottleneck.

Hardware: Apple Silicon (aarch64) M1 Max / 64 GB

JVM: openjdk version "17.0.11" 2024-04-16 LTS  / OpenJDK 64-Bit Server VM 
Zulu17.50+19-CA (build 17.0.11+9-LTS, mixed mode, sharing)


|Threads|Max Mutations|OLD Implementation (µs/op)|NEW Implementation (µs/op)|Performance Gain (OLD / NEW)|
|1|100|14.091|16.313|0.86x (comparable)|
|1|100,000|0.169|0.167|1.01x (comparable)|
|10|100|172.463|14.250|12.10x|
|10|100,000|2.465|1.072|2.30x|
|100|100|1847.567|16.123|114.59x|
|100|100,000|24.125|12.796|1.89x|



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
