Andrew Kyle Purtell created HBASE-29645:
-------------------------------------------
Summary: AsyncBufferedMutatorImpl concurrency improvement
Key: HBASE-29645
URL: https://issues.apache.org/jira/browse/HBASE-29645
Project: HBase
Issue Type: Improvement
Components: Client
Affects Versions: 2.5.12, 2.6.3, 3.0.0-beta-1
Reporter: Andrew Kyle Purtell
Assignee: Andrew Kyle Purtell
Fix For: 4.0.0-alpha-1, 3.0.0-beta-2, 2.6.4, 2.5.13
This patch modifies the AsyncBufferedMutatorImpl class to improve its
performance under concurrent usage.
While AsyncTable#batch() is largely asynchronous in nature, it can exhibit
blocking behavior during its preparation phase, for instance while looking up
region locations. The original implementation of AsyncBufferedMutatorImpl
relied on coarse-grained synchronized blocks for thread safety, and calls to
AsyncTable#batch() occurred inside one of those blocks, potentially causing
severe contention. When one thread triggered a buffer flush (either because
the buffer was full or because the periodic flush timer fired), all other
threads attempting to buffer their mutations via the mutate method were
blocked until the table.batch() call completed, which could take a
surprisingly long time.
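For illustration, a minimal sketch of the problematic shape, not the actual
HBase source (the class name, fields, and single-mutation signature below are
simplifications):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

import org.apache.hadoop.hbase.client.AsyncTable;
import org.apache.hadoop.hbase.client.Mutation;

// Simplified stand-in for the OLD coarse-grained locking, not the real code.
class CoarseLockMutatorSketch {
  private final AsyncTable<?> table;
  private final long writeBufferSize;
  private List<Mutation> mutations = new ArrayList<>();
  private List<CompletableFuture<Void>> futures = new ArrayList<>();
  private long bufferedSize;

  CoarseLockMutatorSketch(AsyncTable<?> table, long writeBufferSize) {
    this.table = table;
    this.writeBufferSize = writeBufferSize;
  }

  // Every caller serializes on 'this'. Whichever thread triggers the flush
  // keeps the monitor through table.batch(), so all other mutate() calls wait.
  synchronized CompletableFuture<Void> mutate(Mutation mutation) {
    CompletableFuture<Void> future = new CompletableFuture<>();
    mutations.add(mutation);
    futures.add(future);
    bufferedSize += mutation.heapSize();
    if (bufferedSize >= writeBufferSize) {
      // batch() can block here (e.g. on region location lookups)
      // while the monitor is still held.
      table.batch(mutations);
      // ... completion wiring from the batch futures to the buffered
      // futures omitted for brevity ...
      mutations = new ArrayList<>();
      futures = new ArrayList<>();
      bufferedSize = 0;
    }
    return future;
  }
}
{code}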
The new implementation replaces the broad synchronized blocks with a
ReentrantLock. This lock is acquired only for the brief period needed to safely
copy the current batch of mutations and futures into local variables and swap
in a new internal buffer. Immediately after this quick operation, the lock is
released. The batch() call is then executed outside of the locked section. This
allows other threads to continue adding new mutations concurrently while the
flush of the previous batch proceeds independently. By definition, the client
has already opted in to asynchronous and potentially interleaved commit of the
mutations submitted to AsyncBufferedMutator. Minimizing the scope of the
critical section reduces thread contention and significantly boosts throughput
under load. Other related profiler-driven efficiency changes are also included,
such as elimination of Stream API and array-resizing hotspots identified by the
profiler.
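A minimal sketch of that buffer-swap pattern, assuming the same simplified
fields and single-mutation signature as above; the maxMutations count check
and periodic flush timer of the real class are elided:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.locks.ReentrantLock;

import org.apache.hadoop.hbase.client.AsyncTable;
import org.apache.hadoop.hbase.client.Mutation;

// Simplified stand-in for the NEW locking scheme, not the committed patch.
class LockSwapMutatorSketch {
  private final AsyncTable<?> table;
  private final long writeBufferSize;
  private final ReentrantLock lock = new ReentrantLock();
  private List<Mutation> mutations = new ArrayList<>();
  private List<CompletableFuture<Void>> futures = new ArrayList<>();
  private long bufferedSize;

  LockSwapMutatorSketch(AsyncTable<?> table, long writeBufferSize) {
    this.table = table;
    this.writeBufferSize = writeBufferSize;
  }

  CompletableFuture<Void> mutate(Mutation mutation) {
    CompletableFuture<Void> future = new CompletableFuture<>();
    List<Mutation> toSend = null;
    List<CompletableFuture<Void>> toComplete = null;
    // Critical section: held only long enough to buffer and, when the
    // buffer is full, swap in fresh lists.
    lock.lock();
    try {
      mutations.add(mutation);
      futures.add(future);
      bufferedSize += mutation.heapSize();
      if (bufferedSize >= writeBufferSize) {
        toSend = mutations;
        toComplete = futures;
        mutations = new ArrayList<>();
        futures = new ArrayList<>();
        bufferedSize = 0;
      }
    } finally {
      lock.unlock();
    }
    if (toSend != null) {
      // batch() may still block (e.g. on region location lookups), but it
      // no longer stalls concurrent mutate() callers.
      List<CompletableFuture<Object>> results = table.batch(toSend);
      for (int i = 0; i < results.size(); i++) {
        CompletableFuture<Void> f = toComplete.get(i);
        results.get(i).whenComplete((r, e) -> {
          if (e != null) {
            f.completeExceptionally(e);
          } else {
            f.complete(null);
          }
        });
      }
    }
    return future;
  }
}
{code}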
To validate the performance improvement of these changes, a JMH benchmark,
AsyncBufferedMutatorBenchmark (attached to this issue), was created to measure
the performance of the mutate method under various conditions. It focuses
specifically on the overhead and concurrency management of
AsyncBufferedMutatorImpl itself, not the underlying network communication. To
achieve this, it uses the Mockito framework to create a mock AsyncTable that
instantly returns completed futures, isolating the mutator's buffering logic
for measurement. It runs tests with 1, 10, and 100 threads to simulate no,
moderate, and high contention. It uses a low maxMutations value (100) to force
frequent count-based flushes, and a very high value (100,000) to ensure flushes
are rare in that measurement case.
The benchmark measures the average time per operation in microseconds, where a
lower score indicates better performance and higher throughput.
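The rough shape of that harness is sketched below; the attached
AsyncBufferedMutatorBenchmark is the authoritative version, and
newMutatorOverMockTable is a hypothetical stand-in for however the benchmark
actually constructs the mutator under test:

{code:java}
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

import org.apache.hadoop.hbase.client.AsyncBufferedMutator;
import org.apache.hadoop.hbase.client.AsyncTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.mockito.Mockito;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class MutatorBenchmarkSketch {

  AsyncBufferedMutator mutator;

  @SuppressWarnings("unchecked")
  @Setup
  public void setUp() {
    // Mock AsyncTable whose batch() returns already-completed futures, so
    // only the mutator's own buffering and locking costs are measured.
    AsyncTable<?> table = Mockito.mock(AsyncTable.class);
    Mockito.when(table.batch(Mockito.anyList())).thenAnswer(inv -> {
      List<?> actions = inv.getArgument(0);
      // The attached benchmark also injects a synthetic 1 ms sleep here to
      // model the occasional blocking described below.
      return actions.stream()
        .map(a -> CompletableFuture.completedFuture(null))
        .collect(Collectors.toList());
    });
    // Hypothetical helper; buffer size and maxMutations are the knobs the
    // real benchmark varies across runs.
    mutator = newMutatorOverMockTable(table, 10L * 1024 * 1024, 100);
  }

  @Benchmark
  @Threads(10) // the attached benchmark sweeps 1, 10, and 100 threads
  public CompletableFuture<Void> mutate() {
    return mutator.mutate(new Put(Bytes.toBytes("row")));
  }

  // Hypothetical: stands in for constructing the AsyncBufferedMutatorImpl
  // under test over the mocked table; details are in the attached benchmark.
  private static AsyncBufferedMutator newMutatorOverMockTable(
      AsyncTable<?> table, long writeBufferSize, int maxMutations) {
    throw new UnsupportedOperationException("see attached benchmark");
  }
}
{code}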
With a single thread and no contention, the performance of both implementations
is nearly identical. The minor variations are negligible and show that the new
locking mechanism does not introduce any performance regression in the
non-concurrent case. For example, with a 10 MB buffer and high maxMutations,
the NEW implementation scored 0.167 µs/op while the OLD scored 0.169 µs/op, a
statistically insignificant difference. When the test is run with 10 threads, a
noticeable gap appears. In the scenario designed to cause frequent flushes
(maxMutations = 100), the NEW implementation is approximately 12 times faster
than the OLD one (14.250 µs/op for NEW vs. 172.463 µs/op for OLD). This is
because the OLD implementation forces threads to wait while flushes occur; in
the benchmark, each flush incurs a synthetic 1 ms thread sleep to simulate
occasional unexpected blocking in AsyncTable#batch(). The NEW implementation
allows other threads to proceed without contention. The most significant
results come from the 100-thread tests, which simulate high contention. In the
frequent flush scenario (maxMutations = 100), the NEW implementation is 114
times faster in the synthetic benchmark scenario (16.123 µs/op for NEW vs.
1847.567 µs/op for OLD). Note that the blocking I/O observed in a real client,
e.g. for region location lookups, can have a much greater impact. With the OLD
code, 100 threads constantly compete for a lock that is held for a long
duration, leading to a contention storm. The NEW code's reduced locking scope
almost entirely eliminates this bottleneck.
Benchmark environment:
OS: Apple Silicon (aarch64) M1 Max / 64 GB
JVM: openjdk version "17.0.11" 2024-04-16 LTS / OpenJDK 64-Bit Server VM Zulu17.50+19-CA (build 17.0.11+9-LTS, mixed mode, sharing)
|Threads|Max Mutations|OLD Implementation (µs/op)|NEW Implementation (µs/op)|Performance Gain (OLD / NEW)|
|1|100|14.091|16.313|0.86x (comparable)|
|1|100,000|0.169|0.167|1.01x (comparable)|
|10|100|172.463|14.250|12.10x|
|10|100,000|2.465|1.072|2.30x|
|100|100|1847.567|16.123|114.59x|
|100|100,000|24.125|12.796|1.89x|