Bryan Beaudreault created HBASE-26575:
-----------------------------------------
Summary: StoreHotnessProtector may block Replication
Key: HBASE-26575
URL: https://issues.apache.org/jira/browse/HBASE-26575
Project: HBase
Issue Type: Bug
Reporter: Bryan Beaudreault
I'm upgrading from hbase1 to hbase2, and I'm still in my QA environment where
load is very low. Even so, I've noticed a bad interaction between Replication
and the StoreHotnessProtector.
The ReplicationSink collects edits from the WAL and executes them in batches
via the normal HTable interface. The max batch size is taken from
"hbase.rpc.rows.warning.threshold" (default 5000), despite that property's name
suggesting it only controls a warning.
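For reference, the sink-side defaults involved, as I read them (the second key
comes from my reading of ReplicationSink's connection setup, so worth
double-checking):

  hbase.rpc.rows.warning.threshold       = 5000  (used as the sink's max batch size)
  replication.sink.client.retries.number = 4     (retries for the sink's client)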
The StoreHotnessProtector defaults to allowing 10 concurrent writes (of more
than 100 columns) to a Store, or 20 concurrent "prepares" of such writes. The
prepare step is what causes issues here. When a batch mutate comes in, the RS
first takes a lock on all rows in the batch. This happens in
HRegion#lockRowsAndBuildMiniBatch, and each write is recorded as "preparing" in
StoreHotnessProtector before its row lock is acquired. The recording basically
increments a counter and throws an exception if that counter goes over 20.
Back in HRegion#lockRowsAndBuildMiniBatch, the exception is caught and recorded
in the results for each item that failed. Any items that succeed continue on
to the write, unless the batch is atomic, in which case the exception is
immediately rethrown.
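To illustrate, here is a minimal sketch of that prepare accounting, assuming
the behavior described above (class and method names are mine, not the actual
implementation; the config keys in the comments are the ones I see in the
code):

  import java.util.concurrent.atomic.AtomicInteger;

  // Condensed model of StoreHotnessProtector's prepare accounting.
  class StoreHotnessSketch {
    // hbase.region.store.parallel.put.limit (default 10)
    static final int PARALLEL_PUT_LIMIT = 10;
    // hbase.region.store.parallel.prepare.put.multiplier (default 2),
    // so 2 * 10 = 20 concurrent prepares per Store
    static final int PARALLEL_PREPARE_LIMIT = 2 * PARALLEL_PUT_LIMIT;
    // hbase.region.store.parallel.put.limit.min.column.count (default 100)
    static final int MIN_COLUMN_COUNT = 100;

    private final AtomicInteger preparePutCounter = new AtomicInteger();

    // Called per Store before the row lock is taken in
    // HRegion#lockRowsAndBuildMiniBatch.
    void startPrepare(int columnCount) {
      if (columnCount <= MIN_COLUMN_COUNT) {
        return; // narrow writes are never throttled
      }
      // A pure counter: no check of actual load on the RegionServer.
      if (preparePutCounter.incrementAndGet() > PARALLEL_PREPARE_LIMIT) {
        preparePutCounter.decrementAndGet();
        // Stands in for the RegionTooBusyException the real class throws.
        throw new RuntimeException("too busy: over " + PARALLEL_PREPARE_LIMIT
            + " concurrent prepares on this Store");
      }
    }

    // Called once the prepare completes, releasing the slot.
    void finishPrepare(int columnCount) {
      if (columnCount > MIN_COLUMN_COUNT) {
        preparePutCounter.decrementAndGet();
      }
    }
  }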
This response gets back to the client, which automatically handles retries.
With enough retries, the batch call will eventually succeed, because each
retry contains fewer and fewer writes to handle. Assuming you have enough
retries, this basically enforces automatic chunking of a batch write into
sub-batches of 20. Again, this only affects writes that hit more than 100
columns (by default).
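To put a number on it: with the prepare limit at 20, a batch of N qualifying
writes needs roughly ceil(N / 20) attempts to fully drain, since each attempt
lets only ~20 rows per Store through and the rest come back as retries.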
At this point I'll say that this in general seems overly aggressive, especially
since the StoreHotnessProtector doesn't do any checks against actual load on
the RS. You could have a totally idle RegionServer and submit a single batch
of 100 Puts with 101 columns each; if you don't have at least 5 retries
configured, the batch will fail.
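For what it's worth, a repro sketch along those lines against the client API
(the table name, family, and qualifiers are placeholders; the configs are the
standard client ones):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HotnessRepro {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      // Fewer than the ~5 attempts needed to drain 100 rows, 20 at a time.
      conf.setInt("hbase.client.retries.number", 4);

      try (Connection conn = ConnectionFactory.createConnection(conf);
           Table table = conn.getTable(TableName.valueOf("test"))) {
        List<Put> puts = new ArrayList<>();
        for (int row = 0; row < 100; row++) {
          Put put = new Put(Bytes.toBytes("row-" + row));
          // 101 columns per Put: crosses the 100-column hotness threshold.
          for (int col = 0; col <= 100; col++) {
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("q" + col),
                Bytes.toBytes("v"));
          }
          puts.add(put);
        }
        // Expected to fail against an idle cluster once retries run out.
        table.batch(puts, new Object[puts.size()]);
      }
    }
  }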
Back to ReplicationSink: the default batch size is 5000 Puts and the default
number of retries is 4. For a table with wide rows (which can cause replication
to sink Puts with more than 100 columns), it becomes basically impossible to
replicate, because 4 retries is nowhere near enough to move through a batch of
up to 5000 writes, 20 at a time.
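Doing the math at the defaults: ceil(5000 / 20) = 250 attempts would be needed
to drain one full sink batch, but the sink gives up after 4 retries.
Replication can never make it through, and the queue just backs up.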