terrytlu created HBASE-30234:
--------------------------------
Summary: Replication shippedBytes metric overflows to negative due
to int truncation of batch size
Key: HBASE-30234
URL: https://issues.apache.org/jira/browse/HBASE-30234
Project: HBase
Issue Type: Bug
Components: metrics, Replication
Affects Versions: 2.4.5
Reporter: terrytlu
Summary
-------
Several size-tracking variables in the replication source pipeline use `int`
instead of `long`, causing integer overflow when the cumulative WAL entry batch
size exceeds Integer.MAX_VALUE (~2GB). This results in negative values for JMX
metrics (e.g., `shippedBytes`) and incorrect throttling behavior.

Observed Symptoms
-----------------
- The RegionServer JMX metric `shippedBytes` reports negative values.
- Replication throttling may malfunction since the bandwidth calculation
  receives a negative batch size.
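
To illustrate how a size above Integer.MAX_VALUE turns negative, here is a minimal standalone sketch. It is not HBase code: the class name `HeapSizeTruncationDemo`, the helper `truncateToInt`, and the 3,000,000,000-byte value are made up for illustration. Only the narrowing cast is taken from this report.

```java
final class HeapSizeTruncationDemo {

    // Same kind of narrowing cast as described in this issue: only the low
    // 32 bits of the long are kept, so large values wrap around.
    static int truncateToInt(long heapSize) {
        return (int) heapSize;
    }

    public static void main(String[] args) {
        long heapSize = 3_000_000_000L; // hypothetical batch size above 2^31 - 1

        // Prints -1294967296 (3,000,000,000 - 2^32).
        System.out.println(truncateToInt(heapSize));

        // Math.toIntExact fails loudly instead of wrapping silently.
        try {
            Math.toIntExact(heapSize);
        } catch (ArithmeticException e) {
            System.out.println("value does not fit in an int");
        }
    }
}
```

A counter incremented by such a value decreases instead of increasing, which would be consistent with the negative `shippedBytes` readings described above.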

Root Cause
----------
In `ReplicationSourceShipper.shipEdits()`, the heap size of a WAL entry batch
is cast from `long` to `int`:

    int currentSize = (int) entryBatch.getHeapSize();

`WALEntryBatch.getHeapSize()` returns a `long`, but the downcast to `int`
causes silent overflow when the value exceeds 2,147,483,647 bytes (~2GB). This
truncated value propagates through:

1. `ReplicationSource.tryThrottle(int batchSize)`: the throttler receives a
   negative size, producing incorrect sleep intervals.
2. `MetricsSource.shipBatch(long batchSize, int sizeInBytes)`: the
   `shippedBytes` metric is incremented by a negative value.
3. `ReplicationEndpoint.ReplicateContext.size` (int): the endpoint receives a
   truncated size.
4. `ReplicationSourceWALReader.sizeOfStoreFilesIncludeBulkLoad()`: accumulates
   store file sizes into an `int`, which also overflows for bulk loads with
   large store files:

       totalStoreFilesSize = (int) (totalStoreFilesSize +
           stores.get(j).getStoreFileSizeBytes());

5. `ReplicationThrottler.getNextSleepInterval(int size)`: accepts an `int`
   and loses precision.
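
The accumulation pattern in item 4 can be compared with a `long`-based variant in a small standalone sketch. This is not the HBase implementation and not a proposed patch: the class `StoreFileSizeSum`, its two methods, and the sample sizes are hypothetical. It only shows that widening the accumulator avoids the wraparound.

```java
import java.util.Arrays;
import java.util.List;

final class StoreFileSizeSum {

    // Pattern from item 4: the running total is cast back to int after each
    // addition, so it wraps once it passes Integer.MAX_VALUE.
    static int sumAsInt(List<Long> storeFileSizes) {
        int total = 0;
        for (long size : storeFileSizes) {
            total = (int) (total + size);
        }
        return total;
    }

    // Same loop with a long accumulator and no narrowing cast.
    static long sumAsLong(List<Long> storeFileSizes) {
        long total = 0L;
        for (long size : storeFileSizes) {
            total += size;
        }
        return total;
    }

    public static void main(String[] args) {
        // Three hypothetical store files of 1,000,000,000 bytes each.
        List<Long> sizes = Arrays.asList(1_000_000_000L, 1_000_000_000L, 1_000_000_000L);
        System.out.println(sumAsInt(sizes));  // prints -1294967296
        System.out.println(sumAsLong(sizes)); // prints 3000000000
    }
}
```

Applying the same widening to the affected variables and to the `int` parameters listed above would presumably keep the full value intact along the whole path, assuming every caller and callee in the chain is changed together.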
--
This message was sent by Atlassian Jira
(v8.20.10#820010)