[ https://issues.apache.org/jira/browse/HBASE-30234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
terrytlu updated HBASE-30234:
-----------------------------
Description:
Summary
-------
Several size-tracking variables in the replication source pipeline are
declared as `int` instead of `long`, so they overflow when the cumulative size
of a WAL entry batch exceeds Integer.MAX_VALUE (2,147,483,647 bytes, ~2GB).
The result is negative values in JMX metrics (e.g., `shippedBytes`) and
incorrect throttling behavior.
Observed Symptoms
-----------------
- The RegionServer JMX metric `shippedBytes` reports negative values.
- Replication throttling may malfunction because the bandwidth calculation
  receives a negative batch size.
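To illustrate how a byte count can turn negative, the following minimal sketch reproduces the narrowing cast in isolation. The class and method names are hypothetical and are not part of HBase; only the `(int)` cast mirrors the reported pattern. Note that sizes above 4 GiB wrap back to a positive but still wrong value.

```java
// Illustrative only: NarrowingDemo and truncate() are hypothetical names,
// not HBase code. Only the (int) cast mirrors the reported pattern.
final class NarrowingDemo {

    /** Narrows a long byte count to int, keeping only the low 32 bits. */
    static int truncate(long sizeInBytes) {
        return (int) sizeInBytes;
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30;
        long[] sizes = { 2 * oneGiB - 1, 2 * oneGiB, 3 * oneGiB, 5 * oneGiB };
        for (long size : sizes) {
            // Prints:
            // 2147483647 -> 2147483647
            // 2147483648 -> -2147483648
            // 3221225472 -> -1073741824
            // 5368709120 -> 1073741824
            System.out.println(size + " -> " + truncate(size));
        }
    }
}
```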
Root Cause
----------
In `ReplicationSourceShipper.shipEdits()`, the heap size of a WAL entry batch
is cast from `long` to `int`:

    int currentSize = (int) entryBatch.getHeapSize();

`WALEntryBatch.getHeapSize()` returns a `long`, and the narrowing cast to
`int` silently discards the upper 32 bits when the value exceeds
2,147,483,647 bytes (~2GB). The truncated value then propagates through:
1. `ReplicationSource.tryThrottle(int batchSize)`: the throttler receives a
   negative size and computes incorrect sleep intervals.
2. `MetricsSource.shipBatch(long batchSize, int sizeInBytes)`: the
   `shippedBytes` metric is incremented by a negative value.
3. `ReplicationEndpoint.ReplicateContext.size` (`int`): the endpoint receives
   the truncated size.
4. `ReplicationSourceWALReader.sizeOfStoreFilesIncludeBulkLoad()`: accumulates
   store file sizes into an `int`, which also overflows for bulk loads with
   large store files:

       totalStoreFilesSize = (int) (totalStoreFilesSize +
           stores.get(j).getStoreFileSizeBytes());

5. `ReplicationThrottler.getNextSleepInterval(int size)`: accepts an `int`, so
   it cannot represent sizes above Integer.MAX_VALUE.
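The accumulation pattern in item 4 can be sketched in isolation as follows. This is a minimal illustration, not the HBase implementation: `StoreFileSizeSum`, `sumAsInt`, and `sumAsLong` are hypothetical names, and the store file sizes are made-up inputs. The first method narrows the running total on every step, as in the snippet above; the second keeps the total as a `long`, which is one possible direction for a fix.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: all names here are hypothetical stand-ins,
// not the actual HBase classes or methods.
final class StoreFileSizeSum {

    /** Mirrors the reported pattern: the total is narrowed to int each step. */
    static int sumAsInt(List<Long> storeFileSizes) {
        int total = 0;
        for (long size : storeFileSizes) {
            total = (int) (total + size); // long addition, then truncation
        }
        return total;
    }

    /** Widened variant: the running total stays a long throughout. */
    static long sumAsLong(List<Long> storeFileSizes) {
        long total = 0L;
        for (long size : storeFileSizes) {
            total += size;
        }
        return total;
    }

    public static void main(String[] args) {
        // Two assumed store files of 1.5 GB each; each fits in an int on its own.
        List<Long> sizes = Arrays.asList(1_500_000_000L, 1_500_000_000L);
        System.out.println("int total:  " + sumAsInt(sizes));  // -1294967296
        System.out.println("long total: " + sumAsLong(sizes)); // 3000000000
    }
}
```

Neither input exceeds Integer.MAX_VALUE by itself, which shows that the overflow depends on the accumulated total rather than on any single store file.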
> Replication shippedBytes metric overflows to negative due to int truncation
> of batch size
> -----------------------------------------------------------------------------------------
>
> Key: HBASE-30234
> URL: https://issues.apache.org/jira/browse/HBASE-30234
> Project: HBase
> Issue Type: Bug
> Components: metrics, Replication
> Affects Versions: 2.4.5
> Reporter: terrytlu
> Priority: Minor
>