[ 
https://issues.apache.org/jira/browse/HBASE-30234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

terrytlu updated HBASE-30234:
-----------------------------
    Description: 
Summary
-------
Several size-tracking variables in the replication source pipeline use `int`
instead of `long`, causing integer overflow when the cumulative size of a WAL
entry batch exceeds Integer.MAX_VALUE (2,147,483,647 bytes, ~2 GiB).
This results in negative values for JMX metrics (e.g., `shippedBytes`) and
incorrect throttling behavior.
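
The effect of the narrowing cast can be reproduced in isolation. The snippet
below is a minimal sketch, not HBase code: the class `NarrowingCastDemo` and
its `truncate` helper are hypothetical names used only to show how Java's
`(int)` cast treats a `long` byte count.

```java
// Illustrative only: not HBase code. Shows what a long-to-int cast does to a size.
final class NarrowingCastDemo {

    /** Mimics the problematic pattern: narrows a byte count to int. */
    static int truncate(long sizeInBytes) {
        return (int) sizeInBytes; // keeps only the low-order 32 bits, no exception
    }

    public static void main(String[] args) {
        long threeGiB = 3L * 1024 * 1024 * 1024; // 3,221,225,472 bytes

        // Values up to Integer.MAX_VALUE survive the cast unchanged.
        System.out.println(truncate(Integer.MAX_VALUE)); // 2147483647

        // Anything larger is silently corrupted; 3 GiB becomes negative.
        System.out.println(truncate(threeGiB)); // -1073741824

        // Math.toIntExact fails loudly instead of truncating.
        try {
            Math.toIntExact(threeGiB);
        } catch (ArithmeticException e) {
            System.out.println("toIntExact rejected the value");
        }
    }
}
```

Where an `int` cannot be avoided, `Math.toIntExact` at least turns the silent
corruption into a visible `ArithmeticException`.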

Observed Symptoms
-----------------
- The RegionServer JMX metric `shippedBytes` reports negative values.
- Replication throttling may malfunction since the bandwidth calculation 
receives a negative batch size.
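
Both symptoms can be illustrated with a toy model. This is a sketch under
stated assumptions: `SymptomsDemo`, its fields, and its per-cycle budget check
are hypothetical and deliberately simplified; they are not the actual logic of
`MetricsSource` or `ReplicationThrottler`.

```java
// Illustrative only: a toy counter and a toy per-cycle bandwidth check.
final class SymptomsDemo {
    private long shippedBytes;            // stand-in for the JMX counter
    private long pushedThisCycle;         // bytes pushed in the current cycle
    private final long bandwidthPerCycle; // allowed bytes per cycle

    SymptomsDemo(long bandwidthPerCycle) {
        this.bandwidthPerCycle = bandwidthPerCycle;
    }

    /** Records a shipped batch; returns true if the toy throttler would sleep. */
    boolean ship(int sizeInBytes) {
        shippedBytes += sizeInBytes; // a negative size moves the counter backwards
        boolean overBudget = pushedThisCycle + sizeInBytes > bandwidthPerCycle;
        pushedThisCycle += sizeInBytes;
        return overBudget;
    }

    long shippedBytes() {
        return shippedBytes;
    }

    public static void main(String[] args) {
        SymptomsDemo demo = new SymptomsDemo(1024L * 1024); // 1 MiB per cycle
        long realSize = 3L * 1024 * 1024 * 1024;            // a 3 GiB batch

        // The batch is far over budget, yet the truncated size hides that.
        boolean throttled = demo.ship((int) realSize);
        System.out.println(throttled);           // false
        System.out.println(demo.shippedBytes()); // -1073741824
    }
}
```

In this model a batch that is thousands of times over budget is never
throttled, and the counter ends up below zero after a single call.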

Root Cause
----------
In `ReplicationSourceShipper.shipEdits()`, the heap size of a WAL entry batch 
is cast from `long` to `int`:

    int currentSize = (int) entryBatch.getHeapSize();

`WALEntryBatch.getHeapSize()` returns a `long`. The narrowing cast to `int`
silently keeps only the low-order 32 bits, so any value above 2,147,483,647
bytes (~2 GiB) is corrupted, and values between 2 GiB and 4 GiB become
negative. The truncated value then propagates through:

1. `ReplicationSource.tryThrottle(int batchSize)`: the throttler receives a
   negative size and produces incorrect sleep intervals.
2. `MetricsSource.shipBatch(long batchSize, int sizeInBytes)`: the
   `shippedBytes` metric is incremented by a negative value.
3. `ReplicationEndpoint.ReplicateContext.size` (`int`): the endpoint receives
   the truncated size.
4. `ReplicationSourceWALReader.sizeOfStoreFilesIncludeBulkLoad()`: accumulates
   store file sizes into an `int`, which also overflows for bulk loads with
   large store files:

       totalStoreFilesSize =
           (int) (totalStoreFilesSize + stores.get(j).getStoreFileSizeBytes());
5. `ReplicationThrottler.getNextSleepInterval(int size)`: accepts an `int`, so
   it cannot represent sizes above Integer.MAX_VALUE.
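
The accumulation pattern in item 4 can be compared with a `long`-based
version. This is a minimal sketch, assuming the remedy is simply to widen the
accumulator; `StoreFileSizeSum`, `sumAsInt`, and `sumAsLong` are hypothetical
names, and the issue itself does not prescribe a specific patch.

```java
// Illustrative only: contrasts an int accumulator with a long accumulator.
final class StoreFileSizeSum {

    /** Mirrors the pattern quoted above: every partial sum is cast back to int. */
    static int sumAsInt(long... storeFileSizes) {
        int total = 0;
        for (long size : storeFileSizes) {
            total = (int) (total + size); // truncates once the sum passes 2^31 - 1
        }
        return total;
    }

    /** Widened variant: the running total stays a long throughout. */
    static long sumAsLong(long... storeFileSizes) {
        long total = 0L;
        for (long size : storeFileSizes) {
            total += size;
        }
        return total;
    }

    public static void main(String[] args) {
        // Two store files of 1.5 GB each: individually they fit in an int,
        // but their sum does not.
        System.out.println(sumAsInt(1_500_000_000L, 1_500_000_000L));  // -1294967296
        System.out.println(sumAsLong(1_500_000_000L, 1_500_000_000L)); // 3000000000
    }
}
```

Widening the accumulator only helps if every consumer listed above also
accepts a `long`; otherwise the truncation just moves to the next call site.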



> Replication shippedBytes metric overflows to negative due to int truncation 
> of batch size
> -----------------------------------------------------------------------------------------
>
>                 Key: HBASE-30234
>                 URL: https://issues.apache.org/jira/browse/HBASE-30234
>             Project: HBase
>          Issue Type: Bug
>          Components: metrics, Replication
>    Affects Versions: 2.4.5
>            Reporter: terrytlu
>            Priority: Minor



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
