[ https://issues.apache.org/jira/browse/HBASE-28620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
MisterWang updated HBASE-28620:
-------------------------------
Description:
When a peer changes, replication closes the reader and shipper that were
created earlier. However, the shipper may still be alive after the specified
timeout (it was interrupted, but it did not stop properly). In that case the
existing code simply returns without releasing the buffer quota ("Not cleaning
buffer usage").
In one production case at my company, the quota filled up because it was not
released in time, so the WAL reader could not continue reading new data and
replication built up a backlog.
The log is as follows:
2024-05-20 20:00:00,796 WARN
[RpcServer.default.FPRWQ.Fifo.read.handler=70,queue=1,port=16020]
regionserver.ReplicationSourceShipper: Shipper clearWALEntryBatch method timed
out whilst waiting reader/shipper thread to stop. Not cleaning buffer usage.
Shipper alive: peer1; Reader alive: false
2024-05-20 20:00:01,351 WARN peer=peer1, can't read more edits from WAL as
buffer usage 268435456B exceeds limit 268435456B
was:
When peer changes, replication closes the reader and shipper created earlier.
However, after the specified timeout, the shipper still does not automatically
close. The existing code simply returns without releasing quota. Not cleaning
buffer usage.
In one practice of my company, in this case, the quota was full because it was
not released in time, so wal reader could not continue read new data and
replication had a backlog.
The log is as follows:
2024-05-20 20:00:00,796 WARN
[RpcServer.default.FPRWQ.Fifo.read.handler=70,queue=1,port=16020]
regionserver.ReplicationSourceShipper: Shipper clearWALEntryBatch method timed
out whilst waiting reader/shipper thread to stop. Not cleaning buffer usage.
Shipper alive: peer1; Reader alive: false
2024-05-20 20:00:01,351 WARN peer=peer1, can't read more edits from WAL as
buffer usage 268435456B exceeds limit 268435456B
> replication quota leak when peer changes
> ----------------------------------------
>
> Key: HBASE-28620
> URL: https://issues.apache.org/jira/browse/HBASE-28620
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Reporter: MisterWang
> Priority: Critical
> Labels: pull-request-available
>
> When a peer changes, replication closes the reader and shipper that were
> created earlier. However, the shipper may still be alive after the specified
> timeout (it was interrupted, but it did not stop properly). In that case the
> existing code simply returns without releasing the buffer quota ("Not
> cleaning buffer usage").
> In one production case at my company, the quota filled up because it was not
> released in time, so the WAL reader could not continue reading new data and
> replication built up a backlog.
>
> The log is as follows:
> 2024-05-20 20:00:00,796 WARN
> [RpcServer.default.FPRWQ.Fifo.read.handler=70,queue=1,port=16020]
> regionserver.ReplicationSourceShipper: Shipper clearWALEntryBatch method
> timed out whilst waiting reader/shipper thread to stop. Not cleaning buffer
> usage. Shipper alive: peer1; Reader alive: false
> 2024-05-20 20:00:01,351 WARN peer=peer1, can't read more edits from WAL as
> buffer usage 268435456B exceeds limit 268435456B
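
To illustrate the failure mode described above, here is a minimal Java sketch.
It is not HBase code: QuotaLeakSketch, Source, tryReserve, releasePending,
terminateLeaky, terminateReleasing and stuckShipper are hypothetical names
made up for this example, and only the 268435456-byte limit is taken from the
log. The "releasing" variant is just one possible way to avoid the leak, not
necessarily what the linked pull request does.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical, simplified model of the quota leak; not HBase code. */
public class QuotaLeakSketch {
    /** Same value as the limit in the log above (256 MiB). */
    static final long LIMIT_BYTES = 268435456L;

    /** Buffer usage shared by all replication sources in this model. */
    static final AtomicLong globalUsage = new AtomicLong();

    /** One replication source holding quota for batches it has not shipped yet. */
    static final class Source {
        final AtomicLong pendingBytes = new AtomicLong();

        /** Reader side: refuse to read once usage has reached the limit. */
        boolean tryReserve(long bytes) {
            if (globalUsage.get() >= LIMIT_BYTES) {
                return false; // corresponds to "can't read more edits from WAL"
            }
            globalUsage.addAndGet(bytes);
            pendingBytes.addAndGet(bytes);
            return true;
        }

        /** Give back whatever is still reserved; a second call releases nothing. */
        void releasePending() {
            globalUsage.addAndGet(-pendingBytes.getAndSet(0));
        }

        /** Leaky variant: returns early when the shipper outlives the timeout. */
        void terminateLeaky(Thread shipper, long timeoutMs) throws InterruptedException {
            shipper.interrupt();
            shipper.join(timeoutMs);
            if (shipper.isAlive()) {
                return; // "Not cleaning buffer usage": the quota stays reserved
            }
            releasePending();
        }

        /** Assumed alternative: release the quota even if the shipper is still alive. */
        void terminateReleasing(Thread shipper, long timeoutMs) throws InterruptedException {
            shipper.interrupt();
            shipper.join(timeoutMs);
            releasePending();
        }
    }

    /** Stand-in for a shipper that is interrupted but does not stop. */
    static Thread stuckShipper(CountDownLatch stop) {
        Thread t = new Thread(() -> {
            while (stop.getCount() > 0) {
                try {
                    stop.await();
                } catch (InterruptedException ignored) {
                    // Swallow the interrupt and keep waiting.
                }
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch stop = new CountDownLatch(1);
        Source oldSource = new Source();
        oldSource.tryReserve(LIMIT_BYTES);
        Thread shipper = stuckShipper(stop);

        oldSource.terminateLeaky(shipper, 50);
        System.out.println("usage after leaky terminate: " + globalUsage.get());
        System.out.println("new reader can reserve: " + new Source().tryReserve(1024));

        oldSource.terminateReleasing(shipper, 50);
        System.out.println("usage after releasing terminate: " + globalUsage.get());
        System.out.println("new reader can reserve: " + new Source().tryReserve(1024));
        stop.countDown();
    }
}
```

With the leaky variant the usage should stay at the full limit and the new
reader should be refused; after the releasing variant the usage should drop to
0 and the new reader can reserve again. Tracking the reserved bytes per source
and clearing them with getAndSet(0) is a deliberate choice in this sketch, so
that a late release by the old shipper cannot subtract the same bytes twice.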