[
https://issues.apache.org/jira/browse/HBASE-25774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340016#comment-17340016
]
Duo Zhang commented on HBASE-25774:
-----------------------------------
OK, I think I found a possible race here.
We use AbstractPeerProcedure.refreshPeer to refresh the peer cache on the
region server side after we update the external peer storage on the master
side. In this method, we use ServerManager.getOnlineServersList to get the
region servers which need to be refreshed.
Here comes the problem. On the region server side, the initialization sequence
is:
1. Call regionServerStartup.
2. Initialize the region server, including loading the external peer storage
to fill the peer cache.
3. Periodically call regionServerReport.
On the master side, we only add the region server to the online server list
on its first regionServerReport call.
So in general, it is possible that a region server initializes its peer cache
before we update the peer storage, and then calls regionServerReport to
register itself in the online server list only after we have fetched the
region servers to refresh. That region server is never sent a refresh and
keeps a stale peer cache.
This affects not only sync replication peer state refresh, but also normal
peer modification.
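
To make the interleaving concrete, here is a minimal toy replay of the race
described above. It is only a sketch: ToyMaster, ToyRegionServer and the
"OLD"/"NEW" state strings are hypothetical stand-ins, not real HBase classes,
and only the method names regionServerReport and getOnlineServersList are
borrowed from the description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the race; none of these types are real HBase classes. */
final class PeerRefreshRaceSketch {

  /** Hypothetical stand-in for a region server holding a peer cache. */
  static final class ToyRegionServer {
    final String name;
    String cachedPeerState; // what this server believes the peer state is

    ToyRegionServer(String name) {
      this.name = name;
    }
  }

  /** Hypothetical stand-in for the master. */
  static final class ToyMaster {
    String peerStorage = "OLD";
    private final Map<String, ToyRegionServer> online = new LinkedHashMap<>();

    // In this model a server only becomes "online" when it first reports.
    void regionServerReport(ToyRegionServer rs) {
      online.put(rs.name, rs);
    }

    List<ToyRegionServer> getOnlineServersList() {
      return new ArrayList<>(online.values());
    }
  }

  /** Replays the problematic ordering; returns the late server's cached state. */
  static String replayRace() {
    ToyMaster master = new ToyMaster();
    ToyRegionServer rs = new ToyRegionServer("rs-late");

    // Region server steps 1-2: start up and fill the peer cache from storage.
    rs.cachedPeerState = master.peerStorage;
    // Master updates the peer storage.
    master.peerStorage = "NEW";
    // Master takes its snapshot of servers to refresh; rs is not in it yet.
    List<ToyRegionServer> targets = master.getOnlineServersList();
    // Region server step 3: the first report registers it as online, too late.
    master.regionServerReport(rs);
    // Master refreshes only the servers in its earlier snapshot.
    for (ToyRegionServer target : targets) {
      target.cachedPeerState = master.peerStorage;
    }
    return rs.cachedPeerState;
  }
}
```

In this toy ordering the late server ends up online while still holding the
old peer state, which is the stale cache described above.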
The fix is straightforward, I think: we need to store the list of region
servers which have already called regionServerStartup, and also send
RefreshPeerProcedure to these region servers.
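
One way the proposed bookkeeping could look is sketched below. This is an
assumption-laden illustration, not the actual patch: RefreshTargetsSketch and
its method names are hypothetical, and servers are identified by plain strings
rather than HBase's own server name type.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical bookkeeping for refresh targets; not an HBase class. */
final class RefreshTargetsSketch {
  // Servers that have called regionServerStartup but not yet reported.
  private final Set<String> startedServers = new LinkedHashSet<>();
  // Servers that have made their first regionServerReport call.
  private final Set<String> onlineServers = new LinkedHashSet<>();

  synchronized void onRegionServerStartup(String serverName) {
    startedServers.add(serverName);
  }

  synchronized void onFirstRegionServerReport(String serverName) {
    startedServers.remove(serverName);
    onlineServers.add(serverName);
  }

  /** Union of online and merely started servers: everyone who may hold a peer cache. */
  synchronized Set<String> refreshTargets() {
    Set<String> targets = new LinkedHashSet<>(onlineServers);
    targets.addAll(startedServers);
    return targets;
  }
}
```

Because a server is recorded as soon as it calls regionServerStartup, a refresh
computed from refreshTargets() also covers servers that may have loaded the
peer cache but have not reported yet.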
> TestSyncReplicationStandbyKillRS#testStandbyKillRegionServer is flaky
> ---------------------------------------------------------------------
>
> Key: HBASE-25774
> URL: https://issues.apache.org/jira/browse/HBASE-25774
> Project: HBase
> Issue Type: Improvement
> Reporter: Xiaolin Ha
> Assignee: Duo Zhang
> Priority: Major
>
> [https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-3025/9/testReport/org.apache.hadoop.hbase.replication/TestSyncReplicationStandbyKillRS/precommit_checks___yetus_jdk8_Hadoop3_checks______/]
> {code:java}
> ...[truncated 391170 chars]...
> 76d634:45149.replicationSource,1] regionserver.HRegionServer(2351): STOPPED:
> Unexpected exception in RS:2;ece3af76d634:45149.replicationSource,1
> 2021-04-11T11:14:40,268 INFO [RS:2;ece3af76d634:45149]
> regionserver.HeapMemoryManager(218): Stopping
> 2021-04-11T11:14:40,268 INFO [MemStoreFlusher.0]
> regionserver.MemStoreFlusher$FlushHandler(384): MemStoreFlusher.0 exiting
> 2021-04-11T11:14:40,268 INFO [RS:2;ece3af76d634:45149]
> flush.RegionServerFlushTableProcedureManager(118): Stopping region server
> flush procedure manager abruptly.
> 2021-04-11T11:14:40,270 INFO [RS:2;ece3af76d634:45149]
> snapshot.RegionServerSnapshotManager(136): Stopping
> RegionServerSnapshotManager abruptly.
> 2021-04-11T11:14:40,270 INFO [RS:2;ece3af76d634:45149]
> regionserver.HRegionServer(1146): aborting server
> ece3af76d634,45149,1618139661734
> 2021-04-11T11:14:40,272 ERROR
> [ReplicationExecutor-0.replicationSource,1-ece3af76d634,44745,1618139625245]
> regionserver.ReplicationSource(428): Unexpected exception in
> ReplicationExecutor-0.replicationSource,1-ece3af76d634,44745,1618139625245
> currentPath=null
> java.lang.IllegalStateException: Source should be active.
> at
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.initialize(ReplicationSource.java:547)
> ~[classes/:?]
> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
> 2021-04-11T11:14:40,272 DEBUG
> [ReplicationExecutor-0.replicationSource,1-ece3af76d634,44745,1618139625245]
> regionserver.HRegionServer(2576): Abort already in progress. Ignoring the
> current request with reason: Unexpected exception in
> ReplicationExecutor-0.replicationSource,1-ece3af76d634,44745,1618139625245
> {code}
> Maybe it should use HBASE-24877 to avoid failures when initializing
> ReplicationSource.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)