[jira] [Commented] (HBASE-25774) TestSyncReplicationStandbyKillRS#testStandbyKillRegionServer is flaky

Duo Zhang (Jira) Wed, 05 May 2021 23:12:06 -0700


    [ https://issues.apache.org/jira/browse/HBASE-25774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340016#comment-17340016 ]


Duo Zhang commented on HBASE-25774:
-----------------------------------

OK, I think I found a possible race here.

We use AbstractPeerProcedure.refreshPeer to refresh the peer cache at 
region server side after we update the external peer storage at master side. In 
this method, we use ServerManager.getOnlineServersList to get the region 
servers which need to be refreshed.

Here comes the problem. At region server side, the initialization sequence is:
1. Call regionServerStartup.
2. Initialize the region server, including loading the external peer storage to 
fill the peer cache.
3. Periodically call regionServerReport.

At master side, we only add the region server to the online server list on the 
first regionServerReport call.

So in general, it is possible that a region server initializes its peer cache 
before we update the peer storage, and only calls regionServerReport to register 
itself in the online server list after we have fetched the list of region 
servers to refresh. That region server is never refreshed and keeps a stale 
peer cache.
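
To make the interleaving concrete, here is a minimal single-threaded sketch that 
replays it step by step. It is only a toy model: ToyMaster, ToyRegionServer, 
replayRace and the OLD_STATE/NEW_STATE values are hypothetical names made up for 
illustration, not HBase classes, and the real code path goes through procedures 
and RPCs rather than direct method calls.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: ToyMaster and ToyRegionServer are hypothetical stand-ins,
// not HBase classes. Method names mirror the calls named in the comment.
public class PeerRefreshRaceSketch {

  static class ToyRegionServer {
    String peerCache; // what this server believes the peer state is

    void loadPeerCache(ToyMaster master) {
      peerCache = master.peerStorage;
    }
  }

  static class ToyMaster {
    String peerStorage = "OLD_STATE"; // stands in for the external peer storage
    final List<ToyRegionServer> onlineServers = new ArrayList<>();

    void regionServerStartup(ToyRegionServer rs) {
      // Deliberately records nothing: the server is not "online" yet.
    }

    void regionServerReport(ToyRegionServer rs) {
      // The first report is what registers the server as online.
      if (!onlineServers.contains(rs)) {
        onlineServers.add(rs);
      }
    }

    List<ToyRegionServer> getOnlineServersList() {
      return new ArrayList<>(onlineServers); // snapshot at call time
    }

    void refreshPeer(List<ToyRegionServer> targets) {
      for (ToyRegionServer rs : targets) {
        rs.loadPeerCache(this);
      }
    }
  }

  /** Replays the problematic interleaving and returns the server's cached state. */
  static String replayRace() {
    ToyMaster master = new ToyMaster();
    ToyRegionServer rs = new ToyRegionServer();

    master.regionServerStartup(rs);      // RS step 1
    rs.loadPeerCache(master);            // RS step 2: caches the old state
    master.peerStorage = "NEW_STATE";    // master updates the peer storage
    List<ToyRegionServer> targets = master.getOnlineServersList(); // empty snapshot
    master.regionServerReport(rs);       // RS step 3: online now, but too late
    master.refreshPeer(targets);         // nobody in the snapshot, no refresh
    return rs.peerCache;
  }

  public static void main(String[] args) {
    System.out.println("cached peer state after race: " + replayRace());
  }
}
```

In this model the region server ends up with the old state in its cache even 
though the storage already holds the new one, because the snapshot of online 
servers was taken before its first report.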

This does not only affect sync replication peer state refresh, but also normal 
peer modification.

The fix is straightforward, I think: we need to keep a list of the region 
servers which have already called regionServerStartup, and also send 
RefreshPeerProcedure to these region servers.
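
A rough sketch of that idea, under the same kind of toy model, could look like 
the following. Everything here (ToyMaster, ToyRegionServer, startedServers, 
getRefreshTargets, replayWithFix) is a hypothetical name for illustration and 
not the actual patch; it only shows that tracking servers from 
regionServerStartup onwards closes the window.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model only: hypothetical stand-ins, not HBase classes or the real patch.
public class StartedServerTrackingSketch {

  static class ToyRegionServer {
    String peerCache; // what this server believes the peer state is

    void loadPeerCache(ToyMaster master) {
      peerCache = master.peerStorage;
    }
  }

  static class ToyMaster {
    String peerStorage = "OLD_STATE"; // stands in for the external peer storage
    // Servers that called regionServerStartup but have not reported yet.
    final Set<ToyRegionServer> startedServers = new LinkedHashSet<>();
    final Set<ToyRegionServer> onlineServers = new LinkedHashSet<>();

    void regionServerStartup(ToyRegionServer rs) {
      startedServers.add(rs); // remember the server from startup onwards
    }

    void regionServerReport(ToyRegionServer rs) {
      startedServers.remove(rs);
      onlineServers.add(rs);
    }

    /** Refresh targets are the online servers plus the started-only servers. */
    List<ToyRegionServer> getRefreshTargets() {
      Set<ToyRegionServer> targets = new LinkedHashSet<>(onlineServers);
      targets.addAll(startedServers);
      return new ArrayList<>(targets);
    }

    void refreshPeer(List<ToyRegionServer> targets) {
      for (ToyRegionServer rs : targets) {
        rs.loadPeerCache(this);
      }
    }
  }

  /** Replays the same interleaving as the race and returns the cached state. */
  static String replayWithFix() {
    ToyMaster master = new ToyMaster();
    ToyRegionServer rs = new ToyRegionServer();

    master.regionServerStartup(rs);      // RS step 1: now tracked by the master
    rs.loadPeerCache(master);            // RS step 2: caches the old state
    master.peerStorage = "NEW_STATE";    // master updates the peer storage
    List<ToyRegionServer> targets = master.getRefreshTargets(); // includes rs
    master.regionServerReport(rs);       // RS step 3: first report
    master.refreshPeer(targets);         // rs reloads and sees the new state
    return rs.peerCache;
  }

  public static void main(String[] args) {
    System.out.println("cached peer state with fix: " + replayWithFix());
  }
}
```

The sketch ignores what happens when a started server dies before its first 
report; the real change would need to clean such entries up.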


> TestSyncReplicationStandbyKillRS#testStandbyKillRegionServer is flaky
> ---------------------------------------------------------------------
>
>                 Key: HBASE-25774
>                 URL: https://issues.apache.org/jira/browse/HBASE-25774
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Xiaolin Ha
>            Assignee: Duo Zhang
>            Priority: Major
>
> [https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-3025/9/testReport/org.apache.hadoop.hbase.replication/TestSyncReplicationStandbyKillRS/precommit_checks___yetus_jdk8_Hadoop3_checks______/]
> {code:java}
> ...[truncated 391170 chars]...
> 76d634:45149.replicationSource,1] regionserver.HRegionServer(2351): STOPPED: 
> Unexpected exception in RS:2;ece3af76d634:45149.replicationSource,1
> 2021-04-11T11:14:40,268 INFO  [RS:2;ece3af76d634:45149] 
> regionserver.HeapMemoryManager(218): Stopping
> 2021-04-11T11:14:40,268 INFO  [MemStoreFlusher.0] 
> regionserver.MemStoreFlusher$FlushHandler(384): MemStoreFlusher.0 exiting
> 2021-04-11T11:14:40,268 INFO  [RS:2;ece3af76d634:45149] 
> flush.RegionServerFlushTableProcedureManager(118): Stopping region server 
> flush procedure manager abruptly.
> 2021-04-11T11:14:40,270 INFO  [RS:2;ece3af76d634:45149] 
> snapshot.RegionServerSnapshotManager(136): Stopping 
> RegionServerSnapshotManager abruptly.
> 2021-04-11T11:14:40,270 INFO  [RS:2;ece3af76d634:45149] 
> regionserver.HRegionServer(1146): aborting server 
> ece3af76d634,45149,1618139661734
> 2021-04-11T11:14:40,272 ERROR 
> [ReplicationExecutor-0.replicationSource,1-ece3af76d634,44745,1618139625245] 
> regionserver.ReplicationSource(428): Unexpected exception in 
> ReplicationExecutor-0.replicationSource,1-ece3af76d634,44745,1618139625245 
> currentPath=null
> java.lang.IllegalStateException: Source should be active.
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.initialize(ReplicationSource.java:547)
>  ~[classes/:?]
>       at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
> 2021-04-11T11:14:40,272 DEBUG 
> [ReplicationExecutor-0.replicationSource,1-ece3af76d634,44745,1618139625245] 
> regionserver.HRegionServer(2576): Abort already in progress. Ignoring the 
> current request with reason: Unexpected exception in 
> ReplicationExecutor-0.replicationSource,1-ece3af76d634,44745,1618139625245
> {code}
> Maybe it should use HBASE-24877 to avoid the failure when initializing 
> ReplicationSource.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
