[
https://issues.apache.org/jira/browse/HBASE-25774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17341447#comment-17341447
]
Bharath Vissapragada commented on HBASE-25774:
----------------------------------------------
Nice find. I didn't think of this race when reviewing HBASE-25032, my bad.
I agree that reverting it is the best short-term solution until we fix it
cleanly.
Coming to the fix, it seems the issue here is the definition of what
"online" means. I think we should split it into two states, something like
INITIALIZED and REGISTERED. The first state means that the RS has initialized
(set during regionServerStartup()) but is waiting to be marked ready by the
master, and the second means that it is ready to receive requests (set in the
first report). Certain procedures (like refresh peer) are interested in all
the servers, while code paths like AM are interested only in the REGISTERED
ones. We should audit the code for usages of ServerManager carefully to make
sure all code paths are addressed, WDYT? FYI [~caroliney14]
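
To make the two-state idea concrete, here is a rough sketch. Everything in it
is hypothetical: ServerStateTracker, markInitialized, markRegistered,
getAllServers and getRegisteredServers are illustrative names, not the
existing ServerManager API, and server names are plain strings to keep the
example self-contained.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

/** Hypothetical sketch only; this is not the real HBase ServerManager. */
class ServerStateTracker {
  /** The two states proposed above. */
  enum ServerState { INITIALIZED, REGISTERED }

  private final Map<String, ServerState> servers = new ConcurrentHashMap<>();

  /** Would be called at region server startup; keeps any existing state. */
  void markInitialized(String serverName) {
    servers.putIfAbsent(serverName, ServerState.INITIALIZED);
  }

  /**
   * Would be called on the first report from the region server.
   * Returns false if the server was never marked INITIALIZED.
   */
  boolean markRegistered(String serverName) {
    return servers.computeIfPresent(serverName, (k, v) -> ServerState.REGISTERED) != null;
  }

  /** For callers such as refresh-peer procedures that need every known server. */
  List<String> getAllServers() {
    return servers.keySet().stream().sorted().collect(Collectors.toList());
  }

  /** For callers such as the AM that need only servers ready for requests. */
  List<String> getRegisteredServers() {
    return servers.entrySet().stream()
        .filter(e -> e.getValue() == ServerState.REGISTERED)
        .map(Map.Entry::getKey)
        .sorted()
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    ServerStateTracker tracker = new ServerStateTracker();
    tracker.markInitialized("rs-a"); // hypothetical server names
    tracker.markInitialized("rs-b");
    tracker.markRegistered("rs-a");
    System.out.println("all: " + tracker.getAllServers());
    System.out.println("registered: " + tracker.getRegisteredServers());
  }
}
```

The point of two separate accessors is that each call site has to choose
explicitly which set of servers it wants, which is also what the audit of
ServerManager usages would need to decide case by case.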
> ServerManager.getOnlineServer may miss some region servers when refreshing
> state in some procedure implementations
> ------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-25774
> URL: https://issues.apache.org/jira/browse/HBASE-25774
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Reporter: Xiaolin Ha
> Assignee: Duo Zhang
> Priority: Critical
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.3, 2.3.5.1
>
>
> [https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-3025/9/testReport/org.apache.hadoop.hbase.replication/TestSyncReplicationStandbyKillRS/precommit_checks___yetus_jdk8_Hadoop3_checks______/]
> {code:java}
> ...[truncated 391170 chars]...
> 76d634:45149.replicationSource,1] regionserver.HRegionServer(2351): STOPPED:
> Unexpected exception in RS:2;ece3af76d634:45149.replicationSource,1
> 2021-04-11T11:14:40,268 INFO [RS:2;ece3af76d634:45149]
> regionserver.HeapMemoryManager(218): Stopping
> 2021-04-11T11:14:40,268 INFO [MemStoreFlusher.0]
> regionserver.MemStoreFlusher$FlushHandler(384): MemStoreFlusher.0 exiting
> 2021-04-11T11:14:40,268 INFO [RS:2;ece3af76d634:45149]
> flush.RegionServerFlushTableProcedureManager(118): Stopping region server
> flush procedure manager abruptly.
> 2021-04-11T11:14:40,270 INFO [RS:2;ece3af76d634:45149]
> snapshot.RegionServerSnapshotManager(136): Stopping
> RegionServerSnapshotManager abruptly.
> 2021-04-11T11:14:40,270 INFO [RS:2;ece3af76d634:45149]
> regionserver.HRegionServer(1146): aborting server
> ece3af76d634,45149,1618139661734
> 2021-04-11T11:14:40,272 ERROR
> [ReplicationExecutor-0.replicationSource,1-ece3af76d634,44745,1618139625245]
> regionserver.ReplicationSource(428): Unexpected exception in
> ReplicationExecutor-0.replicationSource,1-ece3af76d634,44745,1618139625245
> currentPath=null
> java.lang.IllegalStateException: Source should be active.
> at
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.initialize(ReplicationSource.java:547)
> ~[classes/:?]
> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
> 2021-04-11T11:14:40,272 DEBUG
> [ReplicationExecutor-0.replicationSource,1-ece3af76d634,44745,1618139625245]
> regionserver.HRegionServer(2576): Abort already in progress. Ignoring the
> current request with reason: Unexpected exception in
> ReplicationExecutor-0.replicationSource,1-ece3af76d634,44745,1618139625245
> {code}
> Maybe it should use HBASE-24877 to avoid the failure when initializing
> ReplicationSource.
>