[ 
https://issues.apache.org/jira/browse/HBASE-25774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17341447#comment-17341447
 ] 

Bharath Vissapragada commented on HBASE-25774:
----------------------------------------------

Nice find, couldn't think of this race when reviewing HBASE-25032, my bad. 
Agree that reverting it is best short term solution until we fix it cleanly. 
Coming to the fix, it seems like the issue here is the definition of what 
"online" means. I think we should split it into two states, something like 
INITIALIZED, REGISTERED. First state means that the RS has initialized (set 
during regionServerStartup()) but is waiting to be marked ready by master and 
the second one  means that it is ready to receive requests (set in first 
report). Certain procedures (like refresh peer etc) that are interested in the 
all servers while code paths like AM are interested the REGISTERED ones. We 
should audit the code for usages of ServerManager carefully to make sure all 
code paths are addressed, WDYT? FYI [~caroliney14]

> ServerManager.getOnlineServer may miss some region servers when refreshing 
> state in some procedure implementations
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-25774
>                 URL: https://issues.apache.org/jira/browse/HBASE-25774
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>            Reporter: Xiaolin Ha
>            Assignee: Duo Zhang
>            Priority: Critical
>             Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.3, 2.3.5.1
>
>
> [https://ci-hadoop.apache.org/job/HBase/job/HBase-PreCommit-GitHub-PR/job/PR-3025/9/testReport/org.apache.hadoop.hbase.replication/TestSyncReplicationStandbyKillRS/precommit_checks___yetus_jdk8_Hadoop3_checks______/]
> {code:java}
> ...[truncated 391170 chars]...
> 76d634:45149.replicationSource,1] regionserver.HRegionServer(2351): STOPPED: 
> Unexpected exception in RS:2;ece3af76d634:45149.replicationSource,1
> 2021-04-11T11:14:40,268 INFO  [RS:2;ece3af76d634:45149] 
> regionserver.HeapMemoryManager(218): Stopping
> 2021-04-11T11:14:40,268 INFO  [MemStoreFlusher.0] 
> regionserver.MemStoreFlusher$FlushHandler(384): MemStoreFlusher.0 exiting
> 2021-04-11T11:14:40,268 INFO  [RS:2;ece3af76d634:45149] 
> flush.RegionServerFlushTableProcedureManager(118): Stopping region server 
> flush procedure manager abruptly.
> 2021-04-11T11:14:40,270 INFO  [RS:2;ece3af76d634:45149] 
> snapshot.RegionServerSnapshotManager(136): Stopping 
> RegionServerSnapshotManager abruptly.
> 2021-04-11T11:14:40,270 INFO  [RS:2;ece3af76d634:45149] 
> regionserver.HRegionServer(1146): aborting server 
> ece3af76d634,45149,1618139661734
> 2021-04-11T11:14:40,272 ERROR 
> [ReplicationExecutor-0.replicationSource,1-ece3af76d634,44745,1618139625245] 
> regionserver.ReplicationSource(428): Unexpected exception in 
> ReplicationExecutor-0.replicationSource,1-ece3af76d634,44745,1618139625245 
> currentPath=null
> java.lang.IllegalStateException: Source should be active.
>       at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.initialize(ReplicationSource.java:547)
>  ~[classes/:?]
>       at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
> 2021-04-11T11:14:40,272 DEBUG 
> [ReplicationExecutor-0.replicationSource,1-ece3af76d634,44745,1618139625245] 
> regionserver.HRegionServer(2576): Abort already in progress. Ignoring the 
> current request with reason: Unexpected exception in 
> ReplicationExecutor-0.replicationSource,1-ece3af76d634,44745,1618139625245
> {code}
> Maybe it should use HBASE-24877 to avoid failure of the initialize of 
> ReplicationSource.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to