[ 
https://issues.apache.org/jira/browse/HBASE-22365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833376#comment-16833376
 ] 

Duo Zhang commented on HBASE-22365:
-----------------------------------

I think there are two ways to fix the problem.

1. Add a check when updating the region state in RegionRemoteProcedureBase, 
under the read lock of ServerStateNode. If the server is dead, we go to the 
CRASH state. The advantage is that it does not change the logic much, but the 
problem is that it may introduce a deadlock when the meta region is also on 
the target (dead) region server. Fixing that deadlock would take more work 
and make the code more complicated.

2. Change the in-memory state right after we persist the procedure state in 
reportRegionStateTransition, and only update the state in meta later, when the 
procedure is woken up to run. The advantage is that it is semantically 
cleaner, as the region server will consider the state transition successful if 
reportRegionStateTransition returns normally. But the problem is that there 
will be more side effects because the logic changes a lot, and we also need to 
keep the same behavior when the master restarts.
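
To make the difference between the two options concrete, here is a rough sketch 
in plain Java. It is a toy model only: ServerNode, RegionNode, RegionState, and 
the three methods are hypothetical names made up for illustration, not the 
actual HBase classes or signatures, and the procedure store and meta table are 
reduced to simple fields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class TransitionSketch {
    enum RegionState { OPENING, OPEN, CRASHED }

    // Hypothetical stand-in for ServerStateNode: a liveness flag behind a read/write lock.
    static class ServerNode {
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        boolean dead; // guarded by lock

        void markDead() {
            lock.writeLock().lock();
            try {
                dead = true;
            } finally {
                lock.writeLock().unlock();
            }
        }
    }

    // Hypothetical stand-in for the master's view of one region.
    static class RegionNode {
        RegionState inMemory = RegionState.OPENING;
        RegionState inMeta = RegionState.OPENING;
        final List<String> procedureLog = new ArrayList<>(); // toy procedure store
    }

    // Option 1: re-check server liveness under the read lock before applying the update.
    static RegionState updateWithLivenessCheck(ServerNode server, RegionNode region) {
        server.lock.readLock().lock();
        try {
            region.inMemory = server.dead ? RegionState.CRASHED : RegionState.OPEN;
            return region.inMemory;
        } finally {
            server.lock.readLock().unlock();
        }
    }

    // Option 2, step 1: persist the procedure state, then change the in-memory state at once.
    static void reportTransition(RegionNode region) {
        region.procedureLog.add("REPORT_SUCCEED");
        region.inMemory = RegionState.OPEN;
    }

    // Option 2, step 2: only when the procedure is woken up does the state reach meta.
    static void onProcedureWake(RegionNode region) {
        region.inMeta = region.inMemory;
    }

    public static void main(String[] args) {
        ServerNode server = new ServerNode();
        server.markDead();
        System.out.println("option 1: " + updateWithLivenessCheck(server, new RegionNode()));

        RegionNode region = new RegionNode();
        reportTransition(region);
        System.out.println("option 2 before wake: " + region.inMemory + "/" + region.inMeta);
        onProcedureWake(region);
        System.out.println("option 2 after wake: " + region.inMemory + "/" + region.inMeta);
    }
}
```

In the sketch, option 2 leaves a window where the in-memory state and meta 
disagree, so anything that rebuilds state after a master restart would have to 
replay the persisted procedure state to close that gap. The sketch does not 
model the meta update under the lock in option 1, which is where the deadlock 
concern comes from.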

I prefer solution 2, but anyway, let me provide a UT first.

> Region may be opened in two RegionServers
> -----------------------------------------
>
>                 Key: HBASE-22365
>                 URL: https://issues.apache.org/jira/browse/HBASE-22365
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 3.0.0, 2.2.0, 2.3.0
>            Reporter: Guanghao Zhang
>            Assignee: Duo Zhang
>            Priority: Blocker
>
> Found this problem when running ITBLL with our internal branch, which is 
> based on branch-2.2, so marking this as a blocker for 2.2.0. A region 
> 7ebdca9cd09e26074749b546586e2156 was moved from RS-st99 to RS-st98 and the 
> TRSP succeeded. Meanwhile, RS-st99 crashed and a new SCP was scheduled for 
> RS-st99. The SCP initialized subprocedures for 
> 7ebdca9cd09e26074749b546586e2156, too. Then 
> 7ebdca9cd09e26074749b546586e2156 was assigned to two RegionServers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
