[ 
https://issues.apache.org/jira/browse/HBASE-21421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671519#comment-16671519
 ] 

Duo Zhang commented on HBASE-21421:
-----------------------------------

Yes, I think this is possible, and not only because of network lag. 
reportRegionStateTransition and regionServerReport run in different threads, so 
there could be a race: regionServerReport takes the snapshot of all the regions 
on the RS, and before it actually sends the request to the master, 
reportRegionStateTransition finishes.

Then the question becomes: do we still need this check in 
regionServerReport, given that it can see an inconsistent state?
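
The interleaving above can be modeled with a small sketch. This is not HBase 
code: the class StaleReportSketch and its methods assign, reportClosed, 
conflicts and replay are hypothetical names made up for illustration, and the 
real check in the master is more involved. It only assumes that the report 
carries a snapshot of online regions taken before the close was reported.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the race; none of these names exist in HBase.
class StaleReportSketch {
    // Master's view: region name -> server the master believes hosts it.
    private final Map<String, String> assignments = new HashMap<>();

    void assign(String region, String server) {
        assignments.put(region, server);
    }

    // Models a successful reportRegionStateTransition for a close.
    void reportClosed(String region, String server) {
        assignments.remove(region, server);
    }

    // Models the check done for regionServerReport: true if the server
    // reports a region that the master has assigned to another server.
    boolean conflicts(String server, List<String> reportedOnline) {
        for (String region : reportedOnline) {
            String owner = assignments.get(region);
            if (owner != null && !owner.equals(server)) {
                return true;
            }
        }
        return false;
    }

    // Replays the interleaving; returns whether the master sees a conflict.
    static boolean replay(boolean sendStaleSnapshot) {
        StaleReportSketch master = new StaleReportSketch();
        master.assign("r1", "rs1");

        List<String> onlineOnRs1 = new ArrayList<>();
        onlineOnRs1.add("r1");

        // Report thread snapshots the online regions but has not sent yet.
        List<String> snapshot = new ArrayList<>(onlineOnRs1);

        // Meanwhile the close is reported and the master reassigns r1.
        onlineOnRs1.remove("r1");
        master.reportClosed("r1", "rs1");
        master.assign("r1", "rs2");

        // The report finally reaches the master.
        List<String> report = sendStaleSnapshot ? snapshot : onlineOnRs1;
        return master.conflicts("rs1", report);
    }

    public static void main(String[] args) {
        System.out.println("stale snapshot conflicts: " + replay(true));
        System.out.println("fresh snapshot conflicts: " + replay(false));
    }
}
```

In this model only the stale snapshot triggers the conflict, even though no 
network delay is involved, which is the case where the RS would be killed 
although its current state matches the master's.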

> Do not kill RS if reportOnlineRegions fails
> -------------------------------------------
>
>                 Key: HBASE-21421
>                 URL: https://issues.apache.org/jira/browse/HBASE-21421
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 2.1.1, 2.0.2
>            Reporter: Allan Yang
>            Assignee: Allan Yang
>            Priority: Major
>         Attachments: HBASE-21421.branch-2.0.001.patch
>
>
> In the periodic regionServerReport from the RS to the master, we call 
> master.getAssignmentManager().reportOnlineRegions() to make sure the RS has 
> the same state as the Master. If the RS holds a region which the master 
> thinks should be on another RS, the Master will kill the RS.
> But the regionServerReport could be lagging (due to the network or something 
> else), so it may not represent the current state of the RegionServer. 
> Besides, when onlining a region we call reportRegionStateTransition and 
> retry forever until it is successfully reported to the master. We can count 
> on reportRegionStateTransition calls.
> I have encountered cases where regions are closed on the RS and 
> reportRegionStateTransition to the master succeeds. But later, a lagging 
> regionServerReport tells the master the region is online on the RS (which it 
> is not at that moment; the call may have been generated some time ago and 
> delayed by the network somehow). The master then thinks the region should be 
> on another RS and kills the RS, which should not happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
