[ 
https://issues.apache.org/jira/browse/HBASE-21531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin resolved HBASE-21531.
--------------------------------------
    Resolution: Duplicate

> race between region report and region move causes master to kill RS
> -------------------------------------------------------------------
>
>                 Key: HBASE-21531
>                 URL: https://issues.apache.org/jira/browse/HBASE-21531
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Priority: Major
>
> In this case the delay between the ack from RS and the region report was 
> 1.5s, so I'm not sure what caused the race (network hiccup? unreported retry 
> by protobuf transport?) but in any case I don't see anything that prevents 
> this from happening in a normal case, with a narrowed time window. Any delay 
> (e.g. a GC pause on RS right after the report is built, and ack is sent for 
> the close) or retries expands the window.
> Master starts moving the region and the source RS acks by 21:51,206
> {noformat}
> Master:
> 2018-11-21 21:21:49,024 INFO  [master/6:17000.Chore.1] master.HMaster: 
> balance hri=<region hash>, source=<RS 1>,17020,1542754626176, destination=<RS 
> 2>,17020,1542863268158
> ...
> Server:
> 2018-11-21 21:21:49,394 INFO  [RS_CLOSE_REGION-regionserver/<RS 1>:17020-1] 
> handler.UnassignRegionHandler: Close <region hash>
> ...
> 2018-11-21 21:21:51,095 INFO  [RS_CLOSE_REGION-regionserver/<RS 1>:17020-1] 
> handler.UnassignRegionHandler: Closed <region hash>
> {noformat}
> By then the region is removed from onlineRegions, so the master proceeds.
> {noformat}
> 2018-11-21 21:21:51,206 INFO  [PEWorker-4] procedure2.ProcedureExecutor: 
> Finished subprocedure(s) of pid=667, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_CLOSED, hasLock=true; 
> TransitRegionStateProcedure table=<table>, region=<region hash>, REOPEN/MOVE; 
> resume parent processing.
> 2018-11-21 21:21:51,386 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=667 updating hbase:meta row=<region hash>, regionState=OPENING, 
> regionLocation=<RS 2>,17020,1542863268158
> {noformat}
> There are no obvious errors/delays that I see in RS log, and it doesn't log 
> starting to send the report.
> However, at 21:52.853 the report is processed that still contains this region.
> {noformat}
> 2018-11-21 21:21:52,853 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=48,queue=3,port=17000] 
> assignment.AssignmentManager: Killing <RS 1>,17020,1542754626176: 
> rit=OPENING, location=<RS 2>,17020,1542863268158, table=<table>, 
> region=<region hash> reported OPEN on server=<RS 1>,17020,1542754626176 but 
> state has otherwise.
> ***** ABORTING region server <RS 1>,17020,1542754626176: 
> org.apache.hadoop.hbase.YouAreDeadException: rit=OPENING, location=<RS 
> 2>,17020,1542863268158, table=<table>, region=<region hash> reported OPEN on 
> server=<RS 1>,17020,1542754626176 but state has otherwise.
> {noformat}
> RS shuts down in an orderly manner and it can be seen from the log that this 
> region is actually not present (there's no line indicating it's being closed, 
> unlike for other regions).
> I think there needs to be some sort of versioning for region operations 
> and/or in RS reports to allow master to account for concurrent operations and 
> avoid races. Probably per region with either a grace period or an additional 
> global version, so that master could avoid killing RS based on stale reports, 
> but still kill RS if it did retain an old version of the region state due to 
> some bug after acking a new version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to