[
https://issues.apache.org/jira/browse/HBASE-21531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sergey Shelukhin resolved HBASE-21531.
--------------------------------------
Resolution: Duplicate
> race between region report and region move causes master to kill RS
> -------------------------------------------------------------------
>
> Key: HBASE-21531
> URL: https://issues.apache.org/jira/browse/HBASE-21531
> Project: HBase
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Priority: Major
>
> In this case the delay between the ack from RS and the region report was
> 1.5s, so I'm not sure what caused the race (network hiccup? unreported retry
> by protobuf transport?) but in any case I don't see anything that prevents
> this from happening in a normal case, with a narrowed time window. Any delay
> (e.g. a GC pause on RS right after the report is built, and ack is sent for
> the close) or retries expands the window.
> Master starts moving the region and the source RS acks by 21:51,206
> {noformat}
> Master:
> 2018-11-21 21:21:49,024 INFO [master/6:17000.Chore.1] master.HMaster:
> balance hri=<region hash>, source=<RS 1>,17020,1542754626176, destination=<RS
> 2>,17020,1542863268158
> ...
> Server:
> 2018-11-21 21:21:49,394 INFO [RS_CLOSE_REGION-regionserver/<RS 1>:17020-1]
> handler.UnassignRegionHandler: Close <region hash>
> ...
> 2018-11-21 21:21:51,095 INFO [RS_CLOSE_REGION-regionserver/<RS 1>:17020-1]
> handler.UnassignRegionHandler: Closed <region hash>
> {noformat}
> By then the region is removed from onlineRegions, so the master proceeds.
> {noformat}
> 2018-11-21 21:21:51,206 INFO [PEWorker-4] procedure2.ProcedureExecutor:
> Finished subprocedure(s) of pid=667,
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_CLOSED, hasLock=true;
> TransitRegionStateProcedure table=<table>, region=<region hash>, REOPEN/MOVE;
> resume parent processing.
> 2018-11-21 21:21:51,386 INFO [PEWorker-13] assignment.RegionStateStore:
> pid=667 updating hbase:meta row=<region hash>, regionState=OPENING,
> regionLocation=<RS 2>,17020,1542863268158
> {noformat}
> There are no obvious errors/delays that I see in RS log, and it doesn't log
> starting to send the report.
> However, at 21:52.853 the report is processed that still contains this region.
> {noformat}
> 2018-11-21 21:21:52,853 WARN
> [RpcServer.default.FPBQ.Fifo.handler=48,queue=3,port=17000]
> assignment.AssignmentManager: Killing <RS 1>,17020,1542754626176:
> rit=OPENING, location=<RS 2>,17020,1542863268158, table=<table>,
> region=<region hash> reported OPEN on server=<RS 1>,17020,1542754626176 but
> state has otherwise.
> ***** ABORTING region server <RS 1>,17020,1542754626176:
> org.apache.hadoop.hbase.YouAreDeadException: rit=OPENING, location=<RS
> 2>,17020,1542863268158, table=<table>, region=<region hash> reported OPEN on
> server=<RS 1>,17020,1542754626176 but state has otherwise.
> {noformat}
> RS shuts down in an orderly manner and it can be seen from the log that this
> region is actually not present (there's no line indicating it's being closed,
> unlike for other regions).
> I think there needs to be some sort of versioning for region operations
> and/or in RS reports to allow master to account for concurrent operations and
> avoid races. Probably per region with either a grace period or an additional
> global version, so that master could avoid killing RS based on stale reports,
> but still kill RS if it did retain an old version of the region state due to
> some bug after acking a new version.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)