[jira] [Updated] (HBASE-21531) race between region report and region move causes master to kill RS

Sergey Shelukhin (JIRA) Thu, 29 Nov 2018 15:32:06 -0800


     [ 
https://issues.apache.org/jira/browse/HBASE-21531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sergey Shelukhin updated HBASE-21531:
-------------------------------------
    Description: 
In this case the delay between the ack from RS and the region report was 1.5s, 
so I'm not sure what caused the race (network hiccup? unreported retry by 
protobuf transport?) but in any case I don't see anything that prevents this 
from happening in a normal case, with a narrowed time window. Any delay (e.g. a 
GC pause on RS right after the report is built, and ack is sent for the close) 
or retries expands the window.


Master starts moving the region and the source RS acks by 21:51,206
{noformat}
Master:
2018-11-21 21:21:49,024 INFO  [master/6:17000.Chore.1] master.HMaster: balance 
hri=<region hash>, source=<RS 1>,17020,1542754626176, destination=<RS 
2>,17020,1542863268158
...
Server:
2018-11-21 21:21:49,394 INFO  [RS_CLOSE_REGION-regionserver/<RS 1>:17020-1] 
handler.UnassignRegionHandler: Close <region hash>
...
2018-11-21 21:21:51,095 INFO  [RS_CLOSE_REGION-regionserver/<RS 1>:17020-1] 
handler.UnassignRegionHandler: Closed <region hash>
{noformat}
By then the region is removed from onlineRegions, so the master proceeds.
{noformat}
2018-11-21 21:21:51,206 INFO  [PEWorker-4] procedure2.ProcedureExecutor: 
Finished subprocedure(s) of pid=667, 
state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_CLOSED, hasLock=true; 
TransitRegionStateProcedure table=dummy_table, region=<region hash>, 
REOPEN/MOVE; resume parent processing.
2018-11-21 21:21:51,386 INFO  [PEWorker-13] assignment.RegionStateStore: 
pid=667 updating hbase:meta row=<region hash>, regionState=OPENING, 
regionLocation=<RS 2>,17020,1542863268158
{noformat}

There are no obvious errors/delays that I see in RS log, and it doesn't log 
starting to send the report.
However, at 21:52.853 the report is processed that still contains this region.
{noformat}
2018-11-21 21:21:52,853 WARN  
[RpcServer.default.FPBQ.Fifo.handler=48,queue=3,port=17000] 
assignment.AssignmentManager: Killing <RS 1>,17020,1542754626176: rit=OPENING, 
location=<RS 2>,17020,1542863268158, table=dummy_table, region=<region hash> 
reported OPEN on server=<RS 1>,17020,1542754626176 but state has otherwise.
***** ABORTING region server <RS 1>,17020,1542754626176: 
org.apache.hadoop.hbase.YouAreDeadException: rit=OPENING, location=<RS 
2>,17020,1542863268158, table=dummy_table, region=<region hash> reported OPEN 
on server=<RS 1>,17020,1542754626176 but state has otherwise.
{noformat}

RS shuts down in an orderly manner and it can be seen from the log that this 
region is actually not present (there's no line indicating it's being closed, 
unlike for other regions).

I think there needs to be some sort of versioning for region operations and/or 
in RS reports to allow master to account for concurrent operations and avoid 
races. Probably per region with either a grace period or an additional global 
version, so that master could avoid killing RS based on stale reports, but 
still kill RS if it did retain an old version of the region state due to some 
bug after acking a new version.

  was:
In this case the delay between the ack from RS and the region report was 1.5s, 
so I'm not sure what caused the race (network hiccup? unreported retry by 
protobuf transport?) but in any case I don't see anything that prevents this 
from happening in a normal case, with a narrowed time window. Any delay (e.g. a 
GC pause on RS right after the report is built, and ack is sent for the close) 
or retries expands the window.


Master starts moving the region and the source RS acks by 21:51,206
{noformat}
Master:
2018-11-21 21:21:49,024 INFO  [master/6:17000.Chore.1] master.HMaster: balance 
hri=<region hash>, source=<RS 1>,17020,1542754626176, destination=<RS 
2>,17020,1542863268158
...
Server:
2018-11-21 21:21:49,394 INFO  [RS_CLOSE_REGION-regionserver/<RS 1>:17020-1] 
handler.UnassignRegionHandler: Close <region hash>
...
2018-11-21 21:21:51,095 INFO  [RS_CLOSE_REGION-regionserver/<RS 1>:17020-1] 
handler.UnassignRegionHandler: Closed <region hash>
{noformat}
By then the region is removed from onlineRegions, so the master proceeds.
{noformat}
2018-11-21 21:21:51,206 INFO  [PEWorker-4] procedure2.ProcedureExecutor: 
Finished subprocedure(s) of pid=667, 
state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_CLOSED, hasLock=true; 
TransitRegionStateProcedure table=dummy_table, region=<region hash>, 
REOPEN/MOVE; resume parent processing.
2018-11-21 21:21:51,386 INFO  [PEWorker-13] assignment.RegionStateStore: 
pid=667 updating hbase:meta row=<region hash>, regionState=OPENING, 
regionLocation=<RS 2>,17020,1542863268158
{noformat}

There are no obvious errors/delays that I see in RS log, and it doesn't log 
starting to send the report.
However, at 21:52.853 the report is processed that still contains this region.
{noformat}
2018-11-21 21:21:52,853 WARN  
[RpcServer.default.FPBQ.Fifo.handler=48,queue=3,port=17000] 
assignment.AssignmentManager: Killing <RS 1>,17020,1542754626176: rit=OPENING, 
location=<RS 2>,17020,1542863268158, table=dummy_table, region=<region hash> 
reported OPEN on server=<RS 1>,17020,1542754626176 but state has otherwise.
***** ABORTING region server <RS 1>,17020,1542754626176: 
org.apache.hadoop.hbase.YouAreDeadException: rit=OPENING, location=<RS 
2>,17020,1542863268158, table=dummy_table, region=<region hash> reported OPEN 
on server=<RS 1>,17020,1542754626176 but state has otherwise.
{noformat}

RS shuts down in an orderly manner and it can be seen from the log that this 
region is actually not present (there's no line indicating it's being closed, 
unlike for other regions).

I think there needs to be some sort of versioning for region operations and/or 
in RS reports to allow master to account for concurrent operations and avoid 
races. Probably per region with either a grace period or an additional global 
version, so that master could avoid killing RS based on stale reports, but 
still kill RS if it did retain an old version of the region state due to some 
bug after acking a new version state.


> race between region report and region move causes master to kill RS
> -------------------------------------------------------------------
>
>                 Key: HBASE-21531
>                 URL: https://issues.apache.org/jira/browse/HBASE-21531
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Priority: Major
>
> In this case the delay between the ack from RS and the region report was 
> 1.5s, so I'm not sure what caused the race (network hiccup? unreported retry 
> by protobuf transport?) but in any case I don't see anything that prevents 
> this from happening in a normal case, with a narrowed time window. Any delay 
> (e.g. a GC pause on RS right after the report is built, and ack is sent for 
> the close) or retries expands the window.
> Master starts moving the region and the source RS acks by 21:51,206
> {noformat}
> Master:
> 2018-11-21 21:21:49,024 INFO  [master/6:17000.Chore.1] master.HMaster: 
> balance hri=<region hash>, source=<RS 1>,17020,1542754626176, destination=<RS 
> 2>,17020,1542863268158
> ...
> Server:
> 2018-11-21 21:21:49,394 INFO  [RS_CLOSE_REGION-regionserver/<RS 1>:17020-1] 
> handler.UnassignRegionHandler: Close <region hash>
> ...
> 2018-11-21 21:21:51,095 INFO  [RS_CLOSE_REGION-regionserver/<RS 1>:17020-1] 
> handler.UnassignRegionHandler: Closed <region hash>
> {noformat}
> By then the region is removed from onlineRegions, so the master proceeds.
> {noformat}
> 2018-11-21 21:21:51,206 INFO  [PEWorker-4] procedure2.ProcedureExecutor: 
> Finished subprocedure(s) of pid=667, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_CLOSED, hasLock=true; 
> TransitRegionStateProcedure table=dummy_table, region=<region hash>, 
> REOPEN/MOVE; resume parent processing.
> 2018-11-21 21:21:51,386 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=667 updating hbase:meta row=<region hash>, regionState=OPENING, 
> regionLocation=<RS 2>,17020,1542863268158
> {noformat}
> There are no obvious errors/delays that I see in RS log, and it doesn't log 
> starting to send the report.
> However, at 21:52.853 the report is processed that still contains this region.
> {noformat}
> 2018-11-21 21:21:52,853 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=48,queue=3,port=17000] 
> assignment.AssignmentManager: Killing <RS 1>,17020,1542754626176: 
> rit=OPENING, location=<RS 2>,17020,1542863268158, table=dummy_table, 
> region=<region hash> reported OPEN on server=<RS 1>,17020,1542754626176 but 
> state has otherwise.
> ***** ABORTING region server <RS 1>,17020,1542754626176: 
> org.apache.hadoop.hbase.YouAreDeadException: rit=OPENING, location=<RS 
> 2>,17020,1542863268158, table=dummy_table, region=<region hash> reported OPEN 
> on server=<RS 1>,17020,1542754626176 but state has otherwise.
> {noformat}
> RS shuts down in an orderly manner and it can be seen from the log that this 
> region is actually not present (there's no line indicating it's being closed, 
> unlike for other regions).
> I think there needs to be some sort of versioning for region operations 
> and/or in RS reports to allow master to account for concurrent operations and 
> avoid races. Probably per region with either a grace period or an additional 
> global version, so that master could avoid killing RS based on stale reports, 
> but still kill RS if it did retain an old version of the region state due to 
> some bug after acking a new version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HBASE-21531) race between region report and region move causes master to kill RS

Reply via email to