[ 
https://issues.apache.org/jira/browse/HBASE-21811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Shelukhin updated HBASE-21811:
-------------------------------------
    Description: 
Looks like the region server responses are being processed incorrectly in 
places, allowing the region to be opened on two servers.
* The region server report handling in procedures should check which server is 
actually reporting (see the sketch after this list).
* Also, although I didn't check (and it isn't implicated in this bug), the RS 
must check in OPEN that it is actually the RS the master sent the open to 
(w.r.t. the start timestamp/start code).
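
For the first point, roughly the kind of ownership check I mean, as a minimal 
sketch with a hypothetical helper (not the actual TransitRegionStateProcedure 
code):
{code:java}
import java.io.IOException;
import org.apache.hadoop.hbase.ServerName;

// Hypothetical helper: reject a region transition report unless it comes from
// the server the region is currently assigned to. ServerName equality includes
// the start code, so a report from a restarted instance on the same host:port
// is rejected as well.
public final class ReportCheckSketch {
  static void checkReportingServer(ServerName currentLocation, ServerName reportingServer)
      throws IOException {
    if (currentLocation == null || !currentLocation.equals(reportingServer)) {
      throw new IOException("Rejecting report from " + reportingServer
          + "; region is currently assigned to " + currentLocation);
    }
  }
}
{code}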

This was previously "mitigated" by the master killing the RS that sent an 
incorrect report, but due to race conditions between reports and assignment 
the kill was replaced with a warning, so now this condition persists.
Regardless, the kill approach is not a good fix, because there is still a 
window during which a region can be open on two servers.


A region is being opened by server_48c. The server dies, and we process the 
retry correctly (retry=3 because two previous, similar open failures were 
processed correctly).
We now start opening it on server_1aa.
{noformat}
2019-01-28 18:12:09,862 INFO  [KeepAlivePEWorker-104] 
assignment.RegionStateStore: pid=4915 updating hbase:meta 
row=8be2a423b16471b9417f0f7de04281c6, regionState=ABNORMALLY_CLOSED
2019-01-28 18:12:09,862 INFO  [KeepAlivePEWorker-104] 
procedure.ServerCrashProcedure: pid=11944, state=RUNNABLE:SERVER_CRASH_ASSIGN, 
hasLock=true; ServerCrashProcedure server=server_48c,17020,1548726406632, 
splitWal=true, meta=false found RIT pid=4915, ppid=7, 
state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
TransitRegionStateProcedure table=table, 
region=8be2a423b16471b9417f0f7de04281c6, ASSIGN; rit=OPENING, 
location=server_48c,17020,1548726406632, table=table, 
region=8be2a423b16471b9417f0f7de04281c6

2019-01-28 18:12:10,778 INFO  [KeepAlivePEWorker-80] 
assignment.TransitRegionStateProcedure: Retry=3 of max=2147483647; pid=4915, 
ppid=7, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
TransitRegionStateProcedure table=table, 
region=8be2a423b16471b9417f0f7de04281c6, ASSIGN; rit=ABNORMALLY_CLOSED, 
location=null
...
2019-01-28 18:12:10,902 INFO  [KeepAlivePEWorker-80] 
assignment.TransitRegionStateProcedure: Starting pid=4915, ppid=7, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
TransitRegionStateProcedure table=table, 
region=8be2a423b16471b9417f0f7de04281c6, ASSIGN; rit=ABNORMALLY_CLOSED, 
location=null; forceNewPlan=true, retain=false
2019-01-28 18:12:11,114 INFO  [PEWorker-7] assignment.RegionStateStore: 
pid=4915 updating hbase:meta row=8be2a423b16471b9417f0f7de04281c6, 
regionState=OPENING, regionLocation=server_1aa,17020,1548727658713
{noformat}
However, we get the remote procedure failure from 48c after we've already 
started that.
It actually tried to open the region on the restarted RS, which makes me 
wonder whether this is also safe w.r.t. other races: what if the RS had 
already initialized and didn't error out?
We need to check whether the RS verifies the start code the master expects 
when opening (see the sketch below).
{noformat}
2019-01-28 18:12:12,179 WARN  [RSProcedureDispatcher-pool4-t362] 
assignment.RegionRemoteProcedureBase: The remote operation pid=11050, 
ppid=4915, state=SUCCESS, hasLock=false; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region 
{ENCODED => 8be2a423b16471b9417f0f7de04281c6 ... to server 
server_48c,17020,1548726406632 failed
org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: 
org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server 
server_48c,17020,1548727752747 is not running yet
{noformat}
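Roughly what I mean on the RS side, as a sketch with a hypothetical helper 
(the real check would have to live in the RS open path, wherever that is):
{code:java}
import java.io.IOException;
import org.apache.hadoop.hbase.ServerName;

// Hypothetical helper: before executing an open, the RS verifies that the
// request was addressed to its current incarnation. ServerName equality covers
// host, port and start code, so a request meant for a dead predecessor on the
// same host:port is refused.
public final class OpenTargetCheckSketch {
  static void checkOpenTarget(ServerName requestTarget, ServerName thisServer)
      throws IOException {
    if (requestTarget != null && !requestTarget.equals(thisServer)) {
      throw new IOException("Open request was meant for " + requestTarget
          + " but this server is " + thisServer + "; refusing to open");
    }
  }
}
{code}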
Without any other reason (at least none logged), the RIT immediately retries 
again and chooses a new candidate, even though the failure is a stale result 
from the old dispatch to 48c and the region is already opening on 1aa (see 
the sketch below).
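Again just a sketch with a hypothetical helper: the remote failure handling 
could compare the failed dispatch's target against the current assignment and 
ignore stale results instead of retrying:
{code:java}
import org.apache.hadoop.hbase.ServerName;

// Hypothetical helper: a remote open failure should only drive a retry if it
// came from the server the region is currently assigned to; anything else is
// a stale result from an earlier dispatch and can be ignored.
public final class StaleFailureCheckSketch {
  static boolean isStaleRemoteFailure(ServerName currentLocation, ServerName failedServer) {
    return currentLocation == null || !currentLocation.equals(failedServer);
  }
}
{code}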
{noformat}
2019-01-28 18:12:12,289 INFO  [KeepAlivePEWorker-100] 
assignment.TransitRegionStateProcedure: Retry=4 of max=2147483647; pid=4915, 
ppid=7, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
TransitRegionStateProcedure table=table, 
region=8be2a423b16471b9417f0f7de04281c6, ASSIGN; rit=OPENING, 
location=server_1aa,17020,1548727658713
...
2019-01-28 18:12:12,668 INFO  [PEWorker-9] assignment.RegionStateStore: 
pid=4915 updating hbase:meta row=8be2a423b16471b9417f0f7de04281c6, 
regionState=OPENING, regionLocation=server_fd3,17020,1548727536972
{noformat}
It then retries again and goes to the new 48c, but that's unrelated.
{noformat}
2019-01-28 18:26:29,480 INFO  [KeepAlivePEWorker-154] 
assignment.RegionStateStore: pid=4915 updating hbase:meta 
row=8be2a423b16471b9417f0f7de04281c6, regionState=OPENING, 
regionLocation=server_48c,17020,1548727752747
{noformat}
What does happen, though, is that 1aa, which never got a chance to respond 
before the RIT erroneously retried above, finishes opening the region, which 
the master ignores:
{noformat}
2019-01-28 18:27:42,424 WARN  
[RpcServer.default.FPBQ.Fifo.handler=29,queue=4,port=17000] 
assignment.TransitRegionStateProcedure: Received report OPENED transition from 
server_1aa,17020,1548727658713 for rit=OPENING, 
location=server_48c,17020,1548727752747, table=table, 
region=8be2a423b16471b9417f0f7de04281c6, pid=4915 but the region is not on it, 
should be a retry, ignore
{noformat}
The master then keeps emitting these warnings until, finally, the region is 
open in two places.
{noformat}
2019-01-28 18:27:45,500 WARN  
[RpcServer.default.FPBQ.Fifo.handler=42,queue=2,port=17000] 
assignment.AssignmentManager: rit=OPENING, 
location=server_48c,17020,1548727752747, table=table, 
region=8be2a423b16471b9417f0f7de04281c6 reported OPEN on 
server=server_1aa,17020,1548727658713 but state has otherwise
...
2019-01-28 18:51:21,197 WARN  
[RpcServer.default.FPBQ.Fifo.handler=10,queue=0,port=17000] 
assignment.AssignmentManager: rit=OPEN, 
location=server_48c,17020,1548727752747, table=table, 
region=8be2a423b16471b9417f0f7de04281c6 reported OPEN on 
server=server_1aa,17020,1548727658713 but state has otherwise
{noformat}

This can result in data loss.


> region can be opened on two servers due to race condition with procedures and 
> server reports
> --------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21811
>                 URL: https://issues.apache.org/jira/browse/HBASE-21811
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Sergey Shelukhin
>            Priority: Blocker
>


