[ https://issues.apache.org/jira/browse/HBASE-21873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766351#comment-16766351 ]

Sergey Shelukhin edited comment on HBASE-21873 at 2/12/19 7:29 PM:
-------------------------------------------------------------------

W.r.t. "the only safe option": as I mentioned elsewhere, this is fundamentally a 
spec issue. The master thinks the region is closed, but it's actually opening... 
the state on the master is incorrect.
So there are two high-level ways to fix it:
1) *Address the state issue by adding another state/transition to the 
procedures - an UNKNOWN state (or reuse the CLOSING state)*, and confirm the 
region is closed. Given potential timing issues on the network and in RS RPC 
queues, the request in flight would also need to be fenced, so this would 
probably require region protocol versioning, which I am going to add at some 
point when I'm done investigating other issues ;)
This is the proper and least disruptive fix, but also the more complex one - we 
go from OPENING to UNKNOWN, then in the normal case confirm the region is 
CLOSED and retry OPENING elsewhere.
Or, in an abnormal case like the one I reported, we find out it's OPENED, and 
we are good.
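
A minimal, self-contained sketch of this flow (illustrative names only, not the 
actual TransitRegionStateProcedure API; the fenced state probe is the assumed 
protocol addition):
{code:java}
// Sketch of option 1: a hypothetical UNKNOWN state between a failed OPENING
// and the next action. Not HBase code; all names are illustrative.
public class RegionStateSketch {

  enum State { OPENING, UNKNOWN, CLOSED, OPENED }

  /** Stand-in for a fenced, versioned "do you host this region?" RPC. */
  interface RegionServerProbe {
    boolean hostsRegion(String regionName);
  }

  static State resolveAfterFailedOpen(RegionServerProbe rs, String regionName) {
    // The open RPC failed with a timeout/unknown error: UNKNOWN is the only
    // honest state; never assume CLOSED here.
    State state = State.UNKNOWN;
    if (rs.hostsRegion(regionName)) {
      state = State.OPENED;  // abnormal case from this report: it actually opened
    } else {
      state = State.CLOSED;  // normal case: confirmed closed, retry OPENING elsewhere
    }
    return state;
  }

  public static void main(String[] args) {
    State s = resolveAfterFailedOpen(region -> true,
        "d0214809147e43dc6870005742d5d204");
    System.out.println("resolved state: " + s);  // OPENED: keep it, no second assign
  }
}
{code}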

2) *Make sure incorrect state in the master is always safe*. As in the original 
issue, assuming CLOSED and retrying on a different server is not safe.
Retrying OPENING on the same server is better, but still not safe (esp. with 
potential code issues in the assignment handler and the like).
One safe way is to assume the region is OPEN and stop the assignment. That way 
the region will be offline, but there will be no double assignment, and any 
other transition will have to do CONFIRM_CLOSED on the server.
To solve the problem that the region is offline, without breaking the 
no-double-assignment constraint, a simple thing we can do is kill the RS to 
resolve the ambiguity... However, trying to kill the RS may not work, because 
we are facing network errors - it could be that the RS is already dead, or that 
a network partition etc. is causing the network timeouts in the first place. 
So, we can instead put a flag on the RS and only kill it when it comes back. If 
we keep the flag on for too long, the worst-case scenario is that this one 
region stays offline. Now, if the RS is in fact already dead, SCP will process 
it eventually and reassign this fake-OPEN region, so we are good.
If the RS is just having some temporary issues (e.g. a network partition), then 
when it comes back with a report, we will kill it because of the flag, and 
again SCP will take care of the region.
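
A minimal sketch of this flag-and-kill-on-return idea (illustrative names; 
flagAmbiguousServer/expireServer are not existing master APIs):
{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of option 2: instead of killing an unreachable RS immediately,
// flag it and expire it only when it reports back; SCP does the rest.
public class FlaggedServerSketch {

  private final Set<String> flaggedServers = ConcurrentHashMap.newKeySet();

  /** An OPEN call to this server failed ambiguously: flag it. The region
      stays fake-OPEN on it, so no double assignment can start. */
  void flagAmbiguousServer(String serverName) {
    flaggedServers.add(serverName);
  }

  /** If a flagged server reports in, kill it; SCP will then safely
      reassign the fake-OPEN region. */
  void onRegionServerReport(String serverName) {
    if (flaggedServers.remove(serverName)) {
      expireServer(serverName);
    }
  }

  void expireServer(String serverName) {
    System.out.println("expiring " + serverName + "; SCP will reassign its regions");
  }

  public static void main(String[] args) {
    FlaggedServerSketch master = new FlaggedServerSketch();
    master.flagAmbiguousServer("server1,17020,1549623714617");
    // If server1 is truly dead, SCP handles it without this path ever firing.
    // If it was only partitioned, its next report triggers the kill:
    master.onRegionServerReport("server1,17020,1549623714617");
  }
}
{code}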

A possible *hybrid approach* is to assume the region is OPEN, but immediately 
trigger an unassign/move for this region. That way CONFIRM_CLOSED will be 
processed on the server, as usual.
Another approach is an improved (2), where we don't kill the RS but only move 
the region if the server comes back and doesn't have it. However, due to 
potential requests in flight, that would be much more involved.
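
A sketch of the hybrid ordering (illustrative only; the point is that no step 
ever leaves two servers able to host the region):
{code:java}
import java.util.List;

// Sketch of the hybrid approach: resolve the ambiguity pessimistically as
// OPEN, then immediately reuse the normal move machinery to settle it.
public class HybridApproachSketch {

  enum Step {
    MARK_OPEN_ON_AMBIGUOUS_SERVER,  // blocks any concurrent assign
    SCHEDULE_MOVE,                  // reuses the normal unassign/move path
    CONFIRM_CLOSED_ON_SERVER,       // settles the real state via the usual RPC
    OPEN_ELSEWHERE                  // safe only after the confirm succeeds
  }

  static List<Step> hybridPlan() {
    return List.of(
        Step.MARK_OPEN_ON_AMBIGUOUS_SERVER,
        Step.SCHEDULE_MOVE,
        Step.CONFIRM_CLOSED_ON_SERVER,
        Step.OPEN_ELSEWHERE);
  }

  public static void main(String[] args) {
    hybridPlan().forEach(System.out::println);
  }
}
{code}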



> region can be assigned to 2 servers due to a timed-out call or an unknown 
> exception
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-21873
>                 URL: https://issues.apache.org/jira/browse/HBASE-21873
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Blocker
>             Fix For: 3.0.0, 2.2.0, 2.3.0
>
>         Attachments: HBASE-21862-forUT.patch, HBASE-21862-v1.patch, 
> HBASE-21862-v2.patch, HBASE-21862.patch
>
>
> It's a classic bug, sort of... the call to open the region times out, but the 
> RS actually processes it just fine. It could also happen if the response 
> didn't make it back due to a network issue.
> As a result, the region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls; fail them instead (sketched below). This 
> won't help for network issues.
> 2) Don't ignore invalid RS state; kill the RS (YouAreDead exception) - but 
> that will require fixing other network races where the master kills the RS, 
> which would require adding state versioning to the protocol.
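> A minimal sketch of mitigation 1 (illustrative names; the real deadline 
> bookkeeping would live in the RS RPC layer): reject an open call whose 
> timeout has already elapsed instead of processing it.
> {code:java}
> // Sketch only: fail expired open calls instead of processing them. Note
> // this cannot help when the open succeeded but the response was lost.
> public class ExpiredCallGuardSketch {
> 
>   /** Hypothetical view of an inbound RPC: arrival time and client timeout. */
>   static class InboundCall {
>     final long receiveTimeMs;
>     final long timeoutMs;
>     InboundCall(long receiveTimeMs, long timeoutMs) {
>       this.receiveTimeMs = receiveTimeMs;
>       this.timeoutMs = timeoutMs;
>     }
>   }
> 
>   static boolean shouldProcessOpen(InboundCall call, long nowMs) {
>     // The master has certainly given up once the timeout has elapsed.
>     return nowMs - call.receiveTimeMs < call.timeoutMs;
>   }
> 
>   public static void main(String[] args) {
>     InboundCall stale = new InboundCall(0L, 60_000L); // rpcTimeout=60000 as in the log
>     System.out.println(shouldProcessOpen(stale, 70_000L)); // false: reject the open
>   }
> }
> {code}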
> The fundamental fix, though, would require one of:
> 1) on an unknown failure from open, ascertaining the state of the region from 
> the server. Again, this would probably require protocol changes to make sure 
> we ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's in some 
> queue or even in transit on the network (via a nonce-like mechanism, sketched 
> after this list);
> 2) some form of a distributed lock per region, e.g. in ZK;
> 3) some form of 2PC? But the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
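> A minimal sketch of such a nonce-like fence (hypothetical, in-memory; not an 
> existing HBase API): each open attempt carries a monotonically increasing 
> token, and the RS rejects any attempt older than the newest it has seen.
> {code:java}
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
> 
> // Sketch only: fence out an already-failed-on-master open that is still
> // sitting in a queue or in transit on the network.
> public class OpenFencingSketch {
> 
>   private final Map<String, Long> latestToken = new ConcurrentHashMap<>();
> 
>   /** Accept an open only if its token is the newest seen for the region. */
>   boolean acceptOpen(String region, long token) {
>     return latestToken.merge(region, token, Math::max) == token;
>   }
> 
>   public static void main(String[] args) {
>     OpenFencingSketch rs = new OpenFencingSketch();
>     String region = "d0214809147e43dc6870005742d5d204";
>     System.out.println(rs.acceptOpen(region, 1L)); // true: first attempt
>     System.out.println(rs.acceptOpen(region, 2L)); // true: the master's retry
>     System.out.println(rs.acceptOpen(region, 1L)); // false: stale attempt fenced out
>   }
> }
> {code}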
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=60000
>         at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)
>         at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=60000
>         at org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)
>         ... 4 more
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, location=null; 
> forceNewPlan=true, retain=false
> 2019-02-08 03:22:33,238 INFO  [PEWorker-7] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server2,17020,1549569075319
> {noformat}
> The ignore message:
> {noformat}
> 2019-02-08 03:25:44,754 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=34,queue=4,port=17000] 
> assignment.TransitRegionStateProcedure: Received report OPENED transition 
> from server1,17020,1549623714617 for rit=OPENING, 
> location=server2,17020,1549569075319, table=table, 
> region=d0214809147e43dc6870005742d5d204, pid=260626 but the region is not on 
> it, should be a retry, ignore
> {noformat}
> The 2nd assignment:
> {noformat}
> 2019-02-08 03:26:18,915 INFO  [PEWorker-7] procedure2.ProcedureExecutor: 
> Finished subprocedure(s) of pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; resume parent processing.
> 2019-02-08 03:26:18,971 INFO  [PEWorker-11] procedure2.ProcedureExecutor: 
> Finished pid=260626, ppid=260595, state=SUCCESS, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN in 4mins, 47.347sec
> {noformat}
> ======= by Duo Zhang ======
> The actual problem here is that, in IPCUtil.wrapException, we want to add the 
> remote server address to the exception message so that it is easier to debug, 
> and there are several instanceof checks in it for keeping the original 
> exception type, since the upper layer may depend on the exception type for 
> error recovery. But we do not check for CallTimeoutException in this method, 
> so it gets wrapped in an IOException, which breaks the code in 
> RSProcedureDispatcher and causes the double assign.
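> A sketch of the idea behind the fix (stand-in types, not the actual IPCUtil 
> patch): when decorating an exception with the remote address, preserve the 
> types the upper layer dispatches on.
> {code:java}
> import java.io.IOException;
> import java.net.ConnectException;
> 
> public class WrapExceptionSketch {
> 
>   /** Stand-in for o.a.h.hbase.ipc.CallTimeoutException in this sketch. */
>   static class CallTimeoutException extends IOException {
>     CallTimeoutException(String msg) { super(msg); }
>   }
> 
>   // Without the CallTimeoutException branch, a timeout comes back as a
>   // plain IOException; the dispatcher treats that as "server unusable" and
>   // retries the open elsewhere -- the double assign.
>   static IOException wrapException(String addr, Exception error) {
>     if (error instanceof ConnectException) {
>       return new ConnectException("Connection refused: " + addr);    // type kept
>     } else if (error instanceof CallTimeoutException) {
>       return new CallTimeoutException("Call to " + addr + " failed: "
>           + error.getMessage());                                     // the missing check
>     } else {
>       return new IOException("Call to " + addr + " failed", error);  // generic wrap
>     }
>   }
> 
>   public static void main(String[] args) {
>     IOException wrapped = wrapException("server1/...:17020",
>         new CallTimeoutException("Call id=27191, waitTime=60145, rpcTimeout=60000"));
>     System.out.println(wrapped.getClass().getSimpleName()); // CallTimeoutException
>   }
> }
> {code}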


