[jira] [Comment Edited] (HBASE-21873) region can be assigned to 2 servers due to a timed-out call or an unknown exception

Sergey Shelukhin (JIRA) Tue, 12 Feb 2019 18:34:50 -0800


    [ 
https://issues.apache.org/jira/browse/HBASE-21873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766677#comment-16766677
 ]


Sergey Shelukhin edited comment on HBASE-21873 at 2/13/19 2:33 AM:
-------------------------------------------------------------------

This is  not about changing from procedures etc - that is a separate JIRA, out 
of the scope of this JIRA.
Also, double checking AssignRegionHandlerI do see that it checks the state, so 
that should be fine.

What I'm saying in this JIRA is that, keeping procedures and everything, the 
current state management in this case is incorrect. It's not about different 
assignment model; the specific part of the spec for the existing one is simply 
invalid, it doesn't account for this state, and it needs a small change to the 
state machine. I gave examples above why retry, while making the situation 
safer, doesn't solve the core problem (and also takes longer). 
Complicating state machine is not a reason to not have correct state machine.
It's the same as if we say "why have CONFIRM_CLOSED, it's just complicating 
state machine, we'll just send the close message, and assume region is 
closed"... it would just cause bugs, and while it could be "solved" in most 
cases with many hacks (e.g. sleep for 10 minutes) the proper solution is to add 
an extra state/transition and to confirm region is closed. It's the same 
here... retries can get e.g. network timeout, then on retry they can get 
CallQueueTooBig, or some other error, and it will assume that open didn't 
happen based on CallQueueToBig, but actually the first attempt started the 
open. So, the master has to retry until server dies or opens the region to be 
safe. But after master restart, new master will not know that it already got 
into this state and that it needs to retry accounting for it, for example.
So we can pile hacks to solve each new special case, or we can fix state 
machine properly...





was (Author: sershe):
This is  not about changing from procedures etc - that is a separate JIRA, out 
of the scope of this JIRA.
Also, double checking AssignRegionHandlerI do see that it checks the state, so 
that should be fine.

What I'm saying in this JIRA is that, keeping procedures and everything, the 
current state management in this case is incorrect. It's not about different 
assignment model; the specific part of the spec for the existing one is simply 
invalid, it doesn't account for this state, and it needs a small change to the 
state machine. I gave examples above why retry, while making the situation 
safer, doesn't solve the core problem (and also takes longer). 
Complicating state machine is not a reason to not have correct state machine.
It's the same as if we say "why have CONFIRM_CLOSED, it's just complicating 
state machine, we'll just send the close message, and assume region is 
closed"... it would just cause bugs, and while it could be "solved" in most 
cases with many hacks (e.g. sleep for 10 minutes) the proper solution is to add 
an extra state/transition and to confirm region is closed. It's the same 
here... retries can get e.g. network timeout, then on retry they can get 
CallQueueTooBig, or some other error, and it will assume that open didn't 
happen based on CallQueueToBig, but actually the first attempt started the 
open. So, the master has to retry until server dies or opens the region to be 
safe. But after master restart, new master will not know that it already got 
into this state and that it needs to retry accounting for it, for example.




> region can be assigned to 2 servers due to a timed-out call or an unknown 
> exception
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-21873
>                 URL: https://issues.apache.org/jira/browse/HBASE-21873
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Blocker
>             Fix For: 3.0.0, 2.2.0, 2.3.0
>
>         Attachments: HBASE-21862-forUT.patch, HBASE-21862-v1.patch, 
> HBASE-21862-v2.patch, HBASE-21862.patch
>
>
> It's a classic bug, sort of... the call times out to open the region, but RS 
> actually processes it alright. It could also happen if the response didn't 
> make it back due to a network issue.
> As a result region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill it (YouAreDead exception) - but that 
> will require fixing other network races where master kills RS, which would 
> require adding state versioning to the protocol.
> The fundamental fix though would require either
> 1) an unknown failure from open to ascertain the state of the region from the 
> server. Again, this would probably require protocol changes to make sure we 
> ascertain the region is not opened, and also that the 
> already-failed-on-master open is NOT going to be processed if it's some queue 
> or even in transit on the network (via a nonce-like mechanism)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a 
> manner that's both scalable and guaranteed correct. Theoretically it could be 
> all RSes.
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] 
> procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, 
> location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] 
> assignment.RegionRemoteProcedureBase: The remote operation pid=260637, 
> ppid=260626, state=RUNNABLE, hasLock=false; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... 
> to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=60000^M
>         at 
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)^M
>         at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)^M
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, 
> waitTime=60145, rpcTimeout=60000^M
>         at 
> org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)^M
>         ... 4 more^M
>     {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  
> [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: 
> Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; 
> pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, 
> location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] 
> assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, location=null; 
> forceNewPlan=true, retain=false
> 2019-02-08 03:22:33,238 INFO  [PEWorker-7] assignment.RegionStateStore: 
> pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, 
> regionState=OPENING, regionLocation=server2,17020,1549569075319
> {noformat}
> The ignore-message
> {noformat}
> 2019-02-08 03:25:44,754 WARN  
> [RpcServer.default.FPBQ.Fifo.handler=34,queue=4,port=17000] 
> assignment.TransitRegionStateProcedure: Received report OPENED transition 
> from server1,17020,1549623714617 for rit=OPENING, 
> location=server2,17020,1549569075319, table=table, 
> region=d0214809147e43dc6870005742d5d204, pid=260626 but the region is not on 
> it, should be a retry, ignore
> {noformat}
> The 2nd assignment
> {noformat}
> 2019-02-08 03:26:18,915 INFO  [PEWorker-7] procedure2.ProcedureExecutor: 
> Finished subprocedure(s) of pid=260626, ppid=260595, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN; resume parent processing.
> 2019-02-08 03:26:18,971 INFO  [PEWorker-11] procedure2.ProcedureExecutor: 
> Finished pid=260626, ppid=260595, state=SUCCESS, hasLock=false; 
> TransitRegionStateProcedure table=table, 
> region=d0214809147e43dc6870005742d5d204, ASSIGN in 4mins, 47.347sec
> {noformat}
> ======= by Duo Zhang ======
> The actual problem here is that, in IPCUtil.wrapException, we want to add the 
> remote server address in the exception message so it will be easier for 
> debugging, and there are several instanceof checks in it which is for keeping 
> the original exception type, since upper layer may depend on the exception 
> type for error recovery. But we do not check for CallTimeoutException in this 
> method so it will be wrapped by an IOException, which makes the code in 
> RSProcedureDispatcher broken, and causes the double assign.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Comment Edited] (HBASE-21873) region can be assigned to 2 servers due to a timed-out call or an unknown exception

Reply via email to