[ 
https://issues.apache.org/jira/browse/HBASE-21863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765573#comment-16765573
 ] 

Sergey Shelukhin edited comment on HBASE-21863 at 2/12/19 1:17 AM:
-------------------------------------------------------------------

[~stack] if the master sees the deadline-based exception thrown by the RS, it 
looks to the master just like the call failing with some random error. So it's 
just another case of the call erroring out, with no change to the spec: very 
infrequently, we will mistakenly fail a call that could have succeeded and 
retry it elsewhere.

The issue this is trying to mitigate is that the spec is missing the situation 
where calls actually time out but succeed (this is the same sort of issue that 
nonces solve for increment). This doesn't fix the issue, but tries to detect 
when it's happening (i.e. that we are running the call past the master's RPC 
timeout). So, in most cases if we hit this, we will fail the call (or rather, 
not execute it and throw this exception), but on the master the call will have 
already timed out, due to a network issue, a call queue issue, or something 
else. 
The fact that calls can time out, and that the region state after that is 
unknown, is a spec issue for which the fix will be more involved (I mentioned a 
couple of options in that JIRA, but the cleanest way is basically to perform a 
variation of CONFIRM_CLOSED for that region, to ensure the RS will not open it; 
that will also interact in an already-existing way with SCP if the server dies).
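
To illustrate the deadline check described above, here is a minimal sketch. It 
is not the code from the attached patch: the class, method, and exception names 
are hypothetical, and it assumes the RS can derive an absolute deadline for the 
call (for example, from the master's RPC timeout) and compare it against its 
own clock. If the deadline has passed, the call is not executed and an 
exception is thrown instead.

```java
import java.util.function.Supplier;

public class DeadlineCheckSketch {

    /** Hypothetical exception; to the master it looks like any other failed call. */
    static class CallDeadlineExceededException extends RuntimeException {
        CallDeadlineExceededException(String message) {
            super(message);
        }
    }

    /**
     * Runs the action only if the deadline has not passed yet.
     *
     * @param deadlineMs assumed absolute deadline of the call, in milliseconds
     * @param nowMs      current time on the RS, in milliseconds
     * @param action     the work to do, e.g. opening a region
     */
    static <T> T runIfBeforeDeadline(long deadlineMs, long nowMs, Supplier<T> action) {
        if (nowMs >= deadlineMs) {
            // The master has most likely timed out already and may retry
            // elsewhere, so do not execute the call at all.
            throw new CallDeadlineExceededException(
                "Call is " + (nowMs - deadlineMs) + " ms past its deadline; not executing");
        }
        return action.get();
    }

    public static void main(String[] args) {
        long deadlineMs = 1_500L;

        // Before the deadline: the action runs normally.
        System.out.println(runIfBeforeDeadline(deadlineMs, 1_200L, () -> "region opened"));

        // Past the deadline: the action is skipped and the exception is thrown.
        try {
            runIfBeforeDeadline(deadlineMs, 2_000L, () -> "region opened");
        } catch (CallDeadlineExceededException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Note that a check like this only narrows the window. The deadline can still 
pass after the check succeeds and before the action completes, which is why the 
underlying spec issue needs the more involved fix.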

> narrow down the double-assignment race window
> ---------------------------------------------
>
>                 Key: HBASE-21863
>                 URL: https://issues.apache.org/jira/browse/HBASE-21863
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Major
>         Attachments: HBASE-21863.01.patch, HBASE-21863.patch
>
>
> See HBASE-21862.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
