[
https://issues.apache.org/jira/browse/HBASE-21863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765573#comment-16765573
]
Sergey Shelukhin edited comment on HBASE-21863 at 2/12/19 12:51 AM:
--------------------------------------------------------------------
[~stack] if the master sees the deadline-based exception thrown by the RS, it
looks to the master just like the call failing with some random error. So it's
just another case of the call erroring out, with no change to the spec; very
infrequently, we will just fail by mistake a call that could have succeeded.
The issue this is trying to mitigate is that the spec is missing the situation
where calls time out from the master's point of view but actually succeed (this
is the same sort of issue that nonces solve for increment). This doesn't fix
the issue, but tries to detect when it's happening (i.e. that we are running
the call past the master's RPC timeout). So, in most cases where we hit this,
we will fail the call, but on the master the call will have already timed out,
due to a network issue, a call queue issue, or something else.
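As a rough illustration of the check described above (not the code from the
attached patch), the sketch below fails a call once it has run past the
caller's timeout. All names here (DeadlineCheckSketch, CallDeadline,
CallDeadlineExceededException, openRegion) are hypothetical and are not HBase
APIs. It also assumes the RS knows the master's RPC timeout and measures it
from the moment the call arrived, which only approximates the master's own
clock.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch; none of these names come from HBase or the patch.
final class DeadlineCheckSketch {

    /** Raised when the handler notices it has run past the caller's deadline. */
    static final class CallDeadlineExceededException extends RuntimeException {
        CallDeadlineExceededException(String message) {
            super(message);
        }
    }

    /** Deadline derived from the call's arrival time and the caller's timeout. */
    static final class CallDeadline {
        private final long deadlineMillis;
        private final LongSupplier clock;

        CallDeadline(long receivedAtMillis, long callerTimeoutMillis, LongSupplier clock) {
            this.deadlineMillis = receivedAtMillis + callerTimeoutMillis;
            this.clock = clock;
        }

        /** Throws if the current time is past the deadline; otherwise does nothing. */
        void check(String step) {
            long now = clock.getAsLong();
            if (now > deadlineMillis) {
                throw new CallDeadlineExceededException(
                    "Past caller deadline by " + (now - deadlineMillis) + " ms at step: " + step);
            }
        }
    }

    /**
     * Stand-in for a region-open handler. The deadline is checked right before
     * the state change becomes visible, so a call that the caller has most
     * likely already given up on is failed instead of silently succeeding.
     */
    static String openRegion(String regionName, CallDeadline deadline) {
        // ... expensive preparation (queueing, loading files) would happen here ...
        deadline.check("before marking " + regionName + " open");
        return "OPENED " + regionName;
    }
}
```

To the caller, the exception is just another failed call. The check narrows the
window but does not close it: the deadline can still pass between the check and
the state change, and a call that is a few milliseconds late may be failed even
though the caller would still have accepted the result.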
The fact that calls can time out, and that the region state after that is
unknown, is a spec issue for which the fix will be more involved (I mentioned a
couple of options in that JIRA, but the cleanest way is basically to perform a
variation of CONFIRM_CLOSED for that region, to ensure the RS will not open it;
that will also interact in an already-existing way with SCP if the server
dies).
was (Author: sershe):
[~stack] the timeout here, if the master sees it, looks to the master just like
the call failing with some random error. So it's just another case of the call
erroring out, with no change to the spec; very infrequently, we will just fail
by mistake a call that could have succeeded.
The issue this is trying to mitigate is that the spec is missing the situation
where calls time out from the master's point of view but actually succeed (this
is the same sort of issue that nonces solve for increment). This doesn't fix
the issue, but tries to detect when it's happening (i.e. that we are running
the call past the master's RPC timeout). So, in most cases where we hit this,
we will fail the call, but on the master the call will have already timed out,
due to a network issue, a call queue issue, or something else.
The fact that calls can time out, and that the region state after that is
unknown, is a spec issue for which the fix will be more involved (I mentioned a
couple of options in that JIRA, but the cleanest way is basically to perform a
variation of CONFIRM_CLOSED for that region, to ensure the RS will not open it;
that will also interact in an already-existing way with SCP if the server
dies).
> narrow down the double-assignment race window
> ---------------------------------------------
>
> Key: HBASE-21863
> URL: https://issues.apache.org/jira/browse/HBASE-21863
> Project: HBase
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Priority: Major
> Attachments: HBASE-21863.01.patch, HBASE-21863.patch
>
>
> See HBASE-21862.