[ https://issues.apache.org/jira/browse/HBASE-25059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197898#comment-17197898 ]
Nick Dimiduk commented on HBASE-25059:
--------------------------------------
In this case, after 220 attempts that each fail with {{CallTimeoutException}},
the region server starts responding with {{CallQueueTooBigException}}. Still,
we never give up.
Down in {{RSProcedureDispatcher$ExecuteProceduresRemoteCall#scheduleForRetry}},
I see no consideration for {{CallTimeoutException}}. There is handling for
{{CallQueueTooBigException}}, but only for a highly specialized case: the
first attempt.
{noformat}
// This exception is thrown in the rpc framework, where we can make sure that the call has not
// been executed yet, so it is safe to mark it as fail. Especially for open a region, we'd
// better choose another region server.
// Notice that, it is safe to quit only if this is the first time we send request to region
// server. Maybe the region server has accepted our request the first time, and then there is
// a network error which prevents we receive the response, and the second time we hit a
// CallQueueTooBigException, obviously it is not safe to quit here, otherwise it may lead to a
// double assign...
if (e instanceof CallQueueTooBigException && numberOfAttemptsSoFar == 0) {
  LOG.warn("request to {} failed due to {}, try={}, this usually because" +
    " server is overloaded, give up", serverName, e.toString(), numberOfAttemptsSoFar);
  return false;
}
{noformat}
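
To make the gap concrete, here is a rough sketch of what a bounded retry decision could look like. It is not the HBase implementation: the exception classes are local stand-ins for the real ones, and {{MAX_TIMEOUT_ATTEMPTS}}, {{Decision}}, and {{decide}} are hypothetical names invented for illustration. It assumes that giving up after repeated timeouts is only acceptable if the caller then fences the target server, since, as the comment above notes, the server may already have accepted the request.

```java
import java.io.IOException;

public class RetryDecisionSketch {
  // Local stand-ins for the HBase exceptions named above; NOT the real classes.
  static class CallTimeoutException extends IOException {
    CallTimeoutException(String msg) { super(msg); }
  }

  static class CallQueueTooBigException extends IOException {
    CallQueueTooBigException(String msg) { super(msg); }
  }

  // Hypothetical cap on timed-out attempts; the value is arbitrary.
  static final int MAX_TIMEOUT_ATTEMPTS = 10;

  enum Decision { RETRY, GIVE_UP_SAFE, GIVE_UP_NEEDS_FENCING }

  static Decision decide(IOException e, int numberOfAttemptsSoFar) {
    // Mirrors the existing special case: the call was rejected before it ran,
    // and no earlier attempt could have been accepted by the server.
    if (e instanceof CallQueueTooBigException && numberOfAttemptsSoFar == 0) {
      return Decision.GIVE_UP_SAFE;
    }
    // Assumed addition: stop retrying after too many timeouts. The server may
    // have accepted an earlier request, so the caller must fence it before
    // picking a new target, otherwise a double assign is possible.
    if (e instanceof CallTimeoutException && numberOfAttemptsSoFar >= MAX_TIMEOUT_ATTEMPTS) {
      return Decision.GIVE_UP_NEEDS_FENCING;
    }
    return Decision.RETRY;
  }

  public static void main(String[] args) {
    System.out.println(decide(new CallQueueTooBigException("queue full"), 0));
    System.out.println(decide(new CallTimeoutException("timed out"), 3));
    System.out.println(decide(new CallTimeoutException("timed out"), 220));
  }
}
```

How the caller would actually fence the server and roll back the procedure is outside this sketch.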
> TransitionRegionStateProcedure should timeout, rollback, retry instead of
> waiting infinitely on CONFIRMED_OPEN
> --------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-25059
> URL: https://issues.apache.org/jira/browse/HBASE-25059
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Affects Versions: 2.3.2
> Reporter: Nick Dimiduk
> Priority: Major
>
> Testing 2.3.2RC1 with ITBLL. The region server assigned to open meta locked
> up due to HBASE-24896. Meanwhile, the master waits indefinitely on a
> procedure {{pid=176583, ppid=176532,
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED;
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN}}.
> AssignmentManager needs a way to rescind assignment when a RS fails to
> complete within a reasonable timeout window, roll back the procedure, and try
> again with a new target.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)