[ https://issues.apache.org/jira/browse/HBASE-25059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17198598#comment-17198598 ]
Duo Zhang commented on HBASE-25059:
-----------------------------------
This is a common problem in distributed RPC: once a request has been sent out,
we cannot know the result unless either:
1. The remote side sends something back that tells us the result, or
2. We learn the state of the remote side through another channel and can
determine the result from that.
For 1, every decision except the server-dead one falls under this scenario. A
connect exception comes from the OS, so we know that the rs did not receive
the request. A queue full exception on the first attempt means the rs received
the request but did not process it. Another such place is
reportRegionTransition, where the rs tells us the result of the request
directly.
For 2, currently the only way is to abort the rs, since we have fencing for
this scenario to make sure that the dead rs cannot carry any regions.
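To make the two cases concrete, here is a minimal sketch of the reasoning in
Java. It is not the AssignmentManager code: the class RpcOutcomeSketch, the
Knowledge enum, the QueueFullException stand-in and both method names are
hypothetical and exist only for illustration.

```java
import java.net.ConnectException;

public class RpcOutcomeSketch {

    /** What the caller can conclude about a request it has sent (hypothetical). */
    enum Knowledge {
        NOT_RECEIVED,           // the request never reached the rs
        RECEIVED_NOT_PROCESSED, // the rs got the request but rejected it
        RESULT_REPORTED,        // the rs told us the result itself
        UNKNOWN                 // nothing came back; the result is undetermined
    }

    /** Hypothetical stand-in for a "call queue full" rejection from the rs. */
    static class QueueFullException extends Exception {
        QueueFullException(String message) {
            super(message);
        }
    }

    /**
     * Case 1: something came back, so the result can be derived from it.
     * Anything that is not recognised is treated as UNKNOWN.
     */
    static Knowledge classify(Throwable error, int attempt, boolean transitionReported) {
        if (transitionReported) {
            // The rs reported the region transition directly.
            return Knowledge.RESULT_REPORTED;
        }
        if (error instanceof ConnectException) {
            // Raised locally while connecting, so the request was never delivered.
            return Knowledge.NOT_RECEIVED;
        }
        if (error instanceof QueueFullException && attempt == 1) {
            // Only the first attempt counts here (assumption of this sketch):
            // on a retry, an earlier attempt may already have been accepted.
            return Knowledge.RECEIVED_NOT_PROCESSED;
        }
        return Knowledge.UNKNOWN;
    }

    /**
     * Case 2: with no reply, the state has to be learned some other way. In
     * this sketch that means aborting the rs and relying on fencing.
     */
    static boolean needsAbortToDecide(Knowledge knowledge) {
        return knowledge == Knowledge.UNKNOWN;
    }

    public static void main(String[] args) {
        Knowledge refused = classify(new ConnectException("refused"), 1, false);
        Knowledge silent = classify(null, 1, false);
        System.out.println(refused + " needs abort: " + needsAbortToDecide(refused));
        System.out.println(silent + " needs abort: " + needsAbortToDecide(silent));
    }
}
```

A hung rs that never replies and never reports lands in UNKNOWN, which is the
situation described in this issue.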
Hope this helps you better understand the current design. Maybe we should add
this to the Javadoc of AM?
Thanks.
> TransitionRegionStateProcedure should timeout, rollback, retry instead of
> waiting infinitely on CONFIRMED_OPEN
> --------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-25059
> URL: https://issues.apache.org/jira/browse/HBASE-25059
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Affects Versions: 2.3.2
> Reporter: Nick Dimiduk
> Priority: Major
>
> Testing 2.3.2RC1 with ITBLL. The region server assigned to open meta locked
> up due to HBASE-24896. Meanwhile, the master waits indefinitely on a
> procedure {{pid=176583, ppid=176532,
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED;
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN}}.
> AssignmentManager needs a way to rescind assignment when a RS fails to
> complete within a reasonable timeout window, roll back the procedure, and try
> again with a new target.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)