[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14592814#comment-14592814 ]
Duo Zhang commented on HBASE-13937:
-----------------------------------

For isServerReachable, I think we can say a server is 'dead' if one of the following conditions is satisfied:

1. The server tells us it is dead (I'm not sure whether a RegionServerStoppedException is enough; maybe it is thrown before the regionserver has completely shut down?).
2. It is a server with another start code.
3. We get a connection refused (not a connect timeout).

So I think removing the code is reasonable, but then we should catch the connection refused exception (does our rpc framework throw this exception out? Also, we do not need to retry here...). Otherwise, if we do not restart a regionserver, we will be stuck in the loop for a long time...

Also, I do not think it is safe to return 'not reachable' on a timeout...

Thanks.

> Partially revert HBASE-13172
> -----------------------------
>
>                 Key: HBASE-13937
>                 URL: https://issues.apache.org/jira/browse/HBASE-13937
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Region Assignment
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 0.98.14, 1.2.0, 1.1.1, 1.3.0
>
>         Attachments: hbase-13937_v1.patch
>
>
> HBASE-13172 is supposed to fix a UT issue, but causes other problems that
> parent jira (HBASE-13605) is attempting to fix.
> However, HBASE-13605 patch v4 uncovers at least 2 different issues which are,
> to put it mildly, major design flaws in AM / RS.
> Regardless of 13605, the issue with 13172 is that we catch
> {{ServerNotRunningYetException}} from {{isServerReachable()}} and return
> false, which then puts the Server to the {{RegionStates.deadServers}} list.
> Once it is in that list, we can still assign and unassign regions to the RS
> after it has started (because regular assignment does not check whether the
> server is in {{RegionStates.deadServers}}). However, after the first assign
> and unassign, we cannot assign the region again since then the check for the
> lastServer will think that the server is dead.
> It turns out that a proper patch for 13605 is very hard without fixing the rest
> of the broken AM assumptions (see HBASE-13605, HBASE-13877 and HBASE-13895 for a
> colorful history). For 1.1.1, I think we should just revert parts of
> HBASE-13172 for now.
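
The three 'dead' conditions proposed in the comment, and its point that a timeout should not count as 'not reachable', can be sketched roughly as below. This is only an illustration under assumptions, not the HBase implementation: `ReachabilitySketch`, `Verdict`, `Probe`, and `classify` are hypothetical names, and the two nested exception classes are local stand-ins for HBase's `RegionServerStoppedException` and `ServerNotRunningYetException`. It also assumes the rpc layer surfaces a refused connection as `java.net.ConnectException`, which the comment itself leaves as an open question.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;

final class ReachabilitySketch {

    /** Outcome of one probe. UNKNOWN means "do not declare the server dead; try again later". */
    enum Verdict { ALIVE, DEAD, UNKNOWN }

    /** Hypothetical stand-in for HBase's RegionServerStoppedException. */
    static class ServerStoppedException extends IOException {}

    /** Hypothetical stand-in for HBase's ServerNotRunningYetException. */
    static class ServerNotRunningYetException extends IOException {}

    /** Hypothetical probe: returns the start code the server reports, or throws. */
    interface Probe {
        long fetchStartCode() throws IOException;
    }

    static Verdict classify(long expectedStartCode, Probe probe) {
        try {
            long actual = probe.fetchStartCode();
            // Condition 2: a different start code means the instance we knew is gone.
            return actual == expectedStartCode ? Verdict.ALIVE : Verdict.DEAD;
        } catch (ServerStoppedException e) {
            // Condition 1: the server says it is stopped. The comment is unsure
            // whether this is enough on its own; the sketch simply assumes it is.
            return Verdict.DEAD;
        } catch (ConnectException e) {
            // Condition 3: treated here as "connection refused" (a simplification;
            // ConnectException can carry other connect failures as well).
            return Verdict.DEAD;
        } catch (SocketTimeoutException | ServerNotRunningYetException e) {
            // A timeout or a server that is still starting up is not proof of death.
            return Verdict.UNKNOWN;
        } catch (IOException e) {
            // Anything else is inconclusive.
            return Verdict.UNKNOWN;
        }
    }
}
```

With this shape, a `ServerNotRunningYetException` yields `UNKNOWN` rather than `DEAD`, so a server that is merely starting up would not be added to a dead-servers list, which is the behaviour the quoted description objects to in HBASE-13172.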