[ https://issues.apache.org/jira/browse/HBASE-19515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Stack resolved HBASE-19515.
-----------------------------------
    Resolution: Not A Problem

Resolving as 'Not a Problem', fixed by HBASE-25032. Thanks [~anoop.hbase] for
taking a look.

> Region server left in online servers list forever if it went down after
> registering to master and before creating ephemeral node
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-19515
>                 URL: https://issues.apache.org/jira/browse/HBASE-19515
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>            Reporter: Michael Stack
>            Priority: Critical
>             Fix For: 3.0.0-alpha-2
>
>
> This one is interesting. It was supposedly fixed a long time ago back in
> HBASE-9593 (that issue has the same subject as this one) but there was a
> problem w/ the fix reported later, post-commit, long after the issue was
> closed. The 'fix' was registering the ephemeral node in ZK BEFORE reporting
> in to the Master for the first time. The problem w/ this approach is that the
> Master tells the RS what name it should use when reporting in. If we register
> in ZK before we talk to the Master, the name in ZK and the one the RS ends up
> using could deviate.
> In hbase2, we do the right thing, registering the ephemeral node after we
> report to the Master. So, the issue reported in HBASE-9593, that a RS that
> dies between reporting to the Master and registering up in ZK stays
> registered at the Master forever, is back; we'll keep trying to assign it
> regions. It's a real problem.
> That hbase2 has this issue has been suppressed up until now. The test that
> was written for HBASE-9593, TestRSKilledWhenInitializing, is a good test but
> a little sloppy. It puts up two RSs, aborting one only after it has
> registered at the Master and before it posts to ZK. That leaves one healthy
> server up. It is hosting hbase:meta. This is enough for the test to bluster
> through. The only assign it does is the namespace table. It goes to the
> hbase:meta server. If the test created a new table and did round robin, it'd
> fail.
> After HBASE-18946, where we do round robin on table create -- a desirable
> attribute -- via the balancer so all is kosher, the test
> TestRSKilledWhenInitializing now starts to fail because we choose the hobbled
> server most of the time.
> So, this issue is about fixing the original issue properly for hbase2. We
> don't have a timeout on assign in AMv2, not yet; that might be the fix, or
> perhaps a double report before we online a server, with the second report
> coming in after ZK goes up (or we stop doing ephemeral nodes for RS up in ZK
> and just rely on heartbeats....).
> Making this a critical issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)