[ 
https://issues.apache.org/jira/browse/HBASE-19515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack resolved HBASE-19515.
-----------------------------------
    Resolution: Not A Problem

Resolving as 'Not a Problem', fixed by HBASE-25032. Thanks [~anoop.hbase] for 
taking a look.

> Region server left in online servers list forever if it went down after 
> registering to master and before creating ephemeral node
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-19515
>                 URL: https://issues.apache.org/jira/browse/HBASE-19515
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>            Reporter: Michael Stack
>            Priority: Critical
>             Fix For: 3.0.0-alpha-2
>
>
> This one is interesting. It was supposedly fixed long time ago back in 
> HBASE-9593 (The issue has same subject as this one) but there was a problem 
> w/ the fix reported later, post-commit, long after the issue was closed. The 
> 'fix' was registering ephemeral node in ZK BEFORE reporting in to the Master 
> for the first time. The problem w/ this approach is that the Master tells the 
> RS what name it should use reporting in. If we register in ZK before we talk 
> to the Master, the name in ZK and the one the RS ends up using could deviate.
> In hbase2, we do the right thing registering the ephemeral node after we 
> report to the Master. So, the issue reported in HBASE-9593, that a RS that 
> dies between reporting to master and registering up in ZK, stays registered 
> at the Master for ever is back; we'll keep trying to assign it regions. Its a 
> real problem.
> That hbase2 has this issue has been suppressed up until now. The test that 
> was written for HBASE-9593, TestRSKilledWhenInitializing, is a good test but 
> a little sloppy. It puts up two RSs aborting one only after registering at 
> the Master before posting to ZK. That leaves one healthy server up. It is 
> hosting hbase:meta. This is enough for the test to bluster through. The only 
> assign it does is namespace table. It goes to the hbase:meta server. If the 
> test created a new table and did roundrobin, it'd fail.
> After HBASE-18946, where we do round robin on table create -- a desirable 
> attribute -- via the balancer so all is kosher, the test 
> TestRSKilledWhenInitializing now starts to fail because we chose the hobbled 
> server most of the time.
> So, this issue is about fixing the original issue properly for hbase2. We 
> don't have a timeout on assign in AMv2, not yet, that might be the fix, or 
> perhaps a double report before we online a server with the second report 
> coming in after ZK goes up (or we stop doing ephemeral nodes for RS up in ZK 
> and just rely on heartbeats....).
> Making this a critical issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to