[ 
https://issues.apache.org/jira/browse/HBASE-25142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17214044#comment-17214044
 ] 

Josh Elser commented on HBASE-25142:
------------------------------------

{quote}The 'unknown server' means we have bugs in code so we are not 100% sure 
what is the real problem and whether it is 100% safe to fix.
{quote}
I keep coming back and thinking more about this. I've not made an exhaustive 
search (so maybe I'll rely on y'all to vet the idea), but I think we have two 
paths which might lead to an "unknown" server:
 * We lost WALs: both of Stack's examples in the description are indicative of 
that. One was intentional and the other was circumstantial.
 * We have a bug. I can respect Duo's caution around trying to automatically 
fix something that we don't yet understand.

Is there a third (or more) that I'm missing?

The second thing I keep wondering about: say we do hit Duo's concern and have 
some unknown bug where we incorrectly clean up the WALs for a RS before the 
Master can submit an SCP (say it crashed at the same time and didn't see the ZK 
expiration). Is marking a server as unknown and reassigning regions that were 
on it going to cause problems _beyond_ the problems we already have (WALs went 
missing)? As long as the server isn't registered in ZK, I can't come up with a 
situation where starting to reassign regions which were once on that server is 
going to make matters worse. I fully acknowledge that I may be missing a subtle 
condition :)
{quote}I would suggest that we should at least provide an option for the 'auto' 
fix.
{quote}
Just making sure. Duo – you're saying you'd be happy with a solution in HBase 
to automatically fix this, as long as it was off-by-default?
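
A minimal sketch of how an off-by-default check along these lines might look, using plain Java sets in place of HBase's real server tracking. The class, method, and flag names are hypothetical and are not existing HBase APIs or configuration keys. It only combines the definition from the issue description (referenced in hbase:meta, not online, not on the dead servers list) with the "not registered in ZK" condition above.

```java
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: plain collections stand in for hbase:meta, the online
// server list, the dead servers list, and the RS znodes in ZK.
final class UnknownServerSketch {

    // Hypothetical switch; off by default so operators must opt in.
    static final boolean AUTO_FIX_DEFAULT = false;

    // "Unknown" per the issue description: referenced in hbase:meta but
    // neither online nor on the dead servers list.
    static Set<String> findUnknownServers(Set<String> referencedInMeta,
                                          Set<String> online,
                                          Set<String> dead) {
        Set<String> unknown = new TreeSet<>(referencedInMeta);
        unknown.removeAll(online);
        unknown.removeAll(dead);
        return unknown;
    }

    // Candidates for a scheduled SCP: unknown servers with no ZK registration,
    // and only when the auto-fix has been explicitly enabled.
    static Set<String> serversToAutoFix(Set<String> unknown,
                                        Set<String> registeredInZk,
                                        boolean autoFixEnabled) {
        Set<String> candidates = new TreeSet<>();
        if (!autoFixEnabled) {
            return candidates;
        }
        candidates.addAll(unknown);
        candidates.removeAll(registeredInZk);
        return candidates;
    }
}
```

With the flag left at its default, `serversToAutoFix` returns an empty set, so nothing changes for clusters that have not opted in.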

> Auto-fix 'Unknown Server'
> -------------------------
>
>                 Key: HBASE-25142
>                 URL: https://issues.apache.org/jira/browse/HBASE-25142
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Michael Stack
>            Priority: Major
>
> Addressing reports of 'Unknown Server' has come up in various conversations 
> lately. This issue is about fixing instances of 'Unknown Server' 
> automatically as part of the tasks undertaken by CatalogJanitor when it runs.
> First though, would like to figure a definition for 'Unknown Server' and a 
> list of ways in which they arise. We need this to figure how to do safe 
> auto-fixing.
> Currently an 'Unknown Server' is a server found in hbase:meta that is not 
> online (no recent heartbeat) and that is not mentioned in the dead servers 
> list.
> In outline, I'd think CatalogJanitor could schedule an expiration of the RS 
> znode in zk (if exists) and then an SCP if it finds an 'Unknown Server'. 
> Perhaps it waits for 2x or 10x the heartbeat interval just-in-case (or not). 
> The SCP would clean up any references in hbase:meta by reassigning Regions 
> assigned the 'Unknown Server' after replaying any WALs found in hdfs 
> attributed to the dead server.
> As to how they arise:
>  * A contrived illustration would be a large online cluster crashes down with 
> a massive backlog of WAL files – zk went down for some reason say. The replay 
> of the WALs looks like it could take a very long time (let's say the cluster 
> was badly configured and a bug and misconfig made it so each RS was carrying 
> hundreds of WALs and there are hundreds of servers). To get the service back 
> online, the procedure store and WALs are moved aside (for later replay with 
> WALPlayer). The cluster comes up. meta is onlined but refers to server 
> instances that are no longer around. Can schedule an SCP per server mentioned 
> in the 'HBCK Report' by scraping and scripting hbck2 or, better, 
> catalogjanitor could just do it.
>  * HBASE-24286 HMaster won't become healthy after after cloning... describes 
> starting a cluster over data that is hfile-content only. In this case the 
> original servers used to manufacture the hfile cluster data are long dead yet 
> meta still refers to the old servers. They will not make the 'dead servers' 
> list.
> Let this issue stew awhile. Meantime collect how 'Unknown Server' gets 
> created and best way to fix.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
