[
https://issues.apache.org/jira/browse/HBASE-25142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17214044#comment-17214044
]
Josh Elser commented on HBASE-25142:
------------------------------------
{quote}The 'unknown server' means we have bugs in code so we are not 100% sure
what is the real problem and whether it is 100% safe to fix.
{quote}
I keep coming back and thinking more about this. I've not made an exhaustive
search (so maybe I'll rely on y'all to vet the idea), but I think we have two
paths which might lead to an "unknown" server:
* We lost WALs: both of Stack's examples in the description are indicative of
that; one was intentional and the other was circumstantial.
* We have a bug. I can respect Duo's caution around trying to automatically
fix something that we don't yet understand.
Is there a third (or more) that I'm missing?
The second thing I keep wondering about: say we do hit Duo's concern and have
some unknown bug where we incorrectly clean up the WALs for a RS before the
Master can submit an SCP (say it crashed at the same time and didn't see the ZK
expiration). Is marking a server as unknown and reassigning regions that were
on it going to cause problems _beyond_ the problems we already have (WALs went
missing)? I can't come up with a situation (as long as the server isn't
registered in ZK) where starting to reassign regions which were once on that
server is going to make matters worse. I fully acknowledge that I may be
missing a subtle condition :)
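To make the condition above concrete, here is a minimal sketch in Python rather than HBase code. The `ServerState` fields and the `may_reassign` function are hypothetical names used for illustration only, not HBase APIs, and the real Master tracks considerably more state than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerState:
    """Hypothetical snapshot of what the Master knows about one server."""
    name: str
    referenced_in_meta: bool  # hbase:meta still lists regions on this server
    online: bool              # recent heartbeat seen
    in_dead_list: bool        # present in the dead servers list
    registered_in_zk: bool    # an RS znode still exists for this server


def may_reassign(s: ServerState) -> bool:
    """Allow reassignment only for an 'unknown' server with no ZK registration."""
    is_unknown = s.referenced_in_meta and not s.online and not s.in_dead_list
    return is_unknown and not s.registered_in_zk


# Sample data (assumed): same unknown server, with and without a znode.
gone = ServerState("rs3.example.com,16020,1600000000003", True, False, False, False)
still_in_zk = ServerState("rs4.example.com,16020,1600000000004", True, False, False, True)
```

With this guard, `gone` would be eligible for reassignment while `still_in_zk` would be left alone until its ZK registration is resolved.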
{quote}I would suggest that we should at least provide an option for the 'auto'
fix.
{quote}
Just making sure, Duo: you're saying you'd be happy with a solution in HBase
that fixes this automatically, as long as it is off by default?
> Auto-fix 'Unknown Server'
> -------------------------
>
> Key: HBASE-25142
> URL: https://issues.apache.org/jira/browse/HBASE-25142
> Project: HBase
> Issue Type: Improvement
> Reporter: Michael Stack
> Priority: Major
>
> Addressing reports of 'Unknown Server' has come up in various conversations
> lately. This issue is about fixing instances of 'Unknown Server'
> automatically as part of the tasks undertaken by CatalogJanitor when it runs.
> First though, would like to figure a definition for 'Unknown Server' and a
> list of ways in which they arise. We need this to figure how to do safe
> auto-fixing.
> Currently an 'Unknown Server' is a server found in hbase:meta that is not
> online (no recent heartbeat) and that is not mentioned in the dead servers
> list.
> In outline, I'd think CatalogJanitor could schedule an expiration of the RS
> znode in zk (if it exists) and then an SCP if it finds an 'Unknown Server'.
> Perhaps it waits for 2x or 10x the heartbeat interval just-in-case (or not).
> The SCP would clean up any references in hbase:meta by reassigning Regions
> assigned to the 'Unknown Server' after replaying any WALs found in hdfs
> attributed to the dead server.
> As to how they arise:
> * A contrived illustration would be a large online cluster crashes down with
> a massive backlog of WAL files – zk went down for some reason say. The replay
> of the WALs looks like it could take a very long time (let's say the cluster
> was badly configured and a bug and misconfig made it so each RS was carrying
> hundreds of WALs and there are hundreds of servers). To get the service back
> online, the procedure store and WALs are moved aside (for later replay with
> WALPlayer). The cluster comes up. meta is onlined but refers to server
> instances that are no longer around. Can schedule an SCP per server mentioned
> in the 'HBCK Report' by scraping and scripting hbck2 or, better,
> catalogjanitor could just do it.
> * HBASE-24286 HMaster won't become healthy after cloning... describes
> starting a cluster over data that is hfile-content only. In this case the
> original servers used to manufacture the hfile cluster data are long dead yet
> meta still refers to the old servers. They will not make the 'dead servers'
> list.
> Let this issue stew awhile. Meantime collect how 'Unknown Server' gets
> created and best way to fix.
>
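The definition in the description (a server found in hbase:meta that is not online and not in the dead servers list) amounts to a set difference. A minimal sketch, assuming the three server sets have already been collected; `find_unknown_servers` and the sample server names are hypothetical, not HBase APIs:

```python
def find_unknown_servers(meta_servers, online_servers, dead_servers):
    """Return servers referenced in hbase:meta that are neither online nor dead-listed."""
    return set(meta_servers) - set(online_servers) - set(dead_servers)


# Sample data (assumed), using the host,port,startcode naming style.
meta = {
    "rs1.example.com,16020,1600000000001",
    "rs2.example.com,16020,1600000000002",
    "rs3.example.com,16020,1600000000003",
}
online = {"rs1.example.com,16020,1600000000001"}
dead = {"rs2.example.com,16020,1600000000002"}

unknown = find_unknown_servers(meta, online, dead)
```

Here only `rs3` ends up in `unknown`: it is still referenced by meta but is neither heartbeating nor accounted for in the dead servers list.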