[
https://issues.apache.org/jira/browse/HBASE-25142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17225148#comment-17225148
]
Michael Stack commented on HBASE-25142:
---------------------------------------
Just ran into this again. Updating a cluster from hbase-2.1.x. The procedure
store WALs were corrupt. I was unable to fix them. Updated to hbase-2.3. SCPs
were present for all servers but were making no progress (later I found that
the master would get 'stuck', unable to make progress accessing hdfs – this
didn't help). I moved the corrupt store aside. Hand-assigned the meta and
namespace tables. Most Regions assigned, but I was left with 500 in RIT and
1200 reports of unknown servers (probably lots of overlap w/ the queued SCP
list that had been moved aside). I copy/pasted the unknown servers section to
a text file, parsed out and sorted unique servernames (~200)... and then, per
servername, scripted it so I did a schedule recovery per server – one at a
time, because I had a bad experience once doing many at a time and don't want
to fix 'that' problem just now. Painful.
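The parse-and-schedule step above could be scripted roughly like so. This is a
sketch, not what was actually run: the hbck2 jar path is an assumption, and the
regex just matches HBase's standard hostname,port,startcode ServerName form. It
feeds servers to hbck2's scheduleRecoveries one at a time.

```python
#!/usr/bin/env python3
"""Sketch: parse the pasted 'Unknown Servers' section of the Master UI
'HBCK Report' and schedule one recovery at a time via hbck2."""
import re
import subprocess

# Matches ServerName strings such as host1.example.com,16020,1600000000001
SERVERNAME_RE = re.compile(r"[\w.-]+,\d+,\d+")

def parse_unknown_servers(report_text):
    """Return sorted, de-duplicated servernames found in the report text."""
    return sorted(set(SERVERNAME_RE.findall(report_text)))

def schedule_recoveries_one_at_a_time(servers):
    """Queue an SCP per server via hbck2's scheduleRecoveries command,
    waiting for each invocation to return before starting the next."""
    for server in servers:
        subprocess.run(
            ["hbase", "hbck", "-j", "hbase-hbck2.jar",  # assumed jar path
             "scheduleRecoveries", server],
            check=True)
```

Hand the pasted report file's contents to `parse_unknown_servers` and pass the
result to `schedule_recoveries_one_at_a_time`; the one-at-a-time loop is the
point, per the bad experience above with scheduling many at once.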
Later I ran into it on another cluster altogether where there was a strange
'hdfs hang'; i.e. the SCP was making no progress on WAL splitting. I bypassed
it, changed Masters, and retried, but the SCP got stuck again. Odd. I had to
leave it, and then the next day, when there was no load on the cluster, the
scheduled SCP just went through and all was cleaned up. An auto-fix would have
piled up SCPs... all stuck against HDFS.
This latter makes me think it would be better, for now, to add a fix for
'unknown servers' to 'hbck2 fixMeta' rather than the auto-fix this issue's
subject proposes.
> Auto-fix 'Unknown Server'
> -------------------------
>
> Key: HBASE-25142
> URL: https://issues.apache.org/jira/browse/HBASE-25142
> Project: HBase
> Issue Type: Improvement
> Reporter: Michael Stack
> Priority: Major
>
> Addressing reports of 'Unknown Server' has come up in various conversations
> lately. This issue is about fixing instances of 'Unknown Server'
> automatically as part of the tasks undertaken by CatalogJanitor when it runs.
> First though, I would like to figure out a definition of 'Unknown Server' and
> a list of the ways in which they arise. We need this to figure out how to do
> safe auto-fixing.
> Currently an 'Unknown Server' is a server found in hbase:meta that is not
> online (no recent heartbeat) and that is not mentioned in the dead servers
> list.
> In outline, I'd think CatalogJanitor could schedule an expiration of the RS
> znode in zk (if it exists) and then an SCP when it finds an 'Unknown Server'.
> Perhaps it waits 2x or 10x the heartbeat interval just in case (or not). The
> SCP would clean up any references in hbase:meta by reassigning Regions
> assigned to the 'Unknown Server' after replaying any WALs found in hdfs
> attributed to the dead server.
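> In rough pseudocode, that outline (the wait multiplier, the znode handling,
> and the check order here are assumptions, not settled behavior):
>
> ```
> for server in serversReferencedBy(hbase:meta):
>     if not online(server) and server not in deadServersList:
>         # This is an 'Unknown Server'.
>         if znodeExists(server):
>             expireZnode(server)         # clear the stale RS znode in zk
>         wait(N * heartbeatInterval)     # N = 2 or 10, just-in-case
>         scheduleSCP(server)             # splits WALs, reassigns Regions
> ```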
> As to how they arise:
> * A contrived illustration would be a large online cluster crashing down with
> a massive backlog of WAL files – say zk went down for some reason. The replay
> of the WALs looks like it could take a very long time (let's say the cluster
> was badly configured, and a bug plus the misconfig meant each RS was carrying
> hundreds of WALs, and there are hundreds of servers). To get the service back
> online, the procedure store and WALs are moved aside (for later replay with
> WALPlayer). The cluster comes up. meta is onlined but refers to server
> instances that are no longer around. One can schedule an SCP per server
> mentioned in the 'HBCK Report' by scraping and scripting hbck2 or, better,
> CatalogJanitor could just do it.
> * HBASE-24286 'HMaster won't become healthy after cloning...' describes
> starting a cluster over data that is hfile-content only. In this case the
> original servers used to manufacture the hfile cluster data are long dead,
> yet meta still refers to the old servers. They will not make the 'dead
> servers' list.
> Let this issue stew awhile. Meantime, collect how 'Unknown Server' gets
> created and the best way to fix it.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)