[
https://issues.apache.org/jira/browse/HBASE-24368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107908#comment-17107908
]
Hudson commented on HBASE-24368:
--------------------------------
Results for branch branch-2.3
[build #88 on
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/88/]:
(x) *{color:red}-1 overall{color}*
----
details (if available):
(/) {color:green}+1 general checks{color}
-- For more information [see general
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/88/General_20Nightly_20Build_20Report/]
(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2)
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/88/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]
(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3)
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/88/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(x) {color:red}-1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/88/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 source release artifact{color}
-- See build output for details.
(/) {color:green}+1 client integration test{color}
> Let HBCKSCP clear 'Unknown Servers', even if RegionStateNode has
> RegionLocation == null
> ---------------------------------------------------------------------------------------
>
> Key: HBASE-24368
> URL: https://issues.apache.org/jira/browse/HBASE-24368
> Project: HBase
> Issue Type: Bug
> Components: hbck2
> Affects Versions: 2.3.0
> Reporter: Michael Stack
> Assignee: Michael Stack
> Priority: Major
> Fix For: 3.0.0-alpha-1, 2.3.0
>
>
> This is an incidental issue I noticed while in a hole trying to fix up a
> cluster. The 'obvious' remediation didn't work. This issue is about
> addressing that.
> HBASE-23594 added filtering of Regions on the crashed server to handle the
> case where an Assign may be concurrent with the ServerCrashProcedure. To
> avoid a double assign, the SCP will skip the assign if the RegionStateNode
> RegionLocation is not that of the crashed server.
> This is good.
> Where it becomes an obstacle is when a Region is stuck in OPENING state,
> references an 'Unknown Server' -- a server no longer tracked by the Master --
> and has no assign currently in flight. In this case, a ServerCrashProcedure
> scheduled to clean up the reference to the Unknown Server and get the Region
> reassigned skips the Region, because the RegionStateNode in the Master has a
> RegionLocation that does not match that of the ServerCrashProcedure. This
> happens even when the RegionLocation is null (we set the RegionLocation to
> null when we fail an assign, as we might if the server is no longer part of
> the cluster).
> For background, the cluster had a RIT (Region In Transition). The RIT was a
> Region failing to open because of a missing Reference (a separate issue). The
> Region open would fail with a FileNotFoundException (FNFE). The master would
> attempt the assign and then fail when it went to confirm OPEN, logging a
> complaint about the FNFE in the master logs and asking for operator
> intervention.
> This state was in place for weeks on this particular cluster (a dev cluster
> not under close observation). The cluster had been restarted once or twice,
> so the server the Region had once been on was no longer 'known', but it still
> had an entry in the hbase:meta table as the last location assigned (the now
> 'Unknown Server').
> To fix, I went about the task in the wrong order. I bypassed the long-running
> stuck procedure to terminate it and clean up 'Procedures and Locks'. Mistake.
> Now there was no longer an assign Procedure for this Region, but I had a
> Region in OPENING state with a reference to an unknown server and an
> in-memory RegionStateNode whose RegionLocation was null (set null on each
> failed assign). After running catalogjanitor_run and hbck_chore_report, the
> unknown server showed in the 'HBCK Report' in the 'Unknown Servers' list.
> Attempts at assign failed because the Region was in OPENING state -- you
> can't assign a Region in OPENING state. Scheduling an HBCKSCP via hbck2
> scheduleRecoveries always generated the below in the logs.
> {code}
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=157217,
> state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; HBCKServerCrashProcedure
> server=unknown_server.example.com,16020,1587577972683, splitWal=true,
> meta=false found a region state=OPENING, location=null,
> table=bobby_analytics, region=1501ea3bd822c1a3e4e6216ea48733bd which is no
> longer on us unknown_server.example.com,16020,1587577972683, give up
> assigning...
> {code}
> My workaround was to set the region state to CLOSED with hbck2 and then do an
> assign with hbck2. At this point I noticed the FNFE. It would have been
> easier if the HBCKSCP had worked.
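
The filtering rule in the quoted description, and the relaxation named in the issue title, can be sketched roughly as below. This is a simplified, hypothetical model and not HBase source: the class name ScpFilterSketch, the method shouldReassign, and the use of plain strings for server names are all assumptions made for illustration. It assumes that only the operator-scheduled HBCKSCP variant treats a null RegionLocation as a match.

```java
import java.util.Objects;

/**
 * Hypothetical, simplified model of the SCP region filter described in the
 * issue. None of these names come from HBase itself.
 */
final class ScpFilterSketch {

    private ScpFilterSketch() {
    }

    /**
     * Decides whether a crash procedure should reassign a region.
     *
     * @param crashedServer  server name the procedure was scheduled for
     * @param regionLocation location held by the in-memory region state;
     *                       may be null after a failed assign
     * @param fromHbck       true for an operator-scheduled HBCKSCP
     *                       (assumption: only this variant accepts null)
     */
    static boolean shouldReassign(String crashedServer, String regionLocation,
            boolean fromHbck) {
        Objects.requireNonNull(crashedServer, "crashedServer");
        // Original rule: reassign only when the region is still recorded as
        // being on the crashed server, to avoid a double assign.
        if (crashedServer.equals(regionLocation)) {
            return true;
        }
        // Relaxed rule (assumed): an HBCKSCP also takes regions whose
        // location was nulled out by an earlier failed assign.
        return fromHbck && regionLocation == null;
    }

    public static void main(String[] args) {
        String unknown = "unknown_server.example.com,16020,1587577972683";
        System.out.println("plain SCP, null location: "
                + shouldReassign(unknown, null, false));
        System.out.println("HBCKSCP, null location: "
                + shouldReassign(unknown, null, true));
    }
}
```

Under this sketch a region located on a different, known server is still skipped in both cases, so the double-assign protection from HBASE-23594 is kept.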