[jira] [Commented] (HBASE-24368) Let HBCKSCP clear 'Unknown Servers', even if RegionStateNode has RegionLocation == null

Hudson (Jira) Thu, 14 May 2020 21:01:27 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-24368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107908#comment-17107908
 ]


Hudson commented on HBASE-24368:
--------------------------------

Results for branch branch-2.3
        [build #88 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/88/]: 
(x) *{color:red}-1 overall{color}*
----
details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/88/General_20Nightly_20Build_20Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/88/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/88/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(x) {color:red}-1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2.3/88/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Let HBCKSCP clear 'Unknown Servers', even if RegionStateNode has 
> RegionLocation == null
> ---------------------------------------------------------------------------------------
>
>                 Key: HBASE-24368
>                 URL: https://issues.apache.org/jira/browse/HBASE-24368
>             Project: HBase
>          Issue Type: Bug
>          Components: hbck2
>    Affects Versions: 2.3.0
>            Reporter: Michael Stack
>            Assignee: Michael Stack
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 2.3.0
>
>
> This is an incidental issue noticed while in a hole trying to fix up a cluster.
> The 'obvious' remediation didn't work. This issue is about addressing that.
> HBASE-23594 added a filtering of Regions on the crashed server to handle the 
> case where an Assign may be concurrent to the ServerCrashProcedure. To avoid 
> double assign, the SCP will skip assign if the RegionStateNode RegionLocation 
> is not that of the crashed server.
> This is good.
> It becomes an obstacle when a Region is stuck in OPENING state, references an 
> 'Unknown Server' (a server no longer tracked by the Master), and has no 
> assign currently in flight. In this case, a ServerCrashProcedure scheduled to 
> clean up the reference to the Unknown Server and get the Region reassigned 
> skips the Region whenever the RegionStateNode in the Master has a 
> RegionLocation that does not match the server of the ServerCrashProcedure, 
> even when the RegionLocation is null (we set the RegionLocation to null when 
> we fail an assign, as we might if the server is no longer part of the cluster).
> For background, the cluster had a RIT (Region In Transition). The RIT was a 
> Region failing to open because of a missing Reference (another issue). The 
> Region open would fail with a FileNotFoundException (FNFE). The master would 
> attempt the assign and then fail when it went to confirm OPEN, logging a 
> complaint about the FNFE and asking for operator intervention in the master logs.
> This state was in place for weeks on this particular cluster (a dev cluster 
> not under close observation). The cluster had been restarted once or twice so 
> the server the Region had once been on was no longer 'known' but it still had 
> an entry in the hbase:meta table as last location assigned (The now 'Unknown 
> Server').
> To fix, I went about the task in the wrong order. I bypassed the long-running 
> stuck procedure to terminate it and clean up 'Procedures and Locks'. Mistake. 
> Now there was no longer an assign Procedure for this Region, but I had a 
> Region in OPENING state that referenced an unknown server, with an in-memory 
> RegionStateNode whose RegionLocation was null (set null on each failed 
> assign). After running catalogjanitor_run and hbck_chore_report, the unknown 
> server showed in the 'Unknown Servers' list of the 'HBCK Report'. Attempts at 
> assign failed because the Region was in OPENING state; you can't assign a 
> Region in OPENING state. Scheduling an HBCKSCP via hbck2 scheduleRecoveries 
> always generated the below in the logs.
> {code}
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=157217, 
> state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; HBCKServerCrashProcedure 
> server=unknown_server.example.com,16020,1587577972683, splitWal=true, 
> meta=false found a region state=OPENING, location=null, 
> table=bobby_analytics, region=1501ea3bd822c1a3e4e6216ea48733bd which is no 
> longer on us unknown_server.example.com,16020,1587577972683, give up 
> assigning...
> {code}
> My workaround was setting the region state to CLOSED with hbck2 and then doing 
> an assign with hbck2. At this point I noticed the FNFE. It would have been 
> easier if the HBCKSCP had worked.
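
The skip described in the issue, where the SCP gives up on a Region whose
RegionStateNode has a null RegionLocation, can be illustrated with a small
sketch. This is not the HBase source: the class, the RegionNode stand-in, and
both method names are hypothetical, and server names are modeled as plain
strings. It only contrasts a strict location match with a variant that also
accepts a null location, which is roughly what the issue title asks for in the
HBCKSCP case.

```java
/** Hypothetical, simplified model; none of these names are actual HBase classes. */
final class ScpAssignFilterSketch {

  /** Stand-in for the Master's in-memory RegionStateNode. */
  static final class RegionNode {
    final String state;          // e.g. "OPENING"
    final String regionLocation; // server name, or null after a failed assign

    RegionNode(String state, String regionLocation) {
      this.state = state;
      this.regionLocation = regionLocation;
    }
  }

  /**
   * Strict filter, as the issue describes the SCP behavior: reassign only when
   * the node's location is exactly the crashed server. A null location fails.
   */
  static boolean shouldReassignStrict(RegionNode node, String crashedServer) {
    return crashedServer.equals(node.regionLocation);
  }

  /**
   * Assumed lenient variant for the HBCKSCP case: a null location is also
   * accepted, so a Region stuck with location=null is not skipped.
   */
  static boolean shouldReassignLenient(RegionNode node, String crashedServer) {
    return node.regionLocation == null || crashedServer.equals(node.regionLocation);
  }
}
```

With the values from the log above (state=OPENING, location=null), the strict
check rejects the Region and the lenient check accepts it. A Region whose
location is some other live server is still skipped by both, which preserves
the double-assign protection attributed to HBASE-23594.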



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
