[
https://issues.apache.org/jira/browse/HBASE-23247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967109#comment-16967109
]
Hudson commented on HBASE-23247:
--------------------------------
Results for branch branch-2
[build #2344 on
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2344/]:
(x) *{color:red}-1 overall{color}*
----
details (if available):
(/) {color:green}+1 general checks{color}
-- For more information [see general
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2344//General_Nightly_Build_Report/]
(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2)
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2344//JDK8_Nightly_Build_Report_(Hadoop2)/]
(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3)
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2344//JDK8_Nightly_Build_Report_(Hadoop3)/]
(/) {color:green}+1 source release artifact{color}
-- See build output for details.
(/) {color:green}+1 client integration test{color}
> [hbck2] Schedule SCPs for 'Unknown Servers'
> -------------------------------------------
>
> Key: HBASE-23247
> URL: https://issues.apache.org/jira/browse/HBASE-23247
> Project: HBase
> Issue Type: Bug
> Components: hbck2
> Affects Versions: 2.2.2
> Reporter: Michael Stack
> Assignee: Michael Stack
> Priority: Major
> Fix For: 2.2.3
>
>
> I've run into an 'Unknown Server' phenomenon: meta has regions assigned to
> servers that the cluster no longer knows about. You can see the list at the
> end of the 'HBCK Report' page (run 'catalogjanitor_run' in the shell to
> generate a fresh report). The fix is tough if you try
> unassign/assign/close/etc., because the new assign/unassign insists on
> verifying the close succeeded by contacting the 'unknown server', and will
> not move on until it succeeds; TODO. There are a few ways of getting into
> this state of affairs; I'll list a few below in a minute.
> Meantime, an hbck2 'fix' seems just the ticket: run an SCP for the 'Unknown
> Server' and it should clear the meta of all the bad server references.... So
> just schedule an SCP using the scheduleRecoveries command.... Only in this
> case it fails before scheduling the SCP with the below; i.e. an FNFE because
> there is no WAL directory for the 'Unknown Server'.
> {code}
> 22:41:13.909 [main] INFO
> org.apache.hadoop.hbase.client.ConnectionImplementation - Closing master
> protocol: MasterService
> Exception in thread "main" java.io.IOException:
> org.apache.hbase.thirdparty.com.google.protobuf.ServiceException:
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(java.io.FileNotFoundException):
> java.io.FileNotFoundException: File
> hdfs://nameservice1/hbase/genie/WALs/s1.d.com,16020,1571170081872 does not
> exist.
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:986)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:122)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1046)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1043)
> at
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1053)
> at
> org.apache.hadoop.fs.FilterFileSystem.listStatus(FilterFileSystem.java:258)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1802)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1844)
> at
> org.apache.hadoop.hbase.master.MasterRpcServices.containMetaWals(MasterRpcServices.java:2709)
> at
> org.apache.hadoop.hbase.master.MasterRpcServices.scheduleServerCrashProcedure(MasterRpcServices.java:2488)
> at
> org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$HbckService$2.callBlockingMethod(MasterProtos.java)
> at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
> at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
> at
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
> at
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
> at
> org.apache.hadoop.hbase.client.HBaseHbck.scheduleServerCrashProcedures(HBaseHbck.java:175)
> at
> org.apache.hadoop.hbase.client.Hbck.scheduleServerCrashProcedure(Hbck.java:118)
> at org.apache.hbase.HBCK2.scheduleRecoveries(HBCK2.java:345)
> at org.apache.hbase.HBCK2.doCommandLine(HBCK2.java:746)
> at org.apache.hbase.HBCK2.run(HBCK2.java:631)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
> at org.apache.hbase.HBCK2.main(HBCK2.java:865)
> {code}
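For reference, this is what the failing invocation looks like from the operator's side (the server name is the one from the stack trace above; the HBCK2 jar name/path depends on your install and is illustrative here):

```shell
# Schedule a ServerCrashProcedure (SCP) for the 'Unknown Server' via HBCK2.
# Use the server name exactly as shown in the 'HBCK Report' page
# (hostname,port,startcode). The jar path is an assumption.
hbase hbck -j hbase-hbck2-1.0.0.jar scheduleRecoveries s1.d.com,16020,1571170081872
```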
> A simple fix makes it so I can schedule an SCP, which indeed clears out the
> 'Unknown Server' references and restores sanity on the cluster.
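The fix amounts to tolerating a missing WAL directory when the master checks for meta WALs before scheduling the SCP: an absent directory just means there is nothing left to inspect. A minimal sketch of that guard, using java.nio.file in place of the Hadoop FileSystem API (class and method names here are illustrative, not the actual HBase patch):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: before listing the WAL directory of a crashed server,
// check that it exists. A missing directory means "no WALs left to check",
// not an error, so return an empty listing instead of letting a
// FileNotFoundException abort the whole scheduleServerCrashProcedure call.
public class WalDirGuard {
  static List<Path> listWalsIfPresent(Path walDir) {
    if (!Files.isDirectory(walDir)) {
      return Collections.emptyList(); // dir already gone: nothing to inspect
    }
    try (Stream<Path> entries = Files.list(walDir)) {
      return entries.collect(Collectors.toList());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // An 'Unknown Server' whose WAL dir was already split and removed:
    Path missing = Paths.get("/tmp/WALs/s1.d.com,16020,1571170081872");
    System.out.println(listWalsIfPresent(missing).isEmpty());
  }
}
```

With this kind of guard in place, the FNFE path shown above is never reached and the SCP gets scheduled.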
> As to how to get an 'Unknown Server':
> 1. The current scenario came about because an exception while processing a
> server crash procedure made the SCP exit just after splitting logs but
> before it cleared the old assigns. A new server instance that came up after
> this one went down purged the server from the dead servers list even though
> there were still Procedures in flight (the cluster was under a crippling
> overload):
> {code}
> 2019-11-02 21:02:34,775 DEBUG
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Done splitting
> WALs pid=112532, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, locked=true;
> ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true,
> meta=false
> 2019-11-02 21:02:34,775 DEBUG
> org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure
> pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true;
> ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true,
> meta=false as the 2th rollback step
> 2019-11-02 21:02:34,779 INFO
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=112532,
> state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure
> server=s1.d.com,16020,1572668980355, splitWal=true, meta=false found RIT
> pid=101251, ppid=101123, state=SUCCESS, bypass=LOG-REDACTED
> TransitRegionStateProcedure
> table=GENIE2_modality_syncdata, region=fd2bd0f540756b8eba4c99301d2cf359,
> ASSIGN; rit=OPENING, location=s1.d.com,16020,1572668980355,
> table=GENIE2_modality_syncdata, region=fd2bd0f540756b8eba4c99301d2cf359
> 2019-11-02 21:02:34,779 ERROR
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught
> runtime exception: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN,
> locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355,
> splitWal=true, meta=false
> java.lang.NullPointerException
> at
> org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:139)
> at
> org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:132)
> at
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.updateStoreTracker(WALProcedureStore.java:786)
> at
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:741)
> at
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.update(WALProcedureStore.java:605)
> at
> org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.persistAndWake(RegionRemoteProcedureBase.java:183)
> at
> org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.serverCrashed(RegionRemoteProcedureBase.java:240)
> at
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.serverCrashed(TransitRegionStateProcedure.java:409)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.assignRegions(ServerCrashProcedure.java:461)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:221)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:64)
> at
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:194)
> at
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1648)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1395)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965)
> 2019-11-02 21:02:34,779 DEBUG
> org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure
> pid=112532, state=FAILED:SERVER_CRASH_ASSIGN, locked=true,
> exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime
> exception: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true;
> ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true,
> meta=false:java.lang.NullPointerException; ServerCrashProcedure
> server=s1.d.com,16020,1572668980355, splitWal=true, meta=false as the 3th
> rollback step
> 2019-11-02 21:02:34,782 ERROR
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught
> runtime exception for pid=112532, state=FAILED:SERVER_CRASH_ASSIGN,
> locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught
> runtime exception: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN,
> locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355,
> splitWal=true, meta=false:java.lang.NullPointerException;
> ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true,
> meta=false
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_ASSIGN
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
> at
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:219)
> at
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1569)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1501)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1352)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965)
> 2019-11-02 21:02:34,785 ERROR
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught
> runtime exception for pid=112532, state=FAILED:SERVER_CRASH_ASSIGN,
> locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught
> runtime exception: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN,
> locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355,
> splitWal=true, meta=false:java.lang.NullPointerException;
> ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true,
> meta=false
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_ASSIGN
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
> at
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:219)
> at
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1569)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1501)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1352)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965)
> {code}
> 2. I'm pretty sure I also ran into this when I cleared out the MasterProcWALs
> to start over fresh.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)