Michael Stack created HBASE-23247:
-------------------------------------
Summary: [hbck2] Schedule SCPs for 'Unknown Servers'
Key: HBASE-23247
URL: https://issues.apache.org/jira/browse/HBASE-23247
Project: HBase
Issue Type: Bug
Components: hbck2
Affects Versions: 2.2.2
Reporter: Michael Stack
Assignee: Michael Stack
Fix For: 2.2.3
I've run into an 'Unknown Server' phenomenon: meta has regions assigned to
servers that the cluster no longer knows about. The fix is tough because a new
assign insists on confirming that the close succeeded by trying to contact the
'unknown server', and will not move on until it succeeds; TODO. There are a few
ways of arriving at this state of affairs; I'll list a couple below.
Meantime, an hbck2 'fix' should be as simple as scheduling an SCP via the
scheduleRecoveries command, only in this case it fails before scheduling the
SCP with the below; i.e. a FNFE because there is no WAL dir for the 'Unknown
Server'.
{code}
22:41:13.909 [main] INFO org.apache.hadoop.hbase.client.ConnectionImplementation - Closing master protocol: MasterService
Exception in thread "main" java.io.IOException: org.apache.hbase.thirdparty.com.google.protobuf.ServiceException: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(java.io.FileNotFoundException): java.io.FileNotFoundException: File hdfs://nameservice1/hbase/genie/WALs/s1.d.com,16020,1571170081872 does not exist.
  at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:986)
  at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:122)
  at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1046)
  at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1043)
  at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
  at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1053)
  at org.apache.hadoop.fs.FilterFileSystem.listStatus(FilterFileSystem.java:258)
  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1802)
  at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1844)
  at org.apache.hadoop.hbase.master.MasterRpcServices.containMetaWals(MasterRpcServices.java:2709)
  at org.apache.hadoop.hbase.master.MasterRpcServices.scheduleServerCrashProcedure(MasterRpcServices.java:2488)
  at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$HbckService$2.callBlockingMethod(MasterProtos.java)
  at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413)
  at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
  at org.apache.hadoop.hbase.client.HBaseHbck.scheduleServerCrashProcedures(HBaseHbck.java:175)
  at org.apache.hadoop.hbase.client.Hbck.scheduleServerCrashProcedure(Hbck.java:118)
  at org.apache.hbase.HBCK2.scheduleRecoveries(HBCK2.java:345)
  at org.apache.hbase.HBCK2.doCommandLine(HBCK2.java:746)
  at org.apache.hbase.HBCK2.run(HBCK2.java:631)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
  at org.apache.hbase.HBCK2.main(HBCK2.java:865)
{code}
A simple fix makes it so I can schedule an SCP, which indeed clears out the
'Unknown Server' and restores sanity to the cluster.
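The shape of that fix can be sketched roughly as follows. This is a standalone toy using {{java.nio.file}} stand-ins for the HDFS calls, not the actual MasterRpcServices code; the method name {{listServerWals}} and the paths are made up for illustration. The idea is just to treat a missing WAL dir as "no WALs to worry about" rather than letting the FNFE abort scheduling:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class UnknownServerWals {
  // Hypothetical stand-in for a containMetaWals-style check: list the
  // server's WAL dir, but treat a missing dir as "no WALs" instead of
  // letting a FileNotFoundException abort scheduling of the SCP.
  static List<Path> listServerWals(Path walsRoot, String serverName) throws IOException {
    Path serverDir = walsRoot.resolve(serverName);
    List<Path> wals = new ArrayList<>();
    if (!Files.isDirectory(serverDir)) {
      // 'Unknown Server': its WAL dir is already gone. Return empty so the
      // caller can still schedule an SCP that clears the stale assignments.
      return wals;
    }
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(serverDir)) {
      for (Path p : stream) {
        wals.add(p);
      }
    }
    return wals;
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("WALs");
    // No dir for this server exists, mimicking the 'Unknown Server' case.
    List<Path> wals = listServerWals(root, "s1.example.com,16020,1571170081872");
    System.out.println("wal count for unknown server = " + wals.size());
  }
}
```

With a guard like this in place the scheduleRecoveries call can proceed to queue the SCP instead of dying on the missing directory.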
As to how you end up with an 'Unknown Server':
1. The current scenario came about because the exception below, hit while
processing a server crash procedure, made the SCP exit just after splitting
logs but before it cleared the old assigns. A new server instance that came up
after this one went down purged the server from the dead servers list even
though there were still Procedures in flight (the cluster was under crippling
overload).
{code}
2019-11-02 21:02:34,775 DEBUG org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Done splitting WALs pid=112532, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false
2019-11-02 21:02:34,775 DEBUG org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false as the 2th rollback step
2019-11-02 21:02:34,779 INFO org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false found RIT pid=101251, ppid=101123, state=SUCCESS, bypass=LOG-REDACTED TransitRegionStateProcedure table=GENIE2_modality_syncdata, region=fd2bd0f540756b8eba4c99301d2cf359, ASSIGN; rit=OPENING, location=s1.d.com,16020,1572668980355, table=GENIE2_modality_syncdata, region=fd2bd0f540756b8eba4c99301d2cf359
2019-11-02 21:02:34,779 ERROR org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false
java.lang.NullPointerException
  at org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:139)
  at org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:132)
  at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.updateStoreTracker(WALProcedureStore.java:786)
  at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:741)
  at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.update(WALProcedureStore.java:605)
  at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.persistAndWake(RegionRemoteProcedureBase.java:183)
  at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.serverCrashed(RegionRemoteProcedureBase.java:240)
  at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.serverCrashed(TransitRegionStateProcedure.java:409)
  at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.assignRegions(ServerCrashProcedure.java:461)
  at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:221)
  at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:64)
  at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:194)
  at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1648)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1395)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965)
2019-11-02 21:02:34,779 DEBUG org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure pid=112532, state=FAILED:SERVER_CRASH_ASSIGN, locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime exception: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false:java.lang.NullPointerException; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false as the 3th rollback step
2019-11-02 21:02:34,782 ERROR org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=112532, state=FAILED:SERVER_CRASH_ASSIGN, locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime exception: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false:java.lang.NullPointerException; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false
java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_ASSIGN
  at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
  at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
  at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:219)
  at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1569)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1501)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1352)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965)
2019-11-02 21:02:34,785 ERROR org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught runtime exception for pid=112532, state=FAILED:SERVER_CRASH_ASSIGN, locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught runtime exception: pid=112532, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false:java.lang.NullPointerException; ServerCrashProcedure server=s1.d.com,16020,1572668980355, splitWal=true, meta=false
java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_ASSIGN
  at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
  at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
  at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:219)
  at org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1569)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1501)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1352)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1965)
{code}
2. I'm pretty sure I also ran into this when I cleared out the MasterProcWAL
to start over fresh.
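One plausible reading of the NullPointerException in scenario 1 is that the procedure's tracker state had already been purged while the SCP was still in flight, so the in-flight update dereferenced a missing node. Here is a deliberately simplified toy model of that race; it is not HBase's actual ProcedureStoreTracker, just a map of bitmap "nodes" keyed by procedure-id block, made up to show the failure shape:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class TrackerRace {
  // Toy stand-in for a procedure store tracker: one bitmap "node" per
  // block of 64 procedure ids. Simplified model only, not HBase internals.
  static final Map<Long, BitSet> nodes = new HashMap<>();

  static void insert(long procId) {
    nodes.computeIfAbsent(procId / 64, k -> new BitSet(64)).set((int) (procId % 64));
  }

  static void update(long procId) {
    // Failure shape from the log: if the node covering procId was already
    // removed, the lookup returns null and this dereference throws NPE.
    BitSet node = nodes.get(procId / 64);
    node.set((int) (procId % 64));
  }

  public static void main(String[] args) {
    insert(112532L);
    // State purged while the procedure is still in flight.
    nodes.remove(112532L / 64);
    try {
      update(112532L);
      System.out.println("update succeeded");
    } catch (NullPointerException e) {
      System.out.println("NullPointerException on in-flight update");
    }
  }
}
```

The point of the toy is only that purging procedure state out from under a live SCP leaves a dangling reference, which matches the SCP dying between SERVER_CRASH_SPLIT_LOGS and clearing the old assigns.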
--
This message was sent by Atlassian Jira
(v8.3.4#803005)