stack created HBASE-21124:
-----------------------------
Summary: SCP stuck unable to split WALs (NoServerDispatchException and STUCK regions in CLOSING in logs)
Key: HBASE-21124
URL: https://issues.apache.org/jira/browse/HBASE-21124
Project: HBase
Issue Type: Bug
Components: amv2
Reporter: stack
At first blush: we go to dispatch an unassign to a server that is not online, and a NoServerDispatchException is thrown. The server is not online, so an SCP is NOT scheduled and the region close never goes to completion. Eventually regions complain that they are STUCK.
{code}
2018-08-27 15:43:41,578 WARN [PEWorker-13] assignment.RegionTransitionProcedure: Remote call failed ve0530.halxg.cloudera.com,16020,1535387669315; pid=20723, ppid=20578, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; UnassignProcedure table=IntegrationTestBigLinkedList, region=464487fa9329df991cae84519adf253c, server=ve0530.halxg.cloudera.com,16020,1535387669315; rit=CLOSING, location=ve0530.halxg.cloudera.com,16020,1535387669315; exception=NoServerDispatchException
org.apache.hadoop.hbase.procedure2.NoServerDispatchException: ve0530.halxg.cloudera.com,16020,1535387669315; pid=20723, ppid=20578, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; UnassignProcedure table=IntegrationTestBigLinkedList, region=464487fa9329df991cae84519adf253c, server=ve0530.halxg.cloudera.com,16020,1535387669315
    at org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.addOperationToNode(RemoteProcedureDispatcher.java:177)
    at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.addToRemoteDispatcher(RegionTransitionProcedure.java:259)
    at org.apache.hadoop.hbase.master.assignment.UnassignProcedure.updateTransition(UnassignProcedure.java:207)
    at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:345)
    at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:97)
    at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:876)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1556)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1344)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:76)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1855)
2018-08-27 15:43:41,578 WARN [PEWorker-16] assignment.UnassignProcedure: Expiring ve0530.halxg.cloudera.com,16020,1535387669315, pid=20720, ppid=20580, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; UnassignProcedure table=IntegrationTestBigLinkedList, region=aa45bd6f1aabd0ee778c2664cca76d5e, server=ve0530.halxg.cloudera.com,16020,1535387669315 rit=CLOSING, location=ve0530.halxg.cloudera.com,16020,1535387669315; exception=NoServerDispatchException
2018-08-27 15:43:41,578 WARN [PEWorker-16] master.ServerManager: Expiration of ve0530.halxg.cloudera.com,16020,1535387669315 but server not online
2018-08-27 15:43:41,578 INFO [PEWorker-16] master.ServerManager: Processing expiration of ve0530.halxg.cloudera.com,16020,1535387669315 on ve0524.halxg.cloudera.com,16000,1535403620803
{code}
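The stuck-unassign path above can be modeled with a toy sketch (hypothetical classes, not the real AMv2 code): the dispatch throws because the target server has no node registered in the dispatcher, and because the server is already considered offline, expiring it schedules nothing, so the region never leaves CLOSING.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the stuck-unassign path (hypothetical classes; the real code
// lives in RemoteProcedureDispatcher/UnassignProcedure/ServerManager).
public class DispatchStuckSketch {
  static class NoServerDispatchException extends Exception {
    NoServerDispatchException(String server) { super(server); }
  }

  static class Dispatcher {
    final Map<String, Object> nodes = new HashMap<>();
    // Throws when the target server has no registered node, mirroring
    // RemoteProcedureDispatcher.addOperationToNode in the trace above.
    void addOperationToNode(String server) throws NoServerDispatchException {
      if (!nodes.containsKey(server)) throw new NoServerDispatchException(server);
    }
  }

  static class ServerManager {
    boolean online = false;      // the server is already gone
    boolean scpScheduled = false;
    void expireServer(String server) {
      // "Expiration of ... but server not online": nothing gets scheduled.
      if (online) scpScheduled = true;
    }
  }

  public static void main(String[] args) {
    Dispatcher dispatcher = new Dispatcher();
    ServerManager sm = new ServerManager();
    String dead = "ve0530.halxg.cloudera.com,16020,1535387669315";
    String rit = "CLOSING";
    try {
      dispatcher.addOperationToNode(dead);
      rit = "CLOSED";
    } catch (NoServerDispatchException e) {
      sm.expireServer(dead);     // no-op for an offline server
    }
    // No SCP, no retry: the region stays wedged in CLOSING.
    System.out.println(rit + " scpScheduled=" + sm.scpScheduled);
  }
}
```

Running it prints `CLOSING scpScheduled=false`: the close never completes and nothing is queued to recover it.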
Root cause is that the crashed-server processing is stuck trying to split WALs. The split never succeeds, so we are stuck in limbo.
In the middle of writing recovered.edits files we throw an FNFE:
{code}
2018-08-27 15:58:01,939 INFO [split-log-closeStream-2] wal.WALSplitter: Rename hdfs://ve0524.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/db9d8cae5a9f098be493cb79d40e031f/recovered.edits/0000000000001384580-ve0530.halxg.cloudera.com%2C16020%2C1535387669315.1535406304424.temp to hdfs://ve0524.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/db9d8cae5a9f098be493cb79d40e031f/recovered.edits/0000000000001397283
2018-08-27 15:58:01,957 ERROR [split-log-closeStream-2] wal.WALSplitter: Couldn't close log at hdfs://ve0524.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/e8f03926ae2f6aa090f1278d3ad73762/recovered.edits/0000000000001332645-ve0530.halxg.cloudera.com%2C16020%2C1535387669315.1535406304424.temp
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /hbase/data/default/IntegrationTestBigLinkedList/e8f03926ae2f6aa090f1278d3ad73762/recovered.edits/0000000000001332645-ve0530.halxg.cloudera.com%2C16020%2C1535387669315.1535406304424.temp (inode 10892102) [Lease. Holder: DFSClient_NONMAPREDUCE_334149313_1, pending creates: 37]
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2672)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFileInternal(FSDirWriteFileOp.java:621)
    at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFile(FSDirWriteFileOp.java:601)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2717)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:882)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:555)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)

    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1489)
    at org.apache.hadoop.ipc.Client.call(Client.java:1435)
    at org.apache.hadoop.ipc.Client.call(Client.java:1345)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
    at com.sun.proxy.$Proxy26.complete(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:484)
    at sun.reflect.GeneratedMethodAccessor103.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
    at com.sun.proxy.$Proxy27.complete(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor103.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:372)
    at com.sun.proxy.$Proxy28.complete(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:908)
    at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:867)
    at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:850)
    at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:805)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
    at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.close(ProtobufLogWriter.java:67)
    at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.closeWriter(WALSplitter.java:1396)
    at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink$2.call(WALSplitter.java:1373)
    at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink$2.call(WALSplitter.java:1370)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
{code}
We then fail, unable to close the file:
{code}
6152452:blk_1088844461_15107988] hdfs.DataStreamer: Error Recovery for BP-1622493648-10.17.240.20-1513896152452:blk_1088844461_15107988 in pipeline [DatanodeInfoWithStorage[10.17.240.25:50031,DS-aa2e553d-0dc0-4199-8cda-a8c1fa453b5b,DISK], DatanodeInfoWithStorage[10.17.240.27:50031,DS-3665f870-e036-46f5-9a6e-44ca6991ca6b,DISK], DatanodeInfoWithStorage[10.17.240.24:50031,DS-fa22485e-20b5-44e2-b634-33260ade2b47,DISK]]: datanode 2(DatanodeInfoWithStorage[10.17.240.24:50031,DS-fa22485e-20b5-44e2-b634-33260ade2b47,DISK]) is bad.
2018-08-27 15:58:02,038 WARN [DataStreamer for file /hbase/data/default/IntegrationTestBigLinkedList/e9aef737cc0991c584dd842ac1d25eee/recovered.edits/0000000000001163341-ve0530.halxg.cloudera.com%2C16020%2C1535387669315.1535406304424.temp block BP-1622493648-10.17.240.20-1513896152452:blk_1088844461_15107988] hdfs.DataStreamer: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): BP-1622493648-10.17.240.20-1513896152452:blk_1088844461_15107988 does not exist.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkUCBlock(FSNamesystem.java:5546)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.updateBlockForPipeline(FSNamesystem.java:5618)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.updateBlockForPipeline(NameNodeRpcServer.java:901)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolServerSideTranslatorPB.java:985)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)

    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1489)
    at org.apache.hadoop.ipc.Client.call(Client.java:1435)
    at org.apache.hadoop.ipc.Client.call(Client.java:1345)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
    at com.sun.proxy.$Proxy26.updateBlockForPipeline(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolTranslatorPB.java:914)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
    at com.sun.proxy.$Proxy27.updateBlockForPipeline(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:372)
    at com.sun.proxy.$Proxy28.updateBlockForPipeline(Unknown Source)
    at org.apache.hadoop.hdfs.DataStreamer.updateBlockForPipeline(DataStreamer.java:1595)
    at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1472)
    at org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1237)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:657)
2018-08-27 15:58:02,041 ERROR [split-log-closeStream-3] wal.WALSplitter: Couldn't close log at hdfs://ve0524.halxg.cloudera.com:8020/hbase/data/default/IntegrationTestBigLinkedList/e9aef737cc0991c584dd842ac1d25eee/recovered.edits/0000000000001163341-ve0530.halxg.cloudera.com%2C16020%2C1535387669315.1535406304424.temp
org.apache.hadoop.ipc.RemoteException(java.io.IOException): BP-1622493648-10.17.240.20-1513896152452:blk_1088844461_15107988 does not exist.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkUCBlock(FSNamesystem.java:5546)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.updateBlockForPipeline(FSNamesystem.java:5618)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.updateBlockForPipeline(NameNodeRpcServer.java:901)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.updateBlockForPipeline(ClientNamenodeProtocolServerSideTranslatorPB.java:985)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
    at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)
...
{code}
On the NameNode it says:
{code}
2018-08-27 15:53:23,472 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8020, call Call#2352218 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.17.240.26:46482: java.io.FileNotFoundException: File does not exist: /hbase/data/default/IntegrationTestBigLinkedList/e8f03926ae2f6aa090f1278d3ad73762/recovered.edits/0000000000001332645-ve0530.halxg.cloudera.com%2C16020%2C1535387669315.1535406304424.temp (inode 10890719) [Lease. Holder: DFSClient_NONMAPREDUCE_-638399805_1, pending creates: 4]
{code}
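For reference, the recovered.edits names in the logs above can be reproduced with a small sketch (hypothetical helper methods, simplified from what WALSplitter actually does): edits go to a zero-padded `<first-seqid>-<wal-name>.temp` file under the region's recovered.edits/ directory, then get renamed to the highest sequence id on close. The FNFE fires when the NameNode no longer has the .temp inode by the time we try to complete it.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helpers (not WALSplitter's actual methods) reproducing the
// recovered.edits naming seen in the Rename log line above.
public class RecoveredEditsPaths {
  // Temp file: zero-padded first sequence id plus the source WAL name.
  static Path tempEditsPath(Path regionDir, long firstSeqId, String walName) {
    return regionDir.resolve("recovered.edits")
        .resolve(String.format("%019d-%s.temp", firstSeqId, walName));
  }

  // Final file: renamed to the highest sequence id once the writer closes.
  static Path finalEditsPath(Path regionDir, long maxSeqId) {
    return regionDir.resolve("recovered.edits")
        .resolve(String.format("%019d", maxSeqId));
  }

  public static void main(String[] args) {
    Path regionDir = Paths.get("/hbase/data/default/IntegrationTestBigLinkedList/"
        + "db9d8cae5a9f098be493cb79d40e031f");
    System.out.println(tempEditsPath(regionDir, 1384580L,
        "ve0530.halxg.cloudera.com%2C16020%2C1535387669315.1535406304424"));
    System.out.println(finalEditsPath(regionDir, 1397283L));
  }
}
```

If the lease on the .temp file is lost and the NameNode drops the inode before the close/rename step, the split worker sees the FNFE above, the whole log-split task fails, and the SCP stays wedged.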
And so on.
Nasty.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)