Keith Turner created HDFS-8406:
----------------------------------
Summary: Lease recovery continually failed
Key: HDFS-8406
URL: https://issues.apache.org/jira/browse/HDFS-8406
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 2.6.0
Reporter: Keith Turner
While testing Accumulo on a cluster and killing processes, I ran into a
situation where the lease on an accumulo write ahead log in HDFS could not be
recovered. Even restarting HDFS and Accumulo would not fix the problem.
The following message was seen in an Accumulo tablet server log immediately
before the tablet server was killed.
{noformat}
2015-05-14 17:12:37,466 [hdfs.DFSClient] WARN : DFSOutputStream
ResponseProcessor exception for block
BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192060
java.io.IOException: Bad response ERROR for block
BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192060 from datanode
10.1.5.9:50010
at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:897)
2015-05-14 17:12:37,466 [hdfs.DFSClient] WARN : Error Recovery for block
BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192060 in pipeline
10.1.5.55:50010, 10.1.5.9:5
{noformat}
Before recovering data from a write ahead log, the Accumulo master attempts to
recover the lease. This repeatedly failed with messages like the following.
{noformat}
2015-05-14 17:14:54,301 [recovery.HadoopLogCloser] WARN : Error recovering
lease on
hdfs://10.1.5.6:10000/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException):
failed to create file
/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for
DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because
pendingCreates is non-null but no leases found.
{noformat}
Below is some info from the NN logs for the problematic file.
{noformat}
[ec2-user@leader2 logs]$ grep 3a731759-3594-4535-8086-245
hadoop-ec2-user-namenode-leader2.log
2015-05-14 17:10:46,299 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
allocateBlock:
/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2.
BP-802741494-10.1.5.6-1431557089849
blk_1073932823_192060{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1,
replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW],
ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RBW],
ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}
2015-05-14 17:10:46,628 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* fsync:
/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for
DFSClient_NONMAPREDUCE_-1128465883_16
2015-05-14 17:14:49,288 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: recoverLease: [Lease.
Holder: DFSClient_NONMAPREDUCE_-1128465883_16, pendingcreates: 1],
src=/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 from
client DFSClient_NONMAPREDUCE_-1128465883_16
2015-05-14 17:14:49,288 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.
Holder: DFSClient_NONMAPREDUCE_-1128465883_16, pendingcreates: 1],
src=/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2
2015-05-14 17:14:49,289 WARN org.apache.hadoop.hdfs.StateChange: DIR*
NameSystem.internalReleaseLease: File
/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 has not been
closed. Lease recovery is in progress. RecoveryId = 192257 for block
blk_1073932823_192060{blockUCState=UNDER_RECOVERY, primaryNodeIndex=2,
replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW],
ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RBW],
ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}
java.lang.IllegalStateException: Failed to finalize INodeFile
3a731759-3594-4535-8086-245eed7cd4c2 since blocks[0] is non-complete, where
blocks=[blk_1073932823_192257{blockUCState=COMMITTED, primaryNodeIndex=2,
replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW],
ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}].
2015-05-14 17:14:54,292 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1
on 10000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.recoverLease from
10.1.5.158:53784 Call#529 Retry#0:
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create
file /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for
DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because
pendingCreates is non-null but no leases found.
2015-05-14 17:14:54,319 WARN org.apache.hadoop.hdfs.StateChange: DIR*
NameSystem.append: failed to create file
/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for
DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because
pendingCreates is non-null but no leases found.
2015-05-14 17:14:54,320 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6
on 10000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.append from
10.1.5.158:53784 Call#530 Retry#0:
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create
file /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for
DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because
pendingCreates is non-null but no leases found.
{noformat}
Below is expanded info from the NN logs about the exception mentioned above.
{noformat}
2015-05-14 17:14:51,704 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5
on 10000, call
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization
from 10.1.5.45:35687 Call#66737 Retry#0
java.lang.IllegalStateException: Failed to finalize INodeFile
3a731759-3594-4535-8086-245eed7cd4c2 since blocks[0] is non-complete, where
blocks=[blk_1073932823_192257{blockUCState=COMMITTED, primaryNodeIndex=2,
replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW],
ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}].
at
com.google.common.base.Preconditions.checkState(Preconditions.java:172)
at
org.apache.hadoop.hdfs.server.namenode.INodeFile.assertAllBlocksComplete(INodeFile.java:214)
at
org.apache.hadoop.hdfs.server.namenode.INodeFile.toCompleteFile(INodeFile.java:201)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.finalizeINodeFileUnderConstruction(FSNamesystem.java:4663)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.closeFileCommitBlocks(FSNamesystem.java:4878)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:4842)
at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.commitBlockSynchronization(NameNodeRpcServer.java:734)
at
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.commitBlockSynchronization(DatanodeProtocolServerSideTranslatorPB.java:270)
at
org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:26394)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
{noformat}
Below is the output of running fsck on the file.
{noformat}
[ec2-user@leader2 sbin]$ hdfs fsck
/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 -files -blocks
-locations -openforwrite
Connecting to namenode via http://leader2:50070
FSCK started by ec2-user (auth:SIMPLE) from /10.1.5.6 for path
/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 at Thu May 14
19:03:21 UTC 2015
/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 541965482
bytes, 1 block(s), OPENFORWRITE: Under replicated
BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192257{blockUCState=COMMITTED,
primaryNodeIndex=2,
replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW],
ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW],
ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RWR],
ReplicaUnderConstruction[[DISK]DS-b42bf32b-dc72-4be5-bf33-e1ee73825420:NORMAL:10.1.5.131:50010|RBW]]}.
Target Replicas is 3 but found 2 replica(s).
0.
BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192257{blockUCState=COMMITTED,
primaryNodeIndex=2,
replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW],
ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW],
ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RWR],
ReplicaUnderConstruction[[DISK]DS-b42bf32b-dc72-4be5-bf33-e1ee73825420:NORMAL:10.1.5.131:50010|RBW]]}
len=541965482 repl=2 [10.1.5.55:50010, 10.1.5.45:50010, 10.1.5.9:50010,
10.1.5.131:50010]
{noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)