Wei-Chiu Chuang created HDFS-11219:
--------------------------------------
Summary: Lease is leaked if block has no replica and length > 0
Key: HDFS-11219
URL: https://issues.apache.org/jira/browse/HDFS-11219
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Reporter: Wei-Chiu Chuang
I am not sure how it started initially, because part of NameNode log has rolled.
Background: a HBase cluster's WAL file got corrupt when it is splitted. The
last block went missing, and NameNode thinks the length of the last block is 83
bytes.
When I tried to recover lease manually via hdfs debug -recoverLease, NameNode
was unable to recover the lease with the following log message:
{noformat}
2016-10-12 16:39:32,885 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease. Holder:
DFSClient_NONMAPREDUCE_751860335_1, pendingcreates: 3],
src=/hbase/WALs/hadoopdev6.example.com,60020,1465915112353-splitting/hadoopdev6.example.com%2C60020%2C1465915112353.null0.1471340470256
2016-10-12 16:39:32,885 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: recoverLease: [Lease.
Holder: DFSClient_NONMAPREDUCE_751860335_1, pendingcreates: 3],
src=/hbase/WALs/hadoopdev6.example.com,60020,1465915112353-splitting/hadoopdev6.example.com%2C60020%2C1465915112353.null0.1471340470256
from client DFSClient_NONMAPREDUCE_751860335_1
2016-10-12 16:39:32,885 DEBUG org.apache.hadoop.ipc.Server: IPC Server
handler 0 on 8020: org.apache.hadoop.hdfs.protocol.ClientProtocol.recoverLease
from 192.168.4.86:35276 Call#708331 Retry#0 for RpcKind RPC_PROTOCOL_BUFFER
2016-10-12 16:39:32,885 DEBUG org.apache.hadoop.ipc.Server: got #708331
2016-10-12 16:39:32,886 WARN BlockStateChange: BLOCK*
BlockInfoUnderConstruction.initLeaseRecovery: No blocks found, lease removed.
2016-10-12 16:39:32,886 WARN org.apache.hadoop.hdfs.StateChange: DIR*
NameSystem.internalReleaseLease: File
/hbase/WALs/hadoopdev6.example.com,60020,1465915112353-splitting/hadoopdev6.example.com%2C60020%2C1465915112353.null0.1471340470256
has not been closed. Lease recovery is in progress. RecoveryId = 1099610544470
for block blk_1074528903_1099610239882 {blockUCState=UNDER_RECOVERY, primary
NodeIndex=-1, replicas=[]}
2016-10-12 16:39:32,887 DEBUG org.apache.hadoop.ipc.Server: IPC Server
handler 0 on 8020: responding to
org.apache.hadoop.hdfs.protocol.ClientProtocol.recoverLease from
192.168.4.86:35276 Call#708331 Retry#0 Wrote 36 bytes.
2016-10-12 16:39:32,887 DEBUG org.apache.hadoop.ipc.Server: IPC Server
handler 0 on 8020: responding to
org.apache.hadoop.hdfs.protocol.ClientProtocol.recoverLease from
192.168.4.86:35276 Call#708331 Retry#0
2016-10-12 16:39:32,887 DEBUG org.apache.hadoop.ipc.Server: Served:
recoverLease queueTime= 0 procesingTime= 2
{noformat}
Despite the log says {{No blocks found, lease removed.}} the lease is actually
not removed, and it is preserved into fsimage. The files remains open for write
for couple of months.
Initially, I thought this is HDFS-10763. But checking the error message on
HDFS-10763, this is definitely not the same error, because it doesn't have this
error message:
{noformat}
Failed to RECOVER_LEASE /xyz/xyz for user1 on
0.0.0.1 because the file is under construction but no leases found.
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:295
{noformat}
I looked at the patch at HDFS-8344, but it doesn't seem to address the issue
when replica size = 0.
On some datanodes I saw some effort to recover blocks with 83 bytes in length,
but because NameNode log has rolled, I can't be certain if these are the
splitted WAL file last blocks. Some of these blocks are deleted soon
thereafter, so maybe there are some wild race conditions where NameNode deleted
the blocks but didn't delete the file.
{noformat}
2016-09-19 11:54:41,093 INFO
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl:
initReplicaRecovery: blk_1074562914_1099610357708, recoveryId=1099610357849,
replica=ReplicaWaitingToBeRecovered, blk_1074562914_1099610357708, RWR
getNumBytes() = 83
getBytesOnDisk() = 83
getVisibleLength()= -1
getVolume() = /data/dfs/dn/current
getBlockFile() =
/data/dfs/dn/current/BP-1231527650-25.109.5.133-1392914565955/current/rbw/blk_1074562914
{noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]