Wei-Chiu Chuang created HDFS-13672:
--------------------------------------
Summary: clearCorruptLazyPersistFiles could crash NameNode
Key: HDFS-13672
URL: https://issues.apache.org/jira/browse/HDFS-13672
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Wei-Chiu Chuang
I started a NameNode on a fairly large fsimage. Because the NameNode was started
without any DataNodes, all of its blocks (100 million) were reported as "corrupt".
Afterwards I observed that FSNamesystem#clearCorruptLazyPersistFiles() held the
write lock for a long time:
{noformat}
18/06/12 12:37:03 INFO namenode.FSNamesystem: FSNamesystem write lock held for 46024 ms via
java.lang.Thread.getStackTrace(Thread.java:1559)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:945)
org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:198)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1689)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$LazyPersistFileScrubber.clearCorruptLazyPersistFiles(FSNamesystem.java:5532)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$LazyPersistFileScrubber.run(FSNamesystem.java:5543)
java.lang.Thread.run(Thread.java:748)
Number of suppressed write-lock reports: 0
Longest write-lock held interval: 46024
{noformat}
Here's the relevant code:
{code}
writeLock();
try {
  final Iterator<BlockInfo> it =
      blockManager.getCorruptReplicaBlockIterator();
  while (it.hasNext()) {
    Block b = it.next();
    BlockInfo blockInfo = blockManager.getStoredBlock(b);
    if (blockInfo.getBlockCollection().getStoragePolicyID() ==
        lpPolicy.getId()) {
      filesToDelete.add(blockInfo.getBlockCollection());
    }
  }
  for (BlockCollection bc : filesToDelete) {
    LOG.warn("Removing lazyPersist file " + bc.getName()
        + " with no replicas.");
    changed |= deleteInternal(bc.getName(), false, false, false);
  }
} finally {
  writeUnlock();
}
{code}
In essence, the iteration over the corrupt replica list should be broken into
smaller batches, so that the write lock is not held for one long stretch.
Because this operation held the NameNode write lock for more than 45 seconds,
the default ZKFC connection timeout, an extreme case like this one (100 million
corrupt blocks) could trigger a NameNode failover.
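
To illustrate the batching idea, here is a minimal standalone sketch. It is not
HDFS code and not a proposed patch: BatchedScrubber, its batchSize parameter and
the String file names are all hypothetical stand-ins, and it assumes the list can
be resumed by index after the lock is released, which the real corrupt replica
iterator may not allow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Predicate;

/** Hypothetical stand-in for the scrubber; none of these names exist in HDFS. */
class BatchedScrubber {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final List<String> corruptFiles; // stand-in for the corrupt replica list
  private final int batchSize;             // assumed tunable, not a real HDFS setting
  int lockAcquisitions = 0;                // counts how often the write lock was taken

  BatchedScrubber(List<String> corruptFiles, int batchSize) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    this.corruptFiles = new ArrayList<>(corruptFiles);
    this.batchSize = batchSize;
  }

  /** Scans the list in batches and returns the entries selected for deletion. */
  List<String> scrub(Predicate<String> isLazyPersist) {
    List<String> selected = new ArrayList<>();
    int cursor = 0;
    while (cursor < corruptFiles.size()) {
      lock.writeLock().lock();
      try {
        lockAcquisitions++;
        // Examine at most batchSize entries per lock hold.
        int end = Math.min(cursor + batchSize, corruptFiles.size());
        for (; cursor < end; cursor++) {
          String file = corruptFiles.get(cursor);
          if (isLazyPersist.test(file)) {
            selected.add(file); // the real code would delete the file here
          }
        }
      } finally {
        // Release between batches so other lock waiters can make progress.
        lock.writeLock().unlock();
      }
    }
    return selected;
  }
}
```

With five entries and a batch size of 2, the lock is taken three times instead
of being held once for the whole scan. The sketch does not handle the list being
modified while the lock is released; a real change would have to.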