Wei-Chiu Chuang created HDFS-13672:
--------------------------------------

             Summary: clearCorruptLazyPersistFiles could crash NameNode
                 Key: HDFS-13672
                 URL: https://issues.apache.org/jira/browse/HDFS-13672
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Wei-Chiu Chuang
I started a NameNode on a pretty large fsimage. Since the NameNode was started without any DataNodes, all blocks (100 million) were "corrupt". Afterwards I observed that FSNamesystem#clearCorruptLazyPersistFiles() held the write lock for a long time:

{noformat}
18/06/12 12:37:03 INFO namenode.FSNamesystem: FSNamesystem write lock held for 46024 ms via
java.lang.Thread.getStackTrace(Thread.java:1559)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:945)
org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:198)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1689)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$LazyPersistFileScrubber.clearCorruptLazyPersistFiles(FSNamesystem.java:5532)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$LazyPersistFileScrubber.run(FSNamesystem.java:5543)
java.lang.Thread.run(Thread.java:748)
	Number of suppressed write-lock reports: 0
	Longest write-lock held interval: 46024
{noformat}

Here's the relevant code:

{code}
writeLock();
try {
  final Iterator<BlockInfo> it =
      blockManager.getCorruptReplicaBlockIterator();
  while (it.hasNext()) {
    Block b = it.next();
    BlockInfo blockInfo = blockManager.getStoredBlock(b);
    if (blockInfo.getBlockCollection().getStoragePolicyID()
        == lpPolicy.getId()) {
      filesToDelete.add(blockInfo.getBlockCollection());
    }
  }

  for (BlockCollection bc : filesToDelete) {
    LOG.warn("Removing lazyPersist file " + bc.getName()
        + " with no replicas.");
    changed |= deleteInternal(bc.getName(), false, false, false);
  }
} finally {
  writeUnlock();
}
{code}

In essence, the iteration over the corrupt replica list should be broken down into smaller batches so that no single pass holds the lock for a long wait. Since this operation holds the NameNode write lock for more than 45 seconds, which is the default ZKFC connection timeout, an extreme case like this (100 million corrupt blocks) could lead to a NameNode failover. A rough sketch of the batched approach follows.
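For illustration only, here is a minimal sketch of how the scrubber could yield the write lock between fixed-size batches. The BATCH_SIZE constant and the overall shape of the method are assumptions for the sketch, not the actual HDFS change; a real implementation would also have to cope with the corrupt-replica iterator changing while the lock is released (e.g. by restarting the iteration), which this sketch only notes in a comment.

{code}
// Sketch only: scan the corrupt-replica iterator in bounded batches and drop
// the FSNamesystem write lock between batches so queued RPC handlers (and the
// ZKFC health monitor) are not starved for tens of seconds.
private static final int BATCH_SIZE = 1000; // hypothetical tuning knob

private void clearCorruptLazyPersistFiles() throws IOException {
  final List<BlockCollection> filesToDelete = new ArrayList<>();
  boolean changed = false;

  writeLock();
  try {
    final Iterator<BlockInfo> it =
        blockManager.getCorruptReplicaBlockIterator();
    int scanned = 0;
    while (it.hasNext()) {
      Block b = it.next();
      BlockInfo blockInfo = blockManager.getStoredBlock(b);
      if (blockInfo.getBlockCollection().getStoragePolicyID()
          == lpPolicy.getId()) {
        filesToDelete.add(blockInfo.getBlockCollection());
      }
      if (++scanned % BATCH_SIZE == 0) {
        // Yield the lock so other operations can make progress, then resume.
        // Caveat: the underlying corrupt-replica set may change while the lock
        // is released, so a production version must re-validate or restart
        // the iteration instead of blindly continuing on the same iterator.
        writeUnlock();
        writeLock();
      }
    }

    for (BlockCollection bc : filesToDelete) {
      LOG.warn("Removing lazyPersist file " + bc.getName()
          + " with no replicas.");
      changed |= deleteInternal(bc.getName(), false, false, false);
    }
  } finally {
    writeUnlock();
  }
}
{code}

The deletion loop could be batched the same way if the number of lazyPersist files to delete is itself large.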