[ https://issues.apache.org/jira/browse/HDFS-13672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16554351#comment-16554351 ]
Gabor Bota commented on HDFS-13672:
-----------------------------------

Adding lazy persist would be a great idea. [~xiaochen], please create a follow-up jira for it if you want to.

Another good idea is to create a service that iterates through a list of elements while holding the write lock, running each element through a lambda function. We may want to create a jira for that as well (kudos to [~andrew.wang]); a rough sketch of the idea is included at the bottom of this message.

So to summarize this issue:
* It's not worth making a behavior change for this, since this long blocking scan will most likely only happen in debugging situations (and we have a workaround).
* The workaround is to disable the scrubber interval while debugging. In real-world/customer environments there are no cases with this many corrupted lazy persist files.
* I will close this jira as Won't Fix; if there are no follow-up jiras by tomorrow CET, I'll create them.

> clearCorruptLazyPersistFiles could crash NameNode
> -------------------------------------------------
>
>          Key: HDFS-13672
>          URL: https://issues.apache.org/jira/browse/HDFS-13672
>      Project: Hadoop HDFS
>   Issue Type: Bug
>     Reporter: Wei-Chiu Chuang
>     Assignee: Gabor Bota
>     Priority: Major
>  Attachments: HDFS-13672.001.patch, HDFS-13672.002.patch, HDFS-13672.003.patch
>
> I started a NameNode on a pretty large fsimage. Since the NameNode is started without any DataNodes, all blocks (100 million) are "corrupt".
> Afterwards I observed that FSNamesystem#clearCorruptLazyPersistFiles() held the write lock for a long time:
> {noformat}
> 18/06/12 12:37:03 INFO namenode.FSNamesystem: FSNamesystem write lock held for 46024 ms via
> java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:945)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:198)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1689)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem$LazyPersistFileScrubber.clearCorruptLazyPersistFiles(FSNamesystem.java:5532)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem$LazyPersistFileScrubber.run(FSNamesystem.java:5543)
> java.lang.Thread.run(Thread.java:748)
> Number of suppressed write-lock reports: 0
> Longest write-lock held interval: 46024
> {noformat}
> Here's the relevant code:
> {code}
> writeLock();
> try {
>   final Iterator<BlockInfo> it =
>       blockManager.getCorruptReplicaBlockIterator();
>   while (it.hasNext()) {
>     Block b = it.next();
>     BlockInfo blockInfo = blockManager.getStoredBlock(b);
>     if (blockInfo.getBlockCollection().getStoragePolicyID() ==
>         lpPolicy.getId()) {
>       filesToDelete.add(blockInfo.getBlockCollection());
>     }
>   }
>   for (BlockCollection bc : filesToDelete) {
>     LOG.warn("Removing lazyPersist file " + bc.getName() + " with no replicas.");
>     changed |= deleteInternal(bc.getName(), false, false, false);
>   }
> } finally {
>   writeUnlock();
> }
> {code}
> In essence, the iteration over the corrupt replica list should be broken down into smaller batches to avoid a single long wait.
> Since this operation holds the NameNode write lock for more than 45 seconds, the default ZKFC connection timeout, an extreme case like this (100 million corrupt blocks) could lead to a NameNode failover.
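For reference, here is a minimal, self-contained sketch of the batched-locking / lambda-iteration idea mentioned above. It is not the committed patch and not an existing Hadoop class; the class name {{BatchedLockedIterator}}, the batch size, and the toy {{main()}} driver are all made up purely for illustration. The point is only that the write lock is released and re-acquired between batches, so a single huge scan cannot block other operations for tens of seconds.

{code}
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Consumer;

/**
 * Hypothetical helper (not an existing Hadoop class) that applies an action to each
 * element of a collection under a write lock, but releases and re-acquires the lock
 * after every batchSize elements so other threads are not blocked for the whole scan.
 */
public class BatchedLockedIterator<T> {
  private final ReentrantReadWriteLock lock;
  private final int batchSize;

  public BatchedLockedIterator(ReentrantReadWriteLock lock, int batchSize) {
    this.lock = lock;
    this.batchSize = batchSize;
  }

  /** Runs {@code action} on every element, holding the write lock for at most batchSize elements at a time. */
  public void forEach(Iterable<T> elements, Consumer<T> action) {
    Iterator<T> it = elements.iterator();
    while (it.hasNext()) {
      lock.writeLock().lock();
      try {
        int processed = 0;
        // Process up to batchSize elements, then drop the lock to let other writers in.
        while (it.hasNext() && processed < batchSize) {
          action.accept(it.next());
          processed++;
        }
      } finally {
        lock.writeLock().unlock();
      }
    }
  }

  public static void main(String[] args) {
    ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();
    BatchedLockedIterator<String> scrubber = new BatchedLockedIterator<>(fsLock, 2);
    // Toy stand-in for the corrupt-replica scan: each "block" is just a string here.
    List<String> corruptBlocks = List.of("blk_1", "blk_2", "blk_3", "blk_4", "blk_5");
    scrubber.forEach(corruptBlocks, blk ->
        System.out.println("Would delete lazyPersist file for " + blk));
  }
}
{code}

Note that inside FSNamesystem the real change would be subtler: once the write lock is dropped, the corrupt-replica iterator may no longer be valid, so the scrubber would need to snapshot or re-fetch its candidates between batches rather than keep iterating blindly.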