[
https://issues.apache.org/jira/browse/HDFS-13672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16531371#comment-16531371
]
Andrew Wang commented on HDFS-13672:
------------------------------------
Hi Gabor, thanks for working on this.
I don't think it's thread-safe to drop the lock while holding onto an iterator
like this. This is a LinkedSetIterator and will throw a
ConcurrentModificationException if the set is changed underneath it. We need a
way to safely resume at a mid-point, and that seems a bit hard with
LinkedSetIterator as it is.
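To make the failure mode concrete, here is a minimal sketch using
java.util.LinkedHashSet as a stand-in for the fail-fast LinkedSetIterator (the
class and variable names below are illustrative only, not from the patch):
{code}
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustration only: once the set is modified after the iterator is obtained,
// the next hasNext()/next() call fails, just as described above.
public class FailFastIteratorDemo {
  public static void main(String[] args) {
    Set<Long> corruptBlockIds = new LinkedHashSet<>();
    for (long i = 0; i < 10; i++) {
      corruptBlockIds.add(i);
    }

    Iterator<Long> it = corruptBlockIds.iterator();
    it.next(); // start iterating

    // Simulates the set being mutated while the scrubber has released the
    // write lock but kept the iterator.
    corruptBlockIds.remove(5L);

    it.next(); // throws java.util.ConcurrentModificationException
  }
}
{code}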
Since I think the common case here is that there are zero lazy persist files, a
better (though different) change would be to skip running this scrubber
entirely if there aren't any lazy persist files. I'm hoping there's an easy way
to add a counter for this (or some existing way to query if there are any lazy
persist files).
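Roughly what I have in mind, as a hypothetical sketch only (the
lazyPersistFileCount counter below is assumed, not an existing FSNamesystem
field, and the scrub body is elided):
{code}
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: maintain a counter where lazy persist files are
// created and removed, and skip the scrubber pass when it is zero.
class LazyPersistScrubberSketch {
  private final AtomicLong lazyPersistFileCount = new AtomicLong(0);

  void onLazyPersistFileCreated() { lazyPersistFileCount.incrementAndGet(); }
  void onLazyPersistFileDeleted() { lazyPersistFileCount.decrementAndGet(); }

  void scrubPass() {
    // Common case: no lazy persist files exist, so there is nothing to do
    // and we never take the namesystem write lock.
    if (lazyPersistFileCount.get() == 0) {
      return;
    }
    // ... otherwise run the existing clearCorruptLazyPersistFiles() logic ...
  }
}
{code}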
We also need unit tests for new changes like this. I think you also typo'd the
config key name with "sec" instead of "millis" or "ms". Config keys also need
to be added to hdfs-default.xml with a description for documentation purposes.
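For the key naming, a hypothetical sketch of a millisecond-based key read
through Configuration#getTimeDuration; the key name and default below are
placeholders, not the names from the patch:
{code}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;

public class ScrubberConfigExample {
  // Hypothetical key name; note the "ms" suffix matching the millisecond unit.
  static final String LAZY_PERSIST_SCRUBBER_LOCK_RELEASE_MS_KEY =
      "dfs.namenode.lazypersist.file.scrubber.lock.release.ms";
  static final long LAZY_PERSIST_SCRUBBER_LOCK_RELEASE_MS_DEFAULT = 500;

  static long readInterval(Configuration conf) {
    // getTimeDuration lets operators write "500ms", "1s", etc. in the config.
    return conf.getTimeDuration(
        LAZY_PERSIST_SCRUBBER_LOCK_RELEASE_MS_KEY,
        LAZY_PERSIST_SCRUBBER_LOCK_RELEASE_MS_DEFAULT,
        TimeUnit.MILLISECONDS);
  }
}
{code}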
> clearCorruptLazyPersistFiles could crash NameNode
> -------------------------------------------------
>
> Key: HDFS-13672
> URL: https://issues.apache.org/jira/browse/HDFS-13672
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Wei-Chiu Chuang
> Assignee: Gabor Bota
> Priority: Major
> Attachments: HDFS-13672.001.patch, HDFS-13672.002.patch
>
>
> I started a NameNode on a pretty large fsimage. Since the NameNode is started
> without any DataNodes, all blocks (100 million) are "corrupt".
> Afterwards I observed FSNamesystem#clearCorruptLazyPersistFiles() held the
> write lock for a long time:
> {noformat}
> 18/06/12 12:37:03 INFO namenode.FSNamesystem: FSNamesystem write lock held for 46024 ms via
> java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:945)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:198)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1689)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem$LazyPersistFileScrubber.clearCorruptLazyPersistFiles(FSNamesystem.java:5532)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem$LazyPersistFileScrubber.run(FSNamesystem.java:5543)
> java.lang.Thread.run(Thread.java:748)
> Number of suppressed write-lock reports: 0
> Longest write-lock held interval: 46024
> {noformat}
> Here's the relevant code:
> {code}
> writeLock();
> try {
>   final Iterator<BlockInfo> it =
>       blockManager.getCorruptReplicaBlockIterator();
>   while (it.hasNext()) {
>     Block b = it.next();
>     BlockInfo blockInfo = blockManager.getStoredBlock(b);
>     if (blockInfo.getBlockCollection().getStoragePolicyID() ==
>         lpPolicy.getId()) {
>       filesToDelete.add(blockInfo.getBlockCollection());
>     }
>   }
>   for (BlockCollection bc : filesToDelete) {
>     LOG.warn("Removing lazyPersist file " + bc.getName()
>         + " with no replicas.");
>     changed |= deleteInternal(bc.getName(), false, false, false);
>   }
> } finally {
>   writeUnlock();
> }
> {code}
> In essence, the iteration over the corrupt replica list should be broken down
> into smaller iterations to avoid a single long wait.
> Since this operation holds the NameNode write lock for more than 45 seconds
> (the default ZKFC connection timeout), an extreme case like this (100 million
> corrupt blocks) could lead to a NameNode failover.
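A rough sketch of the batching idea described in the quoted issue above, using
simplified stand-in types rather than the real FSNamesystem/BlockManager APIs:
snapshot the corrupt block ids under one short lock hold, then process the
snapshot in fixed-size batches, re-taking the write lock per batch and
re-validating each block, so no fail-fast iterator is held across lock
releases.
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class BatchedScrubberSketch {
  private static final int BLOCKS_PER_LOCK_HOLD = 10_000; // illustrative

  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  // Stand-in for the real corrupt-replica set.
  private final List<Long> corruptBlockIds = new ArrayList<>();

  void scrub() {
    // Pass 1: copy the ids under the lock so no live iterator outlives it.
    List<Long> snapshot;
    lock.writeLock().lock();
    try {
      snapshot = new ArrayList<>(corruptBlockIds);
    } finally {
      lock.writeLock().unlock();
    }

    // Pass 2: process the snapshot in batches, releasing the lock between
    // batches so other operations (and ZKFC health checks) can proceed.
    for (int start = 0; start < snapshot.size(); start += BLOCKS_PER_LOCK_HOLD) {
      int end = Math.min(start + BLOCKS_PER_LOCK_HOLD, snapshot.size());
      lock.writeLock().lock();
      try {
        for (Long blockId : snapshot.subList(start, end)) {
          // Re-validate under the lock: the block may have been fixed or
          // removed since the snapshot was taken.
          processIfStillCorruptLazyPersist(blockId);
        }
      } finally {
        lock.writeLock().unlock();
      }
    }
  }

  private void processIfStillCorruptLazyPersist(long blockId) {
    // Placeholder for the real lookup + lazy-persist policy check + delete.
  }
}
{code}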
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]