[ 
https://issues.apache.org/jira/browse/HDFS-13672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16554351#comment-16554351
 ] 

Gabor Bota commented on HDFS-13672:
-----------------------------------

Adding lazy persist would be a great idea. [~xiaochen] if you can create a 
follow-up jira if you want to.
Another good idea is to create a service where you can iterate through a list 
of elements with a gained writeLock - and each element can be run through a 
lambda function. We may want to create a jira for that. (kudos for 
[~andrew.wang])

So to summarize this issue:
* It's not worth making a behavior change for this since this long blocking 
scan probably will only happen during debugging situations (and we have a 
workaround)
* The workaround is to disable the scrubber interval when debugging. In the 
real world/customer environments, there are no cases when there are so many 
corrupted lazy persist files.
* I will close this jira as won't fix and if there's no follow-up jiras, I'll 
create one tomorrow CET.

> clearCorruptLazyPersistFiles could crash NameNode
> -------------------------------------------------
>
>                 Key: HDFS-13672
>                 URL: https://issues.apache.org/jira/browse/HDFS-13672
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Wei-Chiu Chuang
>            Assignee: Gabor Bota
>            Priority: Major
>         Attachments: HDFS-13672.001.patch, HDFS-13672.002.patch, 
> HDFS-13672.003.patch
>
>
> I started a NameNode on a pretty large fsimage. Since the NameNode is started 
> without any DataNodes, all blocks (100 million) are "corrupt".
> Afterwards I observed FSNamesystem#clearCorruptLazyPersistFiles() held write 
> lock for a long time:
> {noformat}
> 18/06/12 12:37:03 INFO namenode.FSNamesystem: FSNamesystem write lock held 
> for 46024 ms via
> java.lang.Thread.getStackTrace(Thread.java:1559)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:945)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:198)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1689)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem$LazyPersistFileScrubber.clearCorruptLazyPersistFiles(FSNamesystem.java:5532)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem$LazyPersistFileScrubber.run(FSNamesystem.java:5543)
> java.lang.Thread.run(Thread.java:748)
>         Number of suppressed write-lock reports: 0
>         Longest write-lock held interval: 46024
> {noformat}
> Here's the relevant code:
> {code}
>       writeLock();
>       try {
>         final Iterator<BlockInfo> it =
>             blockManager.getCorruptReplicaBlockIterator();
>         while (it.hasNext()) {
>           Block b = it.next();
>           BlockInfo blockInfo = blockManager.getStoredBlock(b);
>           if (blockInfo.getBlockCollection().getStoragePolicyID() == 
> lpPolicy.getId()) {
>             filesToDelete.add(blockInfo.getBlockCollection());
>           }
>         }
>         for (BlockCollection bc : filesToDelete) {
>           LOG.warn("Removing lazyPersist file " + bc.getName() + " with no 
> replicas.");
>           changed |= deleteInternal(bc.getName(), false, false, false);
>         }
>       } finally {
>         writeUnlock();
>       }
> {code}
> In essence, the iteration over corrupt replica list should be broken down 
> into smaller iterations to avoid a single long wait.
> Since this operation holds NameNode write lock for more than 45 seconds, the 
> default ZKFC connection timeout, it implies an extreme case like this (100 
> million corrupt blocks) could lead to NameNode failover.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to