[
https://issues.apache.org/jira/browse/HDFS-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13787803#comment-13787803
]
Chris Nauroth commented on HDFS-5313:
-------------------------------------
I suspect we'll need to arrange for the edit log loading code path to call
{{FSDirectory#unprotectedSetCacheReplication}} instead. This is similar to how
{{OP_SET_REPLICATION}} is handled.
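To illustrate the split I have in mind, here is a minimal, self-contained sketch. It is not the real Hadoop code: {{ToyDirectory}}, {{markReady}}, {{getCacheReplication}} and {{EditLogReplaySketch}} are hypothetical stand-ins, and only the method names {{setCacheReplication}} and {{unprotectedSetCacheReplication}} are borrowed from this issue. The client-facing method waits for the directory to become ready, while the unprotected variant skips that wait so it can be used while edits are still being replayed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Toy stand-in for a directory that is not "ready" until loading completes. */
class ToyDirectory {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition readyCond = lock.newCondition();
    private boolean ready = false;
    private final Map<String, Short> cacheReplication = new HashMap<>();

    /** Called once, after image and edits have been loaded. */
    void markReady() {
        lock.lock();
        try {
            ready = true;
            readyCond.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /** Client-facing path: blocks until the directory has been marked ready. */
    void setCacheReplication(String path, short replication) {
        lock.lock();
        try {
            while (!ready) {
                // Timed wait in a loop; if nothing ever calls markReady(),
                // this loop never exits. That is the hang in this issue.
                readyCond.await(5, TimeUnit.SECONDS);
            }
            unprotectedSetCacheReplication(path, replication);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for ready", e);
        } finally {
            lock.unlock();
        }
    }

    /**
     * Replay path: no readiness wait. In this sketch the caller is assumed
     * to be the single loader thread, so no locking is done here.
     */
    void unprotectedSetCacheReplication(String path, short replication) {
        cacheReplication.put(path, replication);
    }

    Short getCacheReplication(String path) {
        return cacheReplication.get(path);
    }
}

class EditLogReplaySketch {
    public static void main(String[] args) {
        ToyDirectory dir = new ToyDirectory();
        // During replay the directory is not ready yet, so only the
        // unprotected variant is safe to call.
        dir.unprotectedSetCacheReplication("/party-0", (short) 1);
        // Loading is done; client-facing calls may proceed from here on.
        dir.markReady();
        dir.setCacheReplication("/party-1", (short) 2);
        System.out.println(dir.getCacheReplication("/party-0")
            + " " + dir.getCacheReplication("/party-1"));
    }
}
```

In this toy version, calling {{setCacheReplication}} before {{markReady}} would spin in the wait loop indefinitely, which mirrors the hang reported here.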
I wondered why {{TestCacheReplicationManager#testCacheManagerRestart}} didn't
catch this problem. It turns out this is because the test only adds cache
directives for paths that don't exist, so on restart,
{{CacheManager#unprotectedAddEntry}} skips the call to
{{FSNamesystem#setCacheReplicationInt}}. I see this in my test logs:
{code}
2013-10-06 15:09:13,843 WARN namenode.CacheManager (CacheManager.java:unprotectedAddEntry(180)) - Path /party-0 is not a file
{code}
I suggest that we update the test so that it really creates those files.
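I also wondered why I didn't catch this in my earlier manual testing of
HDFS-5119. I suspect I masked the problem by running
{{hdfs dfsadmin -saveNamespace}} to force a checkpoint, which would mean the
directive was loaded from the fsimage on restart rather than replayed from the
edits.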
The following is the full stack trace showing where startup blocks:
{code}
"main" prio=10 tid=0x00007fc344010000 nid=0x1238 waiting on condition [0x00007fc34c14f000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000000c214a868> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2173)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.waitForReady(FSDirectory.java:247)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.setCacheReplication(FSDirectory.java:1108)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setCacheReplicationInt(FSNamesystem.java:1945)
    at org.apache.hadoop.hdfs.server.namenode.CacheManager.unprotectedAddEntry(CacheManager.java:178)
    at org.apache.hadoop.hdfs.server.namenode.CacheManager.unprotectedAddDirective(CacheManager.java:253)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:647)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:205)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:118)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:730)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:644)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:261)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:812)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:590)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:445)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:493)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:691)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:676)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1265)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1331)
{code}
> NameNode hangs during startup trying to apply
> OP_ADD_PATH_BASED_CACHE_DIRECTIVE.
> --------------------------------------------------------------------------------
>
> Key: HDFS-5313
> URL: https://issues.apache.org/jira/browse/HDFS-5313
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: HDFS-4949
> Reporter: Chris Nauroth
>
> During namenode startup, if the edits contain a
> {{OP_ADD_PATH_BASED_CACHE_DIRECTIVE}} for an existing file, then the process
> hangs while trying to apply the op. This is because of a call to
> {{FSDirectory#setCacheReplication}}, which calls
> {{FSDirectory#waitForReady}}, but of course nothing is ever going to mark the
> directory ready, because it's still in the process of loading.