[ 
https://issues.apache.org/jira/browse/HDFS-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13787803#comment-13787803
 ] 

Chris Nauroth commented on HDFS-5313:
-------------------------------------

I suspect we'll need to arrange for the edit log loading code path to call 
{{FSDirectory#unprotectedSetCacheReplication}} instead of 
{{FSDirectory#setCacheReplication}}, which waits for the directory to be 
marked ready.  This is similar to how {{OP_SET_REPLICATION}} is handled.
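
To illustrate the distinction, here is a minimal, self-contained sketch of the pattern. It is not the HDFS code: {{ToyDirectory}}, {{markReady}} and the timeout parameter are hypothetical names invented for the example, and the real {{FSDirectory}} does considerably more. The sketch only assumes that the protected method waits on a readiness condition while the unprotected one does not.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Toy stand-in for a directory that becomes ready only after loading. */
public class ToyDirectory {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition readyCond = lock.newCondition();
    private boolean ready = false;      // flipped only once loading has finished
    private short cacheReplication = 0;

    /** Called once the image and edits are fully loaded. */
    public void markReady() {
        lock.lock();
        try {
            ready = true;
            readyCond.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /**
     * Client-facing path: waits for readiness before applying the change.
     * The timeout is hypothetical and exists only so the demo terminates;
     * it returns false if the directory never became ready in time.
     */
    public boolean setCacheReplication(short value, long timeoutMillis)
            throws InterruptedException {
        lock.lock();
        try {
            long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
            while (!ready) {
                if (nanos <= 0) {
                    return false;       // still loading; nothing marked us ready
                }
                nanos = readyCond.awaitNanos(nanos);
            }
            unprotectedSetCacheReplication(value);
            return true;
        } finally {
            lock.unlock();
        }
    }

    /** Loader path: applies the change without waiting for readiness. */
    public void unprotectedSetCacheReplication(short value) {
        cacheReplication = value;
    }

    public short getCacheReplication() {
        return cacheReplication;
    }

    public static void main(String[] args) throws InterruptedException {
        ToyDirectory dir = new ToyDirectory();
        // Simulate edit log replay: markReady() has not been called yet.
        boolean applied = dir.setCacheReplication((short) 3, 100);
        System.out.println("protected call applied during load: " + applied);
        dir.unprotectedSetCacheReplication((short) 3);
        System.out.println("value after unprotected call: "
            + dir.getCacheReplication());
    }
}
```

In this toy model the protected call can only succeed after {{markReady}} runs, so a loader that is itself responsible for reaching that point has to use the unprotected variant.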

I wondered why {{TestCacheReplicationManager#testCacheManagerRestart}} didn't 
catch this problem.  It turns out this is because the test only adds cache 
directives for paths that don't exist, so on restart, 
{{CacheManager#unprotectedAddEntry}} skips the call to 
{{FSNamesystem#setCacheReplicationInt}}.  I see this in my test logs:

{code}
2013-10-06 15:09:13,843 WARN  namenode.CacheManager (CacheManager.java:unprotectedAddEntry(180)) - Path /party-0 is not a file
{code}

I suggest that we update the test so that it really creates those files.
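
The reason the test passes can be shown with a small toy model. This is a hedged sketch, not the real {{CacheManager}}: {{ToyCacheManager}}, {{createFile}} and the call counter are hypothetical, and the only behavior taken from the discussion above is that a directive on a path that is not a file produces a warning and never reaches the replication call.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the skip described above; not the real CacheManager. */
public class ToyCacheManager {
    private final Set<String> files = new HashSet<>();
    private int replicationCalls = 0;   // times the potentially blocking path was reached

    /** Hypothetical helper: registers a path as an existing file. */
    public void createFile(String path) {
        files.add(path);
    }

    /** A directive on a non-file path only logs a warning and returns. */
    public void unprotectedAddEntry(String path) {
        if (!files.contains(path)) {
            System.out.println("WARN Path " + path + " is not a file");
            return;
        }
        // In the scenario under discussion, this is where the call that
        // waits for readiness would be made.
        replicationCalls++;
    }

    public int getReplicationCalls() {
        return replicationCalls;
    }

    public static void main(String[] args) {
        // Current test shape: the path was never created.
        ToyCacheManager current = new ToyCacheManager();
        current.unprotectedAddEntry("/party-0");
        System.out.println("calls without file: " + current.getReplicationCalls());

        // Suggested test shape: create the file first.
        ToyCacheManager updated = new ToyCacheManager();
        updated.createFile("/party-0");
        updated.unprotectedAddEntry("/party-0");
        System.out.println("calls with file: " + updated.getReplicationCalls());
    }
}
```

Only the second shape exercises the code path that hangs, which is why creating the files first should make the test sensitive to this bug.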

I also wondered why I didn't catch this in my earlier manual testing of 
HDFS-5119.  I suspect I masked the problem by running 
{{hdfs dfsadmin -saveNamespace}} to force a checkpoint, which would have left 
no {{OP_ADD_PATH_BASED_CACHE_DIRECTIVE}} in the edits to replay on restart.

The full stack trace below shows where startup blocks:

{code}
"main" prio=10 tid=0x00007fc344010000 nid=0x1238 waiting on condition [0x00007fc34c14f000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000c214a868> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2173)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.waitForReady(FSDirectory.java:247)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.setCacheReplication(FSDirectory.java:1108)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setCacheReplicationInt(FSNamesystem.java:1945)
        at org.apache.hadoop.hdfs.server.namenode.CacheManager.unprotectedAddEntry(CacheManager.java:178)
        at org.apache.hadoop.hdfs.server.namenode.CacheManager.unprotectedAddDirective(CacheManager.java:253)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:647)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:205)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:118)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:730)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:644)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:261)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:812)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:590)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:445)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:493)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:691)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:676)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1265)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1331)
{code}


> NameNode hangs during startup trying to apply 
> OP_ADD_PATH_BASED_CACHE_DIRECTIVE.
> --------------------------------------------------------------------------------
>
>                 Key: HDFS-5313
>                 URL: https://issues.apache.org/jira/browse/HDFS-5313
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: HDFS-4949
>            Reporter: Chris Nauroth
>
> During namenode startup, if the edits contain a 
> {{OP_ADD_PATH_BASED_CACHE_DIRECTIVE}} for an existing file, then the process 
> hangs while trying to apply the op.  This is because of a call to 
> {{FSDirectory#setCacheReplication}}, which calls 
> {{FSDirectory#waitForReady}}, but of course nothing is ever going to mark the 
> directory ready, because it's still in the process of loading.



