[ https://issues.apache.org/jira/browse/HDFS-7609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289642#comment-14289642 ]

Ming Ma commented on HDFS-7609:
-------------------------------

Yeah, we also had this issue. It appears that somehow an entry with the same
client id and call id already existed in the retryCache, which ended up calling
the expensive PriorityQueue#remove function. Below is the call stack captured
while the standby was replaying the edit logs.

{noformat}
"Edit log tailer" prio=10 tid=0x00007f096d491000 nid=0x533c runnable [0x00007ef05ee7a000]
   java.lang.Thread.State: RUNNABLE
        at java.util.PriorityQueue.removeAt(PriorityQueue.java:605)
        at java.util.PriorityQueue.remove(PriorityQueue.java:364)
        at org.apache.hadoop.util.LightWeightCache.put(LightWeightCache.java:218)
        at org.apache.hadoop.ipc.RetryCache.addCacheEntry(RetryCache.java:296)
        - locked <0x00007ef2fe306978> (a org.apache.hadoop.ipc.RetryCache)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntry(FSNamesystem.java:801)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:507)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:224)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:133)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:804)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:785)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:230)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:324)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
{noformat}
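
To make the cost concrete, here is a standalone toy (not Hadoop code; the class
name is made up) showing why java.util.PriorityQueue#remove(Object) dominates
the stack above: it has to scan the backing array linearly before it can
re-heapify, so each put that appears to evict an existing entry with the same
key pays O(n) against a cache holding millions of entries.

{noformat}
import java.util.PriorityQueue;

// Toy benchmark, not Hadoop code: illustrates the linear scan inside
// PriorityQueue#remove(Object), the call that shows up hot in the stack above.
public class PriorityQueueRemoveCost {
    public static void main(String[] args) {
        PriorityQueue<Long> queue = new PriorityQueue<>();
        final int n = 2_000_000; // roughly the scale of a large retry cache
        for (long i = 0; i < n; i++) {
            queue.add(i);
        }
        long start = System.nanoTime();
        // remove(Object) has no index, so it walks the backing array until it
        // finds an equal element, then calls removeAt(). The largest value sits
        // at the tail of the heap array here, which is the worst case.
        queue.remove(Long.valueOf(n - 1));
        long micros = (System.nanoTime() - start) / 1_000;
        System.out.println("one remove(Object) over " + n + " entries: " + micros + " us");
    }
}
{noformat}

Multiply a single removal like that by the number of replayed edit ops that hit
an existing cache entry, and a multi-hour startup becomes plausible.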


> startup used too much time to load edits
> ----------------------------------------
>
>                 Key: HDFS-7609
>                 URL: https://issues.apache.org/jira/browse/HDFS-7609
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 2.2.0
>            Reporter: Carrey Zhan
>         Attachments: HDFS-7609-CreateEditsLogWithRPCIDs.patch, 
> recovery_do_not_use_retrycache.patch
>
>
> One day my namenode crashed because two journal nodes timed out at the same
> time under very high load, leaving behind about 100 million transactions in
> the edits log. (I still have no idea why they were not rolled into the fsimage.)
> I tried to restart the namenode, but it showed that almost 20 hours would be
> needed to finish, and it was loading fsedits most of the time. I also tried to
> restart the namenode in recovery mode, but the loading speed was no different.
> I looked into the stack trace and judged that the slowness was caused by the
> retry cache. So I set dfs.namenode.enable.retrycache to false, and the restart
> process finished in half an hour.
> I think the retry cache is useless during startup, at least during the
> recovery process.
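
For reference, the workaround quoted above would look roughly like the
following in hdfs-site.xml (a sketch of the reporter's mitigation, not a
recommended permanent setting; the cache should be re-enabled after startup,
since it protects non-idempotent client retries in normal operation):

{noformat}
<!-- Sketch of the workaround described above: disable the NameNode retry cache
     so edit-log replay no longer pays for retry-cache bookkeeping, then restore
     the default (true) after the restart completes. -->
<property>
  <name>dfs.namenode.enable.retrycache</name>
  <value>false</value>
</property>
{noformat}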



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
