[ https://issues.apache.org/jira/browse/HDFS-7609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289642#comment-14289642 ]
Ming Ma commented on HDFS-7609:
-------------------------------
Yeah, we also hit this issue. It appears that an entry with the same client
id and call id somehow already existed in the retry cache, which ended up
invoking the expensive PriorityQueue#remove method. Below is the call stack
captured while the standby was replaying the edit logs.
{noformat}
"Edit log tailer" prio=10 tid=0x00007f096d491000 nid=0x533c runnable
[0x00007ef05ee7a000]
java.lang.Thread.State: RUNNABLE
at java.util.PriorityQueue.removeAt(PriorityQueue.java:605)
at java.util.PriorityQueue.remove(PriorityQueue.java:364)
at org.apache.hadoop.util.LightWeightCache.put(LightWeightCache.java:218)
at org.apache.hadoop.ipc.RetryCache.addCacheEntry(RetryCache.java:296)
- locked <0x00007ef2fe306978> (a org.apache.hadoop.ipc.RetryCache)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntry(FSNamesystem.java:801)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:507)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:224)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:133)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:804)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:785)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:230)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:324)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:282)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:299)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:295)
{noformat}
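For context on why this path is so costly: java.util.PriorityQueue#remove(Object)
does a linear scan of the backing array before it can re-heapify, so every put that
collides with an existing retry-cache entry pays O(n) against the cache's expiry
queue, and repeated collisions while replaying millions of edits add up quickly.
The standalone sketch below (not Hadoop code; the class name and queue size are
made up for illustration) shows the cost pattern of a single such remove on a
large queue:
{noformat}
import java.util.PriorityQueue;

/**
 * Illustrative sketch only (not the Hadoop code path): shows that
 * PriorityQueue#remove(Object) must scan the whole queue before it can
 * re-heapify, so each call is O(n) on a large queue.
 */
public class PriorityQueueRemoveCost {
  public static void main(String[] args) {
    final int n = 2_000_000; // roughly the scale of a large retry cache
    PriorityQueue<Long> expiryQueue = new PriorityQueue<>();
    for (long i = 0; i < n; i++) {
      expiryQueue.add(i);
    }

    long start = System.nanoTime();
    // Removing a non-head element forces the linear indexOf scan; this is
    // the same PriorityQueue.remove/removeAt pair seen in the stack trace.
    expiryQueue.remove(Long.valueOf(n - 1));
    long micros = (System.nanoTime() - start) / 1_000;
    System.out.println("one remove(Object) on " + n + " entries took "
        + micros + " us");
  }
}
{noformat}
Avoiding this remove path while loading edits is presumably why skipping the
retry cache during startup, as described in the report below, shrinks the
restart time so dramatically.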
> startup used too much time to load edits
> ----------------------------------------
>
> Key: HDFS-7609
> URL: https://issues.apache.org/jira/browse/HDFS-7609
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Affects Versions: 2.2.0
> Reporter: Carrey Zhan
> Attachments: HDFS-7609-CreateEditsLogWithRPCIDs.patch,
> recovery_do_not_use_retrycache.patch
>
>
> One day my namenode crashed because two journal nodes timed out at the same
> time under very high load, leaving behind about 100 million transactions in
> the edits log. (I still have no idea why they had not been rolled into the
> fsimage.)
> I tried to restart the namenode, but it showed that almost 20 hours would be
> needed to finish, and it was loading fsedits most of the time. I also tried to
> restart the namenode in recovery mode, but the loading speed was no different.
> I looked into the stack trace and judged that the slowness was caused by the
> retry cache. So I set dfs.namenode.enable.retrycache to false, and the restart
> process finished in half an hour.
> I think the retry cache is useless during startup, at least during the
> recovery process.
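For anyone hitting the same slow startup, the workaround described above amounts
to an hdfs-site.xml override along these lines (a sketch only; disabling the
retry cache also gives up its protection for retried non-idempotent RPCs, so it
is best treated as a stopgap for the restart/recovery case):
{noformat}
<!-- hdfs-site.xml: workaround from the description above. Disabling the
     retry cache trades away retry protection for non-idempotent RPCs, so
     treat this as a temporary setting rather than a permanent one. -->
<property>
  <name>dfs.namenode.enable.retrycache</name>
  <value>false</value>
</property>
{noformat}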
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)