[
https://issues.apache.org/jira/browse/HDFS-9293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yi Liu resolved HDFS-9293.
--------------------------
Resolution: Duplicate
> FSEditLog's 'OpInstanceCache' instance of threadLocal cache exists dirty
> 'rpcId',which may cause standby NN too busy to communicate
> --------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-9293
> URL: https://issues.apache.org/jira/browse/HDFS-9293
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.2.0, 2.7.1
> Reporter: 邓飞
> Assignee: 邓飞
> Fix For: 2.7.1
>
>
> In our cluster (Hadoop 2.2.0 with HA, 700+ DNs), we found that the standby
> NN tails the edit log slowly and holds the FSNamesystem write lock while it
> does so, which blocks the DNs' heartbeat/block report IPC requests. As a
> result, the active NN removes DNs as stale: they cannot send heartbeats
> because they are blocked processing the standby NN's register command
> (fixed in 2.7.1).
> Below is the standby NN stack:
> "Edit log tailer" prio=10 tid=0x00007f28fcf35800 nid=0x1a7d runnable [0x00007f0dd1d76000]
> java.lang.Thread.State: RUNNABLE
> at java.util.PriorityQueue.remove(PriorityQueue.java:360)
> at org.apache.hadoop.util.LightWeightCache.put(LightWeightCache.java:217)
> at org.apache.hadoop.ipc.RetryCache.addCacheEntry(RetryCache.java:270)
> - locked <0x00007f12817714b8> (a org.apache.hadoop.ipc.RetryCache)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntry(FSNamesystem.java:724)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:406)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:199)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:112)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:733)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:227)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:321)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:279)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296)
> at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:456)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:292)
>
> When an edit log op is applied and an IPC retry cache entry is found for it,
> the previous entry has to be removed from the priority queue, which is O(N).
> An update-blocks op does not need to record an rpcId in the edit log, except
> for the client 'updatePipeline' request, but we found many 'UpdateBlocksOp'
> entries carrying a repeated rpcId.
>
>
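The stale-rpcId effect described above can be illustrated with a minimal, self-contained sketch. This is not the Hadoop code: the class StaleRpcIdSketch, the Op type, the NO_RPC_ID sentinel and the log/run helpers are all hypothetical names chosen for illustration. It only assumes what the report states, namely that op instances are reused from a thread-local cache and that some ops do not set an rpcId of their own.

```java
import java.util.ArrayList;
import java.util.List;

public class StaleRpcIdSketch {
    // Hypothetical sentinel meaning "this op carries no RPC id".
    static final int NO_RPC_ID = -1;

    // Hypothetical stand-in for a reusable edit log op.
    static final class Op {
        String name;
        int rpcId = NO_RPC_ID;
    }

    // One reusable Op per thread, mimicking a thread-local op cache.
    static final ThreadLocal<Op> CACHE = ThreadLocal.withInitial(Op::new);

    // Prepares the cached op and returns the rpcId that would be written.
    // When reset is false, a previously stored rpcId is left in place.
    static int log(String name, Integer rpcId, boolean reset) {
        Op op = CACHE.get();
        if (reset) {
            op.rpcId = NO_RPC_ID;   // clear state left by the previous use
        }
        op.name = name;
        if (rpcId != null) {
            op.rpcId = rpcId;       // only some callers supply an rpcId
        }
        return op.rpcId;
    }

    // Logs one op with an rpcId followed by two ops without one.
    static List<Integer> run(boolean reset) {
        CACHE.remove();             // start from a fresh cached instance
        List<Integer> written = new ArrayList<>();
        written.add(log("updatePipeline", 42, reset));
        written.add(log("updateBlocks", null, reset));
        written.add(log("updateBlocks", null, reset));
        return written;
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + run(false)); // [42, 42, 42]
        System.out.println("with reset:    " + run(true));  // [42, -1, -1]
    }
}
```

Without the reset, every later op inherits the rpcId of the earlier client request, which would match the repeated rpcIds seen on 'UpdateBlocksOp' entries. Each such entry would then look like a retry cache hit on the standby and trigger the O(N) removal shown in the stack trace.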