[ https://issues.apache.org/jira/browse/HDFS-9293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

邓飞 updated HDFS-9293:
---------------------
    Description: 
  In our cluster (Hadoop 2.2.0 with HA, 700+ DNs), we found that the standby NN
tails the edit log slowly and holds the FSNamesystem write lock while doing so,
which blocks the DNs' heartbeat and block report IPC requests. As a result, the
active NN removes these DNs as stale: they cannot send heartbeats to it because
they are blocked while processing the standby NN's register command (fixed in
2.7.1).

  Below is the standby NN stack:

"Edit log tailer" prio=10 tid=0x00007f28fcf35800 nid=0x1a7d runnable [0x00007f0dd1d76000]
   java.lang.Thread.State: RUNNABLE
        at java.util.PriorityQueue.remove(PriorityQueue.java:360)
        at org.apache.hadoop.util.LightWeightCache.put(LightWeightCache.java:217)
        at org.apache.hadoop.ipc.RetryCache.addCacheEntry(RetryCache.java:270)
        - locked <0x00007f12817714b8> (a org.apache.hadoop.ipc.RetryCache)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntry(FSNamesystem.java:724)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:406)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:199)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:112)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:733)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:227)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:321)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:279)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:456)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:292)
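
The top two frames show LightWeightCache.put calling PriorityQueue.remove. java.util.PriorityQueue.remove(Object) is a linear scan, so every put on a key that is already cached costs O(N) in the cache size. The sketch below is a simplified model of that pattern, not the Hadoop implementation: the class RetryCacheCostSketch, its Entry type, and the 10-minute expiry are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Simplified, hypothetical model of a cache that keeps entries in a map
// for lookup and in a PriorityQueue for expiry-ordered eviction.
class RetryCacheCostSketch {
    /** Hypothetical cache entry: a key plus an expiration time. */
    static final class Entry implements Comparable<Entry> {
        final long key;
        final long expiresAt;
        Entry(long key, long expiresAt) {
            this.key = key;
            this.expiresAt = expiresAt;
        }
        @Override
        public int compareTo(Entry other) {
            return Long.compare(expiresAt, other.expiresAt);
        }
    }

    private final Map<Long, Entry> byKey = new HashMap<>();
    private final PriorityQueue<Entry> byExpiry = new PriorityQueue<>();
    int linearRemovals = 0; // number of O(N) removals performed so far

    void put(long key, long now) {
        Entry fresh = new Entry(key, now + 600_000L); // assumed 10-minute expiry
        Entry previous = byKey.put(key, fresh);
        if (previous != null) {
            // PriorityQueue.remove(Object) scans the backing array: O(N).
            byExpiry.remove(previous);
            linearRemovals++;
        }
        byExpiry.add(fresh);
    }

    int size() {
        return byExpiry.size();
    }

    public static void main(String[] args) {
        RetryCacheCostSketch cache = new RetryCacheCostSketch();
        for (long k = 0; k < 1000; k++) {
            cache.put(k, 0L); // distinct keys: no linear removal
        }
        for (int i = 0; i < 1000; i++) {
            cache.put(7L, i); // repeated key: one linear removal per put
        }
        System.out.println("linear removals: " + cache.linearRemovals);
        System.out.println("entries: " + cache.size());
    }
}
```

With distinct keys the removal branch is never taken; it is only the repeated key that makes each put pay for a scan of the whole queue.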
   
    When an edit log op is applied and an IPC retry cache entry for it already
exists, the previous entry has to be removed from the PriorityQueue, which is
O(N). An UpdateBlocksOp does not need to record an rpcId in the edit log unless
it comes from a client 'updatePipeline' request, but we found many
'UpdateBlocksOp' entries with a repeated rpcId in the edit log.
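
One way such repeated ids could arise, going by the issue title, is an op object cached per thread and reused without clearing its rpcId. The following is a minimal sketch of that failure mode under stated assumptions, not the FSEditLog code: OpReuseSketch, NO_CALL_ID, getWithoutReset, getWithReset and logUpdateBlocks are hypothetical names, and only UpdateBlocksOp is taken from the report.

```java
// Hypothetical model of a thread-local op cache that can leak a stale rpcId.
class OpReuseSketch {
    static final int NO_CALL_ID = -1; // assumed sentinel for "no RPC id recorded"

    /** Simplified stand-in for an edit log op that may carry an RPC call id. */
    static final class UpdateBlocksOp {
        String path;
        int rpcCallId = NO_CALL_ID;

        void reset() {
            path = null;
            rpcCallId = NO_CALL_ID;
        }
    }

    // One reusable op instance per thread.
    private static final ThreadLocal<UpdateBlocksOp> CACHE =
            ThreadLocal.withInitial(UpdateBlocksOp::new);

    /** Returns the cached instance as-is, including fields from its last use. */
    static UpdateBlocksOp getWithoutReset() {
        return CACHE.get();
    }

    /** Returns the cached instance after clearing fields from its last use. */
    static UpdateBlocksOp getWithReset() {
        UpdateBlocksOp op = CACHE.get();
        op.reset();
        return op;
    }

    /** Fills the op and returns the call id that would end up in the edit log. */
    static int logUpdateBlocks(UpdateBlocksOp op, String path,
                               boolean toLogRpcIds, int callId) {
        op.path = path;
        if (toLogRpcIds) {
            op.rpcCallId = callId; // only client-driven calls set the id
        }
        return op.rpcCallId;
    }

    public static void main(String[] args) {
        // A client updatePipeline request records its call id.
        int first = logUpdateBlocks(getWithoutReset(), "/a", true, 42);
        // An internal update on the same thread should carry no id,
        // but the reused instance still holds the previous one.
        int stale = logUpdateBlocks(getWithoutReset(), "/b", false, 0);
        // Resetting the instance before reuse clears the stale id.
        int clean = logUpdateBlocks(getWithReset(), "/c", false, 0);
        System.out.println(first + " " + stale + " " + clean);
    }
}
```

In this model the second op is written with the first op's id, which on replay would look like a second cache entry for the same RPC.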

     
  

> FSEditLog's thread-local 'OpInstanceCache' holds a dirty 'rpcId',
> which may make the standby NN too busy to communicate
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-9293
>                 URL: https://issues.apache.org/jira/browse/HDFS-9293
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.2.0, 2.7.1
>            Reporter: 邓飞
>            Assignee: 邓飞



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
