[jira] [Commented] (HDFS-9293) FSEditLog's 'OpInstanceCache' instance of threadLocal cache exists dirty 'rpcId',which may cause standby NN too busy to communicate

JIRA Fri, 23 Oct 2015 02:19:09 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-9293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970696#comment-14970696
 ]


邓飞 commented on HDFS-9293:
--------------------------

This is a stale ("dirty") reference problem in FSEditLog:

 private ThreadLocal<OpInstanceCache> cache =
      new ThreadLocal<OpInstanceCache>() {
    @Override
    protected OpInstanceCache initialValue() {
      return new OpInstanceCache();
    }
  };

Once an NN handler thread has initialized its OpInstanceCache instance, that 
thread keeps reusing the same cached op instances for every later call.
For example, in logUpdateBlocks:

public void logUpdateBlocks(String path, INodeFileUnderConstruction file,
      boolean toLogRpcIds) {
    UpdateBlocksOp op = UpdateBlocksOp.getInstance(cache.get())
      .setPath(path)
      .setBlocks(file.getBlocks());
    logRpcIds(op, toLogRpcIds);
    logEdit(op);
  }
 
/** Record the RPC IDs if necessary */
  private void logRpcIds(FSEditLogOp op, boolean toLogRpcIds) {
    if (toLogRpcIds) {
      op.setRpcClientId(Server.getClientId());
      op.setRpcCallId(Server.getCallId());
    }
  }

If a client recovers the pipeline even once, the cached FSEditLogOp instance 
has its RPC IDs set. logRpcIds does not clear them when toLogRpcIds is false, 
so later UpdateBlocksOp records written by the same handler thread, such as 
those from addBlock, which is marked @Idempotent, also carry the repeated 
RPC ID into the edit log.
That leaves the standby NN's IPC handler threads parked, which indirectly 
affects the active NN.
We found that 2.7.1 has the same problem.
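
To make the failure mode concrete, here is a minimal, self-contained sketch. 
It is not the Hadoop code: the class DirtyOpCacheSketch, the nested Op class, 
the replay helper and the INVALID_CALL_ID placeholder are all names made up 
for illustration. It only shows how a thread-local cached op keeps the call 
ID from an earlier call, and how clearing the op before reuse (one possible 
fix, not necessarily the one that should be committed) avoids that.

```java
class DirtyOpCacheSketch {
    /** Hypothetical stand-in for FSEditLogOp, reduced to the fields that matter here. */
    static class Op {
        // Placeholder sentinel meaning "no RPC call ID recorded" (assumption).
        static final int INVALID_CALL_ID = -2;
        String path;
        int rpcCallId = INVALID_CALL_ID;

        /** Clear all per-call state so nothing leaks into the next use. */
        void reset() {
            path = null;
            rpcCallId = INVALID_CALL_ID;
        }
    }

    // One cached op per handler thread, like the ThreadLocal cache in FSEditLog.
    private static final ThreadLocal<Op> CACHE = ThreadLocal.withInitial(Op::new);

    /** Returns the cached op, optionally resetting it before reuse. */
    static Op getInstance(boolean resetOnGet) {
        Op op = CACHE.get();
        if (resetOnGet) {
            op.reset();
        }
        return op;
    }

    /** Same shape as logUpdateBlocks + logRpcIds; returns the call ID that would be logged. */
    static int logUpdateBlocks(Op op, String path, boolean toLogRpcIds, int callId) {
        op.path = path;
        if (toLogRpcIds) {
            op.rpcCallId = callId; // only set, never cleared, as in logRpcIds
        }
        return op.rpcCallId;
    }

    /** Replays an updatePipeline-like call followed by an addBlock-like call on this thread. */
    static int[] replay(boolean resetOnGet) {
        int first = logUpdateBlocks(getInstance(resetOnGet), "/f", true, 42);
        int second = logUpdateBlocks(getInstance(resetOnGet), "/f", false, 43);
        return new int[] {first, second};
    }

    public static void main(String[] args) {
        int[] dirty = replay(false);
        int[] clean = replay(true);
        System.out.println("without reset: " + dirty[0] + ", " + dirty[1]);
        System.out.println("with reset:    " + clean[0] + ", " + clean[1]);
    }
}
```

Without the reset, the second call logs the call ID left over from the first 
one, even though it was asked not to log RPC IDs. With the reset, the second 
call logs only the placeholder value.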

 



> FSEditLog's  'OpInstanceCache' instance of threadLocal cache exists dirty 
> 'rpcId',which may cause standby NN too busy  to communicate 
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-9293
>                 URL: https://issues.apache.org/jira/browse/HDFS-9293
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.2.0, 2.7.1
>            Reporter: 邓飞
>            Assignee: 邓飞
>
>   In our cluster (Hadoop 2.2.0 with HA, 700+ DNs), we found that the standby 
> NN tails the edit log slowly and holds the FSNamesystem write lock while 
> doing so, which blocks the DNs' heartbeat and block report IPC requests. As a 
> result, the active NN removes DNs as stale, because they cannot send 
> heartbeats while blocked on processing the standby NN's register command 
> (fixed in 2.7.1).
>   Below is the standby NN stack:
> "Edit log tailer" prio=10 tid=0x00007f28fcf35800 nid=0x1a7d runnable [0x00007f0dd1d76000]
>    java.lang.Thread.State: RUNNABLE
>       at java.util.PriorityQueue.remove(PriorityQueue.java:360)
>       at org.apache.hadoop.util.LightWeightCache.put(LightWeightCache.java:217)
>       at org.apache.hadoop.ipc.RetryCache.addCacheEntry(RetryCache.java:270)
>       - locked <0x00007f12817714b8> (a org.apache.hadoop.ipc.RetryCache)
>       at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntry(FSNamesystem.java:724)
>       at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:406)
>       at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:199)
>       at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:112)
>       at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:733)
>       at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:227)
>       at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:321)
>       at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:279)
>       at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296)
>       at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:456)
>       at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:292)
>    
>     When an edit log op is applied and an IPC retry cache entry already 
> exists for it, the previous entry has to be removed from the priority queue, 
> which is O(N). An UpdateBlocksOp does not need to record an RPC ID in the 
> edit log except for a client updatePipeline request, but we found many 
> UpdateBlocksOp records carrying a repeated RPC ID.
>      
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
