[
https://issues.apache.org/jira/browse/HDFS-9293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14970696#comment-14970696
]
邓飞 commented on HDFS-9293:
--------------------------
This is a stale (dirty) reference problem in FSEditLog:
  private ThreadLocal<OpInstanceCache> cache =
      new ThreadLocal<OpInstanceCache>() {
    @Override
    protected OpInstanceCache initialValue() {
      return new OpInstanceCache();
    }
  };
Each NN handler thread initializes its own OpInstanceCache instance once, and
that thread then reuses the same cached op instances on every later call.
Take logUpdateBlocks as an example:
  public void logUpdateBlocks(String path, INodeFileUnderConstruction file,
      boolean toLogRpcIds) {
    UpdateBlocksOp op = UpdateBlocksOp.getInstance(cache.get())
      .setPath(path)
      .setBlocks(file.getBlocks());
    logRpcIds(op, toLogRpcIds);
    logEdit(op);
  }

  /** Record the RPC IDs if necessary */
  private void logRpcIds(FSEditLogOp op, boolean toLogRpcIds) {
    if (toLogRpcIds) {
      op.setRpcClientId(Server.getClientId());
      op.setRpcCallId(Server.getCallId());
    }
  }
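If a client recovers the pipeline even once, the cached FSEditLogOp instance
gets its RPC ID set, and nothing clears it afterwards. Later UpdateBlocksOp
edits written by the same handler thread, such as those from addBlock, which
is marked @Idempotent and should not log an RPC ID, therefore record the same
stale RPC ID in the edit log again.
That caused the standby NN's IPC handler threads to park, which indirectly
affected the active NN.
We also found that 2.7.1 has the same problem.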
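
The reuse problem described above can be reproduced in isolation. The following
is only a minimal sketch with simplified stand-in classes: Op, CACHE,
INVALID_CALL_ID and the resetFirst flag are hypothetical names for illustration,
not the real FSEditLogOp or OpInstanceCache API. It shows how a per-thread cached
op keeps the call ID from an earlier call, and how clearing the op before reuse
is one possible way to avoid that.

```java
class StaleRpcIdDemo {
    // Assumption: placeholder value meaning "no RPC call ID recorded".
    static final int INVALID_CALL_ID = -2;

    /** Hypothetical stand-in for a cached, reusable edit log op. */
    static class Op {
        String path;
        int rpcCallId = INVALID_CALL_ID;

        Op setPath(String p) {
            this.path = p;
            return this;
        }

        /** Clear per-call state so nothing leaks from an earlier use. */
        void reset() {
            path = null;
            rpcCallId = INVALID_CALL_ID;
        }
    }

    // One reusable Op per thread, mirroring the ThreadLocal cache pattern.
    static final ThreadLocal<Op> CACHE = ThreadLocal.withInitial(Op::new);

    /**
     * Simplified logUpdateBlocks: returns the call ID that would be
     * written to the edit log for this op.
     */
    static int logUpdateBlocks(String path, boolean toLogRpcIds,
                               int callId, boolean resetFirst) {
        Op op = CACHE.get();
        if (resetFirst) {
            op.reset();           // possible remedy: clear before reuse
        }
        op.setPath(path);
        if (toLogRpcIds) {
            op.rpcCallId = callId; // only set, never cleared, as in logRpcIds
        }
        return op.rpcCallId;
    }

    public static void main(String[] args) {
        System.out.println("updatePipeline logs call ID: "
            + logUpdateBlocks("/f", true, 42, false));
        System.out.println("addBlock without reset logs: "
            + logUpdateBlocks("/f", false, 7, false));
        System.out.println("addBlock with reset logs:    "
            + logUpdateBlocks("/f", false, 7, true));
    }
}
```

In this sketch the second call passes toLogRpcIds=false, yet it still returns
the call ID left behind by the first call unless the op is reset first.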
> FSEditLog's 'OpInstanceCache' instance of threadLocal cache exists dirty
> 'rpcId',which may cause standby NN too busy to communicate
> --------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-9293
> URL: https://issues.apache.org/jira/browse/HDFS-9293
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.2.0, 2.7.1
> Reporter: 邓飞
> Assignee: 邓飞
>
> In our cluster (Hadoop 2.2.0 HA, 700+ DNs), we found that the standby NN
> tails the edit log slowly and holds the FSNamesystem write lock while doing
> so, which blocks the DNs' heartbeat/block report IPC requests. As a result,
> the active NN removes DNs as stale, because they cannot send heartbeats
> while blocked processing the standby NN's register command (fixed in 2.7.1).
> Below is the standby NN stack:
> "Edit log tailer" prio=10 tid=0x00007f28fcf35800 nid=0x1a7d runnable [0x00007f0dd1d76000]
>    java.lang.Thread.State: RUNNABLE
>         at java.util.PriorityQueue.remove(PriorityQueue.java:360)
>         at org.apache.hadoop.util.LightWeightCache.put(LightWeightCache.java:217)
>         at org.apache.hadoop.ipc.RetryCache.addCacheEntry(RetryCache.java:270)
>         - locked <0x00007f12817714b8> (a org.apache.hadoop.ipc.RetryCache)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntry(FSNamesystem.java:724)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:406)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:199)
>         at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:112)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:733)
>         at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:227)
>         at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:321)
>         at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:279)
>         at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:296)
>         at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:456)
>         at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:292)
>
> When an edit log op is applied and an entry for it already exists in the
> IPC retry cache, the previous entry has to be removed from the priority
> queue, which is O(N). An update-blocks op does not need to record an RPC ID
> in the edit log except for a client 'updatePipeline' request, but we found
> many 'UpdateBlocksOp' entries with repeated RPC IDs.
>
>
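
The cost described in the issue can be illustrated with a small sketch. It is
not the real LightWeightCache or RetryCache code: RetryCacheSketch, Entry and
the method names are hypothetical. It assumes a cache that indexes entries by
call ID and also keeps them in a java.util.PriorityQueue ordered by expiry, so
re-adding an existing call ID must first remove the old entry with
PriorityQueue.remove(Object), which is a linear scan of the queue.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

class RetryCacheSketch {
    /** Hypothetical cache entry: a call ID plus its expiry time. */
    static class Entry {
        final long callId;
        final long expiry;

        Entry(long callId, long expiry) {
            this.callId = callId;
            this.expiry = expiry;
        }
    }

    private final Map<Long, Entry> byCallId = new HashMap<>();
    private final PriorityQueue<Entry> byExpiry =
        new PriorityQueue<Entry>(Comparator.comparingLong((Entry e) -> e.expiry));

    /** Add or replace the entry for callId. */
    void put(long callId, long expiry) {
        Entry fresh = new Entry(callId, expiry);
        Entry old = byCallId.put(callId, fresh);
        if (old != null) {
            // Repeated call ID: the old entry must leave the queue first.
            // PriorityQueue.remove(Object) scans the queue, so this is O(N).
            byExpiry.remove(old);
        }
        byExpiry.add(fresh);
    }

    int size() {
        return byExpiry.size();
    }

    /** Call ID of the entry that expires first. */
    long oldestCallId() {
        return byExpiry.peek().callId;
    }

    public static void main(String[] args) {
        RetryCacheSketch cache = new RetryCacheSketch();
        cache.put(1L, 100L);
        cache.put(2L, 200L);
        cache.put(1L, 300L); // same call ID again: takes the O(N) removal path
        System.out.println("entries: " + cache.size());
        System.out.println("oldest call ID: " + cache.oldestCallId());
    }
}
```

When most ops carry a fresh RPC ID the removal path is rarely taken; when many
ops repeat the same stale ID, as reported here, it runs on nearly every op.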