jiangyu created HDFS-9143:
-----------------------------
Summary: updateCountForQuota method during EditlogTailer loadEdit
can make SNN timeout very often
Key: HDFS-9143
URL: https://issues.apache.org/jira/browse/HDFS-9143
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 2.6.0, 2.4.0
Reporter: jiangyu
Priority: Minor
Many datanodes in our cluster log socket timeouts when sending heartbeat or
blockReceivedAndDeleted messages to the Standby NameNode (SNN), but this never
happens with the Active NameNode.
At first I thought the EditLogTailer was fetching so many edits that it
triggered a full GC, but the GC log shows that this is not the case. I then
investigated the code path and the logs and found that the SNN needs only a few
seconds to fetch the journal and merge it. However, if you open the SNN web
page while the merge is in progress, it does not respond, much like during the
stop-the-world pause of a full GC, even though no GC is running at that time.
So I took jstack dumps of the SNN over a period of time and found that all of
the time is spent in the updateCountForQuota method in FSImage.
updateCountForQuota is called every time edits are loaded (loadEdits). Starting
from ROOT, it updates the count of every directory with a quota in the
namespace, and it does so while holding the write lock of FSImage. As a result,
every time the SNN merges edits from the JournalNodes (JN), it effectively
stops the world.
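To illustrate why a long operation under the write lock looks like a GC pause
to the datanodes, here is a minimal Java sketch. It is not HDFS code: the class
NamespaceLockSketch, its methods, and the use of a plain ReentrantReadWriteLock
as a stand-in for the NameNode lock are all assumptions made for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical model, not the HDFS implementation.
class NamespaceLockSketch {
    // Stand-in for the lock that protects the namespace.
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    /** Simulated tailer: holds the write lock until `done` is released. */
    Thread startTailer(CountDownLatch lockHeld, CountDownLatch done) {
        Thread t = new Thread(() -> {
            lock.writeLock().lock();
            try {
                lockHeld.countDown(); // signal that the "merge" has started
                done.await();         // stands in for the long quota walk
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.writeLock().unlock();
            }
        });
        t.start();
        return t;
    }

    /** Simulated RPC handler: returns false if the lock is not free in time. */
    boolean handleHeartbeat(long timeoutMillis) throws InterruptedException {
        if (!lock.readLock().tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
            return false; // the caller would see this as a timeout
        }
        try {
            return true;
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        NamespaceLockSketch s = new NamespaceLockSketch();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(1);
        Thread tailer = s.startTailer(held, done);
        held.await();
        System.out.println("during merge: " + s.handleHeartbeat(100));
        done.countDown();
        tailer.join();
        System.out.println("after merge: " + s.handleHeartbeat(100));
    }
}
```

In this model the handler fails for as long as the simulated merge holds the
write lock and succeeds as soon as the lock is released, with no GC involved.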
I don't think the SNN needs to call updateCountForQuota every time it tails the
edits. When it transitions to Active, it calls updateCountForQuota anyway, so
no quota data is ever missed.
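The idea can be sketched in Java as follows. This is only a toy model under
stated assumptions, not a patch: DeferredQuotaSketch, Dir, the
recountOnEveryLoad flag, and the fullRecounts counter are hypothetical names,
and the real quota accounting in HDFS is more involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not the HDFS implementation.
class DeferredQuotaSketch {
    /** Minimal stand-in for a directory; not the HDFS INodeDirectory class. */
    static final class Dir {
        final List<Dir> children = new ArrayList<>();
        int files;        // files directly inside this directory
        long cachedCount; // cached usage that quota checks would read
        Dir(int files) { this.files = files; }
    }

    private final Dir root;
    private final boolean recountOnEveryLoad; // true models current behaviour
    int fullRecounts = 0;                     // number of full tree walks

    DeferredQuotaSketch(Dir root, boolean recountOnEveryLoad) {
        this.root = root;
        this.recountOnEveryLoad = recountOnEveryLoad;
    }

    /** Walks the whole subtree and refreshes every cached count. */
    private long recount(Dir d) {
        long n = 1 + d.files; // the directory itself plus its files
        for (Dir c : d.children) {
            n += recount(c);
        }
        d.cachedCount = n;
        return n;
    }

    /** Applies one batch of edits; the full walk is optional while standby. */
    void loadEdits(Dir target, int newFiles) {
        target.files += newFiles;
        if (recountOnEveryLoad) {
            recount(root);
            fullRecounts++;
        }
    }

    /** Always recounts once, so the counts are correct before serving. */
    void transitionToActive() {
        recount(root);
        fullRecounts++;
    }

    public static void main(String[] args) {
        Dir root = new Dir(2);
        Dir child = new Dir(3);
        root.children.add(child);
        DeferredQuotaSketch ns = new DeferredQuotaSketch(root, false);
        for (int i = 0; i < 3; i++) {
            ns.loadEdits(child, 5);
        }
        ns.transitionToActive();
        System.out.println(ns.fullRecounts + " full walk(s), root count "
                + root.cachedCount);
    }
}
```

With recountOnEveryLoad set to false, the three simulated edit batches cause no
tree walk at all, and the single walk in transitionToActive still leaves the
cached counts correct. The cached counts are stale while in standby, which is
acceptable only on the assumption that nothing reads them before the
transition.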