It looks like you are using Apache Hadoop 2.6.

a) Are there any promotion failures? This can be checked with the
following flags:

-XX:+PrintGCDetails -XX:+PrintPromotionFailure -XX:PrintFLSStatistics=1
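
For example (just a sketch; the GC log path is an assumption, and these options
should be merged with whatever you already set), in hadoop-env.sh it could look
like:

# hadoop-env.sh -- illustrative only; merge with your existing NN options
export HADOOP_NAMENODE_OPTS="${HADOOP_NAMENODE_OPTS} \
  -XX:+PrintGCDetails -XX:+PrintPromotionFailure -XX:PrintFLSStatistics=1 \
  -Xloggc:/var/log/hadoop-hdfs/nn-gc.log -XX:+PrintGCDateStamps"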

b) Is SurvivorRatio configured? Can you share all the configured GC
parameters, and ideally the GC logs as well?
The option -XX:+PrintTenuringDistribution shows the tenuring threshold and the
ages of objects in the new generation. It is useful for observing the lifetime
distribution of objects in an application.
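
As a rough sketch (the values below are placeholders, not recommendations),
the survivor/tenuring related options would look something like:

-XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=15 -XX:+PrintTenuringDistribution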

A periodic heap dump could also help.
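
For instance, a dump could be taken on a schedule with jmap (a sketch; the dump
path and pid lookup are just examples, and note that -dump:live forces a full
GC, so time it carefully on a busy NN):

NN_PID=$(pgrep -f org.apache.hadoop.hdfs.server.namenode.NameNode)
jmap -dump:live,format=b,file=/data/dumps/nn-$(date +%Y%m%d%H%M).hprof ${NN_PID}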

Have you also cross-checked the following?

i) NN and JN have dedicated disks for writing edit log transactions.
ii) "dfs.namenode.accesstime.precision" shouldn't be set to a low value, since
the NN will update the last-accessed time for each file (see the quick check
after this list).
iii) Group lookups aren't taking too long, as the NN performs one for every
request.
iv) The NN log for excessive spew (there are some JIRAs fixed after 2.6.0,
e.g. around decommissioning nodes, ...).
v) Are any recursive calls made very frequently?
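
For item (ii), the effective value can be checked quickly like this (the
default is 3600000 ms, i.e. one hour, and 0 disables access-time updates):

hdfs getconf -confKey dfs.namenode.accesstime.precision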



-----Original Message-----
From: Lin,Yiqun(vip.com) [mailto:yiqun01....@vipshop.com] 
Sent: Wednesday, September 26, 2018 10:26 AM
To: Wei-Chiu Chuang <weic...@cloudera.com>
Cc: Hdfs-dev <hdfs-dev@hadoop.apache.org>
Subject: Re: NN run progressively slower

Hi Wei-Chiu,

At the beginning, we noted HDFS-9260, which changed the structure of the
stored blocks. With many additions/removals under this structure, there can be
a performance degradation.
But we think this reaches a stable state, so HDFS-9260 isn't the root cause
from our point of view.

>Did you observe clients failing to close file due to insufficient number of 
>block replicas? Did NN fail over?
No failed-to-close-file warnings on the client side, and no NN failover
happened.

>Did you have gc logging enabled? Any chance to take a heap dump and analyze 
>what's in there?
We have enabled GC logging but haven't analyzed an NN heap dump yet.
From the GC log, we see the average young GC time is around 0.02s, but the
total young GC time within one day is increasing day by day.
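
(For reference, the same trend can also be watched live with jstat; the pid
lookup below is just an example. With -gcutil, the O column is old-gen
occupancy in percent, and YGC/YGCT are the young GC count and total time.)

NN_PID=$(pgrep -f org.apache.hadoop.hdfs.server.namenode.NameNode)
jstat -gcutil ${NN_PID} 10000    # sample every 10 seconds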

>There were quite some NN scalability and GC improvements between CDH5.5 ~ 
>CDH5.8 time frame. We have customers at/beyond your scale in your version but 
>I don't think I've heard similar symptoms.
Thanks for sharing this.

Thanks
Yiqun

From: Wei-Chiu Chuang [mailto:weic...@cloudera.com]
Sent: September 26, 2018 11:14
To: Lin Yiqun [Operations Center]
Cc: Hdfs-dev
Subject: Re: NN run progressively slower

Yiqun,
Is this related to HDFS-9260?
Note that HDFS-9260 was backported since CDH5.7 and above.

I'm interested to learn more. Did you observe clients failing to close file due 
to insufficient number of block replicas? Did NN fail over?
Did you have gc logging enabled? Any chance to take a heap dump and analyze 
what's in there?

There were quite some NN scalability and GC improvements between CDH5.5 ~ 
CDH5.8 time frame. We have customers at/beyond your scale in your version but I 
don't think I've heard similar symptoms.

Regards

On Tue, Sep 25, 2018 at 2:04 AM Lin,Yiqun(vip.com)
<yiqun01....@vipshop.com> wrote:
Hi hdfs developers:

We hit a bad problem after rolling-upgrading our Hadoop version from
2.5.0-cdh5.3.2 to 2.6.0-cdh5.13.1. The problem is that the NN runs slow
periodically (roughly every week). Concretely, for example, if we start up the
NN on Monday it runs fast, but by the weekend our cluster becomes very slow.

In the beginning, we thought this might be caused by FSN lock contention, and
we made some improvements there, e.g. making the block-removal interval
configurable and printing the FSN lock hold time. After this the problem still
exists, :(. So we suspect it may not be an HDFS RPC problem.

Finally we found a related phenomenon: every time the NN runs slow, its old
gen reaches a high value, around 100GB, while the NN's total metadata size is
only around 40GB in our cluster. So as a temporary workaround, we reduced the
heap size and trigger full GCs frequently. Now it looks better than before,
but we haven't found the root cause. We are not sure whether this is a JVM
tuning problem or an HDFS bug.
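
(For reference, a minimal sketch of how a full GC can be forced and the old
gen inspected; the pid lookup below is just an example, and jmap -histo:live
itself also triggers a full GC.)

NN_PID=$(pgrep -f org.apache.hadoop.hdfs.server.namenode.NameNode)
jcmd ${NN_PID} GC.run                     # force a full GC
jmap -histo:live ${NN_PID} | head -n 30   # top classes by instance count/bytes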

Has anyone met a similar problem on this version? Why has the NN old gen
space increased so greatly?

Some information of our env:
JDK1.8
500+ nodes, 150 million blocks, around 40GB of metadata in use.

We would appreciate it if anyone could share their comments.

Thanks
Yiqun.


--
A very happy Clouderan
