[
https://issues.apache.org/jira/browse/HDFS-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013221#comment-17013221
]
Surendra Singh Lilhore commented on HDFS-15067:
-----------------------------------------------
Thanks [~ayushtkn] for review.
{quote}I guess the standby/observer namenode will not be sending any response
to the datanode, so the heartbeat interval for the standby will always end up
at the configured max.
Just an opinion: the standby and observer will reach the max skip interval
anyway, so maybe we can move them directly to the max value after the first
heartbeat rather than growing exponentially.
{quote}
Do you think it will give any benefit? The standby/observer is not doing
anything anyway, so extra heartbeats sent by an independent thread will not
cost anything.
{quote}I think in case of failover, we should reset the counter to the start.
{quote}
Handled.
{quote}In case of Connection Exception, or any connection issues
{quote}
Handled.
{quote}The default value is 3, while an invalid value shoots to
{{StaleInterval - 1 HeartBeat}}. Both seem quite extreme, the first at the
lower end and the latter at the higher end. I think we can keep something as a
percentage of the stale interval, maybe 40% or 50% of the stale interval.
{quote}
An admin should change this configuration only if they know the NN and DN
communication pattern. A wrong value in a big cluster is not acceptable, and
if one was configured anyway, the admin should correct it when the system is
behaving abnormally.
I don't think configuring it as a percentage is a good idea. Heartbeats are a
major thing and should be counted in numbers only. For example, if a doctor
gives you some pills and asks you to take 10% of them daily, you need to
calculate how many pills to take, but the doctor doesn't know what result you
got from your calculation or whether you are taking the correct number of
pills.
Based on the configured heartbeat interval, the admin can easily work out the
maximum number of heartbeats that can be skipped, even in the worst case,
while the system still runs normally. The admin should try to skip the minimum
number of heartbeats, because skipping delays some other operations. I feel 3
heartbeats is ideal for a 3 sec heartbeat interval.
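To make the counting argument concrete, here is a rough sketch of the arithmetic an admin would do. The class and method names are hypothetical and are not taken from the patch. It assumes that skipping k heartbeats leaves a gap of (k + 1) heartbeat intervals between two heartbeats that are actually sent (the issue description's "skip one heartbeat, send every 6 sec" at a 3 sec interval), and it reads {{StaleInterval - 1 HeartBeat}} as the largest gap to allow. The 30 sec stale interval in the example is the stock default of dfs.namenode.stale.datanode.interval; a cluster may use a different value.

```java
// Sketch only: names are hypothetical, not taken from the HDFS-15067 patch.
final class HeartbeatSkipLimit {
    private HeartbeatSkipLimit() {}

    /** Seconds between two sent heartbeats when `skipped` heartbeats are dropped in between. */
    static long gapSeconds(long heartbeatIntervalSec, int skipped) {
        return (skipped + 1L) * heartbeatIntervalSec;
    }

    /**
     * Largest skip count whose gap stays at or below (staleInterval - 1 heartbeat).
     * This bound is an assumption about how the upper limit is meant to work.
     */
    static int maxSafeSkip(long heartbeatIntervalSec, long staleIntervalSec) {
        if (heartbeatIntervalSec <= 0) {
            throw new IllegalArgumentException("heartbeat interval must be positive");
        }
        long limit = staleIntervalSec / heartbeatIntervalSec - 2;
        return (int) Math.max(0L, limit);
    }

    public static void main(String[] args) {
        long heartbeatSec = 3;  // heartbeat interval used in this thread
        long staleSec = 30;     // assumed stale interval (stock default)
        System.out.println("gap when skipping 3: " + gapSeconds(heartbeatSec, 3) + " sec");
        System.out.println("max safe skip: " + maxSafeSkip(heartbeatSec, staleSec));
    }
}
```

Under these assumptions, skipping 3 heartbeats at a 3 sec interval gives a 12 sec gap, well inside a 30 sec stale interval, which is the kind of number an admin can check directly without converting from a percentage.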
{quote}nit: if the specified value gets changed, there should be a warn log
stating that the specified value is more than the stale interval, using default of..
{quote}
Handled.
> Optimize heartbeat for large cluster
> ------------------------------------
>
> Key: HDFS-15067
> URL: https://issues.apache.org/jira/browse/HDFS-15067
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode
> Affects Versions: 3.1.1
> Reporter: Surendra Singh Lilhore
> Assignee: Surendra Singh Lilhore
> Priority: Major
> Attachments: HDFS-15067.01.patch, HDFS-15067.02.patch,
> image-2020-01-09-18-00-49-556.png
>
>
> In a large cluster, the Namenode spends some time processing heartbeats. For
> example, in a 10K node cluster the Namenode processes 10K heartbeat RPCs
> every 3 sec. This impacts the client response time. The heartbeat can be
> optimized: a DN can start skipping one heartbeat if no
> work (write/replication/delete) has been allocated for a long time, so it
> sends a heartbeat every 6 sec. Once the DN starts getting work from the NN,
> it can go back to sending heartbeats normally.
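The skipping behaviour described above, together with the reset on failover or connection problems discussed in the review, could be sketched roughly as follows. This is only an illustration: the class and method names are hypothetical, and doubling the skip count is an assumption based on the review's mention of exponential growth, not something confirmed by the patch itself.

```java
// Sketch only: names and the doubling rule are assumptions, not the patch's code.
final class HeartbeatSkipPolicy {
    private final int maxSkip;   // upper bound on heartbeats skipped in a row
    private int currentSkip;     // 0 means every heartbeat is sent

    HeartbeatSkipPolicy(int maxSkip) {
        if (maxSkip < 0) {
            throw new IllegalArgumentException("maxSkip must not be negative");
        }
        this.maxSkip = maxSkip;
    }

    /**
     * Called after each heartbeat response. Returns how many heartbeats
     * to skip before sending the next one.
     */
    int onHeartbeat(boolean workAssigned) {
        if (workAssigned) {
            currentSkip = 0;  // the NN handed out work, go back to normal heartbeats
        } else {
            long next = (currentSkip == 0) ? 1L : currentSkip * 2L;
            currentSkip = (int) Math.min((long) maxSkip, next);
        }
        return currentSkip;
    }

    /** Called on failover or on a connection error: start again from normal heartbeats. */
    void reset() {
        currentSkip = 0;
    }

    public static void main(String[] args) {
        HeartbeatSkipPolicy policy = new HeartbeatSkipPolicy(3);
        for (int i = 0; i < 4; i++) {
            System.out.println("idle heartbeat, skip next: " + policy.onHeartbeat(false));
        }
        System.out.println("work assigned, skip next: " + policy.onHeartbeat(true));
    }
}
```

Keeping the cap as a plain count means the worst-case gap is simply (maxSkip + 1) heartbeat intervals, which is easy to compare against the stale interval.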