[jira] [Commented] (HDFS-15067) Optimize heartbeat for large cluster

Surendra Singh Lilhore (Jira) Fri, 10 Jan 2020 13:46:47 -0800


    [ 
https://issues.apache.org/jira/browse/HDFS-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013221#comment-17013221
 ]


Surendra Singh Lilhore commented on HDFS-15067:
-----------------------------------------------

Thanks [~ayushtkn] for the review.
{quote}I guess the standby/observer namenode will not be sending any response 
to the datanode, so the heartbeat interval for the standby will always be the 
max configured.

Just an opinion: the standby and observer will reach the max skip interval 
anyway, so maybe we can move them directly to the max value after the first 
heartbeat rather than increasing exponentially.
{quote}
Do you think it will give any benefit? The Standby/Observer is not doing 
anything anyway, so sending a few extra heartbeats from an independent thread 
will not cost anything.
{quote}I think in case of failover, we should reset the counter to the start.
{quote}
Handled.
{quote}In case of a connection exception, or any connection issues
{quote}
Handled.
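To make the behaviour discussed in the points above concrete, here is a minimal Python sketch of a skip counter that backs off exponentially while the DN is idle, is capped at a configured maximum, and resets on work, failover, or a connection error. It is not the patch code (which is Java); the class name, method names, and the doubling rule are illustrative assumptions.

```python
class HeartbeatSkipCounter:
    """Illustrative only: tracks how many heartbeats an idle DN may skip."""

    def __init__(self, max_skip=3):
        if max_skip < 0:
            raise ValueError("max_skip must be non-negative")
        self.max_skip = max_skip
        self.skip = 0  # heartbeats to skip before the next send

    def on_idle_response(self):
        """NN returned no work: back off exponentially, capped at max_skip."""
        self.skip = min(self.max_skip, max(1, self.skip * 2))
        return self.skip

    def reset(self):
        """Work received, failover, or connection error: heartbeat normally."""
        self.skip = 0


counter = HeartbeatSkipCounter(max_skip=3)
# Four idle responses in a row: the skip count grows, then stays at the cap.
steps = [counter.on_idle_response() for _ in range(4)]
counter.reset()  # e.g. after a failover
```

Jumping a standby or observer straight to the maximum, as suggested in the first quoted point, would amount to setting `skip = max_skip` after the first heartbeat instead of calling `on_idle_response()` repeatedly.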
{quote}The default value is 3, and an invalid value shoots up to 
{{StaleInterval - 1 HeartBeat}}. Both seem to be at the extremes, the first at 
the lower end and the latter at the higher end. I think we can use something 
expressed as a percentage of the stale interval, maybe 40% or 50%.
{quote}
An admin should change this configuration only if they know the NN and DN 
communication pattern. Configuring a wrong value in a big cluster is not 
acceptable, and an admin who has configured one should correct it once the 
system is seen behaving abnormally.

I don't think configuring it as a percentage is a good idea. Heartbeats are a 
major thing and should be counted in plain numbers. For example, if a doctor 
gives you some pills and asks you to take 10% of them daily, you need to 
calculate how many pills that is, but the doctor doesn't know what result you 
got from your calculation or whether you are taking the correct number of 
pills.

Based on the configured heartbeat interval, the admin can easily work out the 
maximum number of heartbeats that can be skipped, even in the worst case, while 
the system still runs normally. The admin should skip as few heartbeats as 
possible, because skipped heartbeats delay other operations. I feel 3 
heartbeats is ideal for a 3 sec heartbeat interval.
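As a worked example of the calculation a percentage setting would push onto the admin, the sketch below converts a percentage of the stale interval into a whole number of heartbeat intervals. It assumes a 3 sec heartbeat interval and a 30 sec stale datanode interval (the stock HDFS defaults); the function name is hypothetical and the code is illustrative Python, not part of the patch.

```python
def heartbeats_for_percent(stale_interval_ms, heartbeat_interval_ms, percent):
    """Whole heartbeat intervals that fit in `percent` of the stale interval."""
    if heartbeat_interval_ms <= 0:
        raise ValueError("heartbeat interval must be positive")
    # Integer arithmetic only, so there are no floating-point surprises.
    return (stale_interval_ms * percent // 100) // heartbeat_interval_ms


STALE_MS = 30_000      # assumed stale datanode interval
HEARTBEAT_MS = 3_000   # assumed heartbeat interval

for percent in (40, 50):
    print(percent, heartbeats_for_percent(STALE_MS, HEARTBEAT_MS, percent))
```

Under these assumed values, 40% and 50% work out to 4 and 5 heartbeat intervals, and the count changes whenever either interval is retuned, which is the hidden calculation the pills example is about.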
{quote}nit: if the specified value is changed, there should be a warn log 
stating that the specified value is more than the stale interval, using 
default of..
{quote}
Handled.
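A rough sketch of that kind of check, in illustrative Python rather than the patch's Java: the upper bound of one heartbeat less than the stale interval follows the {{StaleInterval - 1 HeartBeat}} limit quoted earlier, while the function name, the logger name, the message text, and the choice to fall back to the default of 3 are assumptions.

```python
import logging

LOG = logging.getLogger("heartbeat-skip")  # hypothetical logger name


def resolve_max_skip(configured, stale_interval_ms, heartbeat_interval_ms,
                     default=3):
    """Return a usable skip count, warning when the configured one is rejected."""
    # Largest skip count that still lands a heartbeat inside the stale interval.
    max_allowed = stale_interval_ms // heartbeat_interval_ms - 1
    if 0 <= configured <= max_allowed:
        return configured
    # Assumption: fall back to the default, itself clamped to the valid range.
    fallback = min(default, max(0, max_allowed))
    LOG.warning(
        "Configured max heartbeat skip %d is outside [0, %d] "
        "(stale interval %d ms, heartbeat interval %d ms); using %d",
        configured, max_allowed, stale_interval_ms, heartbeat_interval_ms,
        fallback)
    return fallback
```

For example, with a 30 sec stale interval and a 3 sec heartbeat interval, `resolve_max_skip(20, 30_000, 3_000)` logs a warning and returns 3, while `resolve_max_skip(3, 30_000, 3_000)` returns 3 silently.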

> Optimize heartbeat for large cluster
> ------------------------------------
>
>                 Key: HDFS-15067
>                 URL: https://issues.apache.org/jira/browse/HDFS-15067
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode
>    Affects Versions: 3.1.1
>            Reporter: Surendra Singh Lilhore
>            Assignee: Surendra Singh Lilhore
>            Priority: Major
>         Attachments: HDFS-15067.01.patch, HDFS-15067.02.patch, 
> image-2020-01-09-18-00-49-556.png
>
>
> In a large cluster, the Namenode spends some time processing heartbeats. For 
> example, in a 10K-node cluster the namenode processes 10K heartbeat RPCs 
> every 3 sec. This impacts the client response time. This heartbeat can be 
> optimized. A DN can start skipping one heartbeat if no 
> work (write/replication/delete) has been allocated for a long time, so the DN 
> sends a heartbeat every 6 sec. Once the DN starts getting work from the NN, 
> it can start sending heartbeats normally.



