[ 
https://issues.apache.org/jira/browse/HDFS-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011764#comment-17011764
 ] 

Surendra Singh Lilhore commented on HDFS-15067:
-----------------------------------------------

Design Idea :

=========

Added two new properties:
|*Property*|*Description*|*Default value*|
|dfs.datanode.heartbeat.optimizer.skip.max.heartbeat|Maximum number of heartbeats that can be skipped at one time.|3|
|dfs.datanode.heartbeat.optimizer.max.idle.time|Datanode idle time after which it starts skipping heartbeats. The default value of 0 means this feature is disabled.|0|

The user needs to configure the maximum number of heartbeats to skip and the
datanode max idle time. Once the datanode has been idle for this time, it
starts skipping heartbeats incrementally: after the first idle window elapses
it skips one heartbeat, after two idle windows it skips two heartbeats, and so
on. It still guarantees that at least one heartbeat is sent before the stale
interval expires.
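
The incremental behaviour described above can be sketched as follows. This is
only an illustration, not code from the patch: the class name, the method name
and the use of milliseconds for the idle time are assumptions, and the cap on
skipped heartbeats is passed in as a plain parameter.

```java
// Hypothetical sketch of the incremental skipping rule; not from the patch.
final class IdleSkipSketch {

  /**
   * Number of consecutive heartbeats to skip after the datanode has been
   * idle for idleTimeMs. One more heartbeat is skipped for every full idle
   * window that has elapsed, up to maxSkip.
   *
   * maxIdleTimeMs stands in for dfs.datanode.heartbeat.optimizer.max.idle.time
   * (the unit is an assumption); a value of 0 disables the feature.
   */
  static long heartbeatsToSkip(long idleTimeMs, long maxIdleTimeMs, long maxSkip) {
    if (maxIdleTimeMs <= 0 || idleTimeMs < 0) {
      return 0L; // feature disabled, or no idle time recorded
    }
    long elapsedIdleWindows = idleTimeMs / maxIdleTimeMs;
    return Math.min(elapsedIdleWindows, Math.max(0L, maxSkip));
  }

  public static void main(String[] args) {
    long idleWindowMs = 60_000L; // assumed idle window of one minute
    for (long idleMs = 0L; idleMs <= 300_000L; idleMs += 60_000L) {
      System.out.println(idleMs + " ms idle -> skip "
          + heartbeatsToSkip(idleMs, idleWindowMs, 3L));
    }
  }
}
```

With an idle window of one minute and a cap of 3, the datanode skips nothing
during the first minute, one heartbeat after one minute, two after two minutes,
and stays at three from the third minute on.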

The main logic is how many heartbeats can be skipped at one time, and this
depends on the stale interval of the namenode. That property is not available
in the datanode, so the value must be obtained from the namenode; the datanode
can receive it in DatanodeRegistration at the time of registration. The maximum
number of heartbeats to skip is calculated with this formula:

*Max heartbeats to skip = min((staleInterval - heartbeatInterval) / heartbeatInterval, configuredMaxHeartbeatSkip)*
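
As a worked example of the formula, here is a minimal sketch. The class and
method names are hypothetical, and clamping the result to zero is an added
safety assumption that the formula itself does not state. The sample values
use the stock HDFS defaults of 30000 ms for
dfs.namenode.stale.datanode.interval and 3 s for dfs.heartbeat.interval.

```java
// Hypothetical sketch of the formula; not code from the HDFS-15067 patch.
final class MaxSkipSketch {

  /**
   * min((staleInterval - heartbeatInterval) / heartbeatInterval,
   *     configuredMaxHeartbeatSkip), using integer division.
   * The result is clamped to 0 (an assumption) so that a stale interval
   * shorter than the heartbeat interval never yields a negative count.
   */
  static long maxHeartbeatsToSkip(long staleIntervalMs,
                                  long heartbeatIntervalMs,
                                  long configuredMaxSkip) {
    if (heartbeatIntervalMs <= 0) {
      throw new IllegalArgumentException("heartbeatIntervalMs must be positive");
    }
    long boundFromStaleInterval =
        (staleIntervalMs - heartbeatIntervalMs) / heartbeatIntervalMs;
    long bound = Math.min(boundFromStaleInterval, configuredMaxSkip);
    return Math.max(0L, bound);
  }

  public static void main(String[] args) {
    // 30 s stale interval, 3 s heartbeat: min(9, 3) = 3
    System.out.println(maxHeartbeatsToSkip(30_000L, 3_000L, 3L));
    // 9 s stale interval, 3 s heartbeat: min(2, 3) = 2
    System.out.println(maxHeartbeatsToSkip(9_000L, 3_000L, 3L));
  }
}
```

Skipping k heartbeats leaves a gap of (k + 1) heartbeat intervals between two
heartbeats, so keeping k at or below (staleInterval - heartbeatInterval) /
heartbeatInterval keeps that gap within the stale interval.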

!image-2020-01-09-18-00-49-556.png!

> Optimize heartbeat for large cluster
> ------------------------------------
>
>                 Key: HDFS-15067
>                 URL: https://issues.apache.org/jira/browse/HDFS-15067
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>    Affects Versions: 3.1.1
>            Reporter: Surendra Singh Lilhore
>            Assignee: Surendra Singh Lilhore
>            Priority: Major
>         Attachments: image-2020-01-09-18-00-49-556.png
>
>
> In a large cluster, the Namenode spends some time processing heartbeats. For 
> example, in a 10K-node cluster the namenode processes 10K heartbeat RPCs every 
> 3 sec. This impacts the client response time. This heartbeat can be 
> optimized. A DN can start skipping one heartbeat if no 
> work (write/replication/delete) has been allocated for a long time, so the DN 
> sends a heartbeat every 6 sec. Once the DN starts getting work from the NN, it 
> can start sending heartbeats normally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

