[ 
https://issues.apache.org/jira/browse/HDFS-7923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588819#comment-14588819
 ] 

Colin Patrick McCabe commented on HDFS-7923:
--------------------------------------------

This change is important for avoiding cascading failures (also known as
congestion collapse).  Currently, when the NN receives too many full block
reports (FBRs) at once, the extra reports slow down the processing of the
existing ones, because storing the large RPCs generates GC activity up to and
including full GCs.  So you get into a negative spiral: the NN can't process
FBRs fast enough, so more FBRs pile up, which slows it down even more, and so
on.  Keep in mind that with the previous code, the DN would send its full
block report all over again if the NN didn't respond within some timeout,
which could leave the NN with multiple (large) copies of the same full block
report queued up.  It's true that you could usually avoid these scenarios by
careful configuration and tuning, but this kind of fragile congestion collapse
behavior should not be in the system.  This change is also important for
maintaining any sort of reasonable quality of service on the NN, since
otherwise it can get completely flooded with FBRs and be unable to do any
other work.

> The DataNodes should rate-limit their full block reports by asking the NN on 
> heartbeat messages
> -----------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7923
>                 URL: https://issues.apache.org/jira/browse/HDFS-7923
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: 2.8.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>             Fix For: 2.8.0
>
>         Attachments: HDFS-7923.000.patch, HDFS-7923.001.patch, 
> HDFS-7923.002.patch, HDFS-7923.003.patch, HDFS-7923.004.patch, 
> HDFS-7923.006.patch, HDFS-7923.007.patch
>
>
> The DataNodes should rate-limit their full block reports.  They can do this 
> by first sending a heartbeat message to the NN with an optional boolean set 
> which requests permission to send a full block report.  If the NN responds 
> with another optional boolean set, the DN will send an FBR; if not, it will 
> wait until later.  This can be done compatibly with optional fields.
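
The handshake described above can be sketched roughly as follows. This is
only a toy model in Python, not the code from the attached patches: the class
names, the field names (request_full_block_report, full_block_report_allowed),
and the "at most max_in_flight FBRs at a time" admission policy are all
assumptions made for illustration. Both booleans default to False, which is
how an optional field that an older peer never sets would behave.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    datanode_id: str
    # Hypothetical optional field: an old DN never sets it, so it reads as False.
    request_full_block_report: bool = False


@dataclass
class HeartbeatResponse:
    # Hypothetical optional field: an old NN never sets it, so it reads as False.
    full_block_report_allowed: bool = False


class NameNode:
    """Toy NN that admits at most max_in_flight FBRs at once (assumed policy)."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = set()  # IDs of DNs currently allowed to send an FBR

    def handle_heartbeat(self, hb):
        if not hb.request_full_block_report:
            return HeartbeatResponse()
        if hb.datanode_id in self.in_flight:
            # Already admitted; repeat the grant instead of using a second slot.
            return HeartbeatResponse(full_block_report_allowed=True)
        if len(self.in_flight) < self.max_in_flight:
            self.in_flight.add(hb.datanode_id)
            return HeartbeatResponse(full_block_report_allowed=True)
        # Too busy: leave the flag unset so the DN asks again later.
        return HeartbeatResponse()

    def handle_full_block_report(self, datanode_id):
        """Accept an FBR only from an admitted DN, then free its slot."""
        if datanode_id not in self.in_flight:
            return False
        self.in_flight.discard(datanode_id)
        return True


class DataNode:
    def __init__(self, datanode_id, namenode):
        self.datanode_id = datanode_id
        self.namenode = namenode
        self.fbr_due = True  # a full block report is pending

    def heartbeat(self):
        """Send one heartbeat; return True if an FBR was sent this round."""
        resp = self.namenode.handle_heartbeat(
            Heartbeat(self.datanode_id, request_full_block_report=self.fbr_due)
        )
        if self.fbr_due and resp.full_block_report_allowed:
            self.namenode.handle_full_block_report(self.datanode_id)
            self.fbr_due = False
            return True
        return False  # no FBR needed, or permission denied: wait until later
```

In this sketch a DN that is denied permission simply keeps fbr_due set and
asks again on its next heartbeat, so the NN never holds more than
max_in_flight large reports at a time.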



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
