[ https://issues.apache.org/jira/browse/HDFS-7923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588819#comment-14588819 ]
Colin Patrick McCabe commented on HDFS-7923:
--------------------------------------------
This change is important for avoiding cascading failures (aka congestion
collapse). Currently, when the NN gets too many full block reports (FBRs) at
once, the extra block reports slow down the processing of the existing ones
(because storing the large RPCs generates GC activity, up to and including
full GCs). So you get into a negative spiral: can't process FBRs fast enough?
Then have some more FBRs, which will slow you down even more. And so on. Keep
in mind that with the previous code, the DN would send its full block report
all over again if the NN didn't respond within some timeout, which could lead
to the NN having multiple (large) copies of the same full block report queued
up. It's true that you could usually avoid these scenarios by careful
configuration and tuning, but this kind of fragile congestion-collapse
behavior should not be in the system. This change is also important for
maintaining any sort of reasonable quality of service on the NN, since
otherwise it can get completely flooded with FBRs and be unable to do any
other work.
> The DataNodes should rate-limit their full block reports by asking the NN on
> heartbeat messages
> -----------------------------------------------------------------------------------------------
>
> Key: HDFS-7923
> URL: https://issues.apache.org/jira/browse/HDFS-7923
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Affects Versions: 2.8.0
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Fix For: 2.8.0
>
> Attachments: HDFS-7923.000.patch, HDFS-7923.001.patch,
> HDFS-7923.002.patch, HDFS-7923.003.patch, HDFS-7923.004.patch,
> HDFS-7923.006.patch, HDFS-7923.007.patch
>
>
> The DataNodes should rate-limit their full block reports. They can do this
> by first sending a heartbeat message to the NN with an optional boolean set,
> which requests permission to send a full block report. If the NN responds
> with another optional boolean set, the DN will send an FBR; if not, it will
> wait until later. This can be done compatibly with optional fields.
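
The handshake described above can be sketched roughly as follows. This is a
minimal, self-contained illustration and not the code from the attached
patches: the class names, the method names, and the fixed cap on concurrently
granted reports are all assumptions made for the example. A DN that never
sets the request flag (for instance, one that predates the optional field) is
simply never granted permission by this sketch.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch; none of these names come from the HDFS-7923 patches. */
public class FbrHandshakeSketch {

    /** NameNode side: grants at most maxInFlight full block reports at a time. */
    static final class NameNodeSketch {
        private final int maxInFlight;                        // assumed tuning knob
        private final Set<String> granted = new HashSet<>();  // DNs holding permission

        NameNodeSketch(int maxInFlight) {
            this.maxInFlight = maxInFlight;
        }

        /**
         * Handles one heartbeat. requestFbr models the optional boolean sent by
         * the DN; the return value models the optional boolean in the response.
         */
        synchronized boolean heartbeat(String dnId, boolean requestFbr) {
            if (!requestFbr) {
                return false;                 // nothing was asked for
            }
            if (granted.contains(dnId)) {
                return true;                  // permission already held
            }
            if (granted.size() >= maxInFlight) {
                return false;                 // busy: the DN should ask again later
            }
            granted.add(dnId);
            return true;
        }

        /** Accepts an FBR only from a DN that was granted permission (an assumption). */
        synchronized boolean fullBlockReport(String dnId) {
            return granted.remove(dnId);
        }
    }

    /** DataNode side: asks on each heartbeat until the report has been sent. */
    static final class DataNodeSketch {
        private final String id;
        private boolean fbrDue = true;

        DataNodeSketch(String id) {
            this.id = id;
        }

        /** Runs one heartbeat cycle; returns true if an FBR was sent in it. */
        boolean heartbeatOnce(NameNodeSketch nn) {
            boolean allowed = nn.heartbeat(id, fbrDue);
            if (fbrDue && allowed && nn.fullBlockReport(id)) {
                fbrDue = false;
                return true;
            }
            return false;                     // wait until a later heartbeat
        }
    }

    public static void main(String[] args) {
        NameNodeSketch nn = new NameNodeSketch(1);
        System.out.println("dn-a granted: " + nn.heartbeat("dn-a", true));
        System.out.println("dn-b granted: " + nn.heartbeat("dn-b", true));
        nn.fullBlockReport("dn-a");           // dn-a finishes, freeing the slot
        System.out.println("dn-b granted on retry: " + nn.heartbeat("dn-b", true));
    }
}
```

The point of the sketch is that a DN which is refused keeps only a small
heartbeat in flight instead of a large FBR RPC, so the number of large
reports queued on the NN is bounded by the cap rather than by the number of
DNs.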