[ https://issues.apache.org/jira/browse/HDFS-7923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588819#comment-14588819 ]
Colin Patrick McCabe commented on HDFS-7923:
--------------------------------------------

This change is important for avoiding cascading failures (aka congestion collapse). Currently, when the NN gets too many full block reports (FBRs) at once, the extra block reports slow down the processing of the existing ones, because storing the large RPCs generates GC activity, up to and including full GCs. So you get into a negative spiral: can't process FBRs fast enough? Then have some more FBRs, which will slow you down even more. And so on.

Keep in mind that with the previous code, the DN would send its full block report all over again if the NN didn't respond within some timeout, which could lead to the NN having multiple (large) copies of the same full block report queued up.

It's true that you could usually avoid these scenarios by careful configuration and tuning, but this kind of fragile congestion-collapse behavior should not be in the system. This change is also important for maintaining any sort of reasonable quality of service on the NN, since otherwise we can get completely flooded with FBRs and can't do any other work.

> The DataNodes should rate-limit their full block reports by asking the NN on heartbeat messages
> -----------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7923
>                 URL: https://issues.apache.org/jira/browse/HDFS-7923
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: 2.8.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>             Fix For: 2.8.0
>
>         Attachments: HDFS-7923.000.patch, HDFS-7923.001.patch,
>                      HDFS-7923.002.patch, HDFS-7923.003.patch,
>                      HDFS-7923.004.patch, HDFS-7923.006.patch,
>                      HDFS-7923.007.patch
>
>
> The DataNodes should rate-limit their full block reports. They can do this by first sending a heartbeat message to the NN with an optional boolean set which requests permission to send a full block report.
> If the NN responds with another optional boolean set, the DN will send an FBR... if not, it will wait until later. This can be done compatibly with optional fields.
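
The handshake described above could be sketched roughly as follows. This is a minimal illustration and is not taken from the attached patches: the class names (`HeartbeatRequest`, `HeartbeatResponse`, `BlockReportLimiter`), the field names, and the "at most N outstanding grants" policy are all hypothetical. The two booleans stand in for the optional fields mentioned in the description.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical heartbeat sent by a DN; the flag asks permission to send an FBR. */
final class HeartbeatRequest {
    final String datanodeId;
    final boolean requestFullBlockReport;

    HeartbeatRequest(String datanodeId, boolean requestFullBlockReport) {
        this.datanodeId = datanodeId;
        this.requestFullBlockReport = requestFullBlockReport;
    }
}

/** Hypothetical heartbeat reply from the NN; the flag grants (or withholds) permission. */
final class HeartbeatResponse {
    final boolean fullBlockReportGranted;

    HeartbeatResponse(boolean fullBlockReportGranted) {
        this.fullBlockReportGranted = fullBlockReportGranted;
    }
}

/**
 * Hypothetical NN-side limiter: at most maxOutstanding DNs may hold
 * permission to send a full block report at any one time.
 */
final class BlockReportLimiter {
    private final int maxOutstanding;
    private final Set<String> granted = new HashSet<>();

    BlockReportLimiter(int maxOutstanding) {
        this.maxOutstanding = maxOutstanding;
    }

    /** Handles one heartbeat and decides whether this DN may send its FBR now. */
    synchronized HeartbeatResponse onHeartbeat(HeartbeatRequest req) {
        if (!req.requestFullBlockReport) {
            return new HeartbeatResponse(false);
        }
        if (granted.contains(req.datanodeId)) {
            // Permission was already given; repeat it rather than count it twice.
            return new HeartbeatResponse(true);
        }
        if (granted.size() < maxOutstanding) {
            granted.add(req.datanodeId);
            return new HeartbeatResponse(true);
        }
        // Too many reports in flight: the DN should ask again on a later heartbeat.
        return new HeartbeatResponse(false);
    }

    /** Called once the NN has finished processing the DN's full block report. */
    synchronized void onFullBlockReportProcessed(String datanodeId) {
        granted.remove(datanodeId);
    }
}
```

With a limit of 1 in this sketch, a second DN that asks while the first still holds permission is told to wait, and is granted permission on a later heartbeat once the first report has been processed. Because an old peer would simply never set or read the extra boolean, the exchange degrades to an ordinary heartbeat, which is the sense in which optional fields keep it compatible.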