[ https://issues.apache.org/jira/browse/HDFS-4288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13548872#comment-13548872 ]
Daryn Sharp commented on HDFS-4288:
-----------------------------------

bq. This will also solve issues related to DN restart and NN may not process the block report.

True, but the boolean patch (a simple incremental improvement on the existing trunk behavior) fixes both DN restart and reregistration after a broken connection. The NN cannot distinguish the two, so with a boolean the NN (naively) processes the BR associated with every (re)registration.

A sequence number that relies on a sentinel value allows the DN to dictate the NN's behavior. This works well for restart, since we know we are starting from 0. For a rereg, block updates may have been lost, so the sequence number must be guaranteed to always be reset to 0. That's naive like the boolean, and it might be hard or fragile to ensure it's always reset - in which case we might as well go with the boolean.

Better yet, the logic would be to check {{(seqNum == 0 || seqNum != lastSeqNum)}}. However, this requires writable/RPC changes on 23, protobuf changes on trunk/2, and trying to ensure backwards compatibility with an optional protobuf field, etc. Would you be ok if I filed another jira?

> NN accepts incremental BR as IBR in safemode
> --------------------------------------------
>
>                 Key: HDFS-4288
>                 URL: https://issues.apache.org/jira/browse/HDFS-4288
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>            Priority: Critical
>         Attachments: HDFS-4288.branch-23.patch
>
>
> If a DN is ready to send an incremental BR and the NN goes down, the DN will
> repeatedly try to reconnect. The NN will then process the DN's incremental
> BR as an initial BR. The NN now thinks the DN has only a few blocks, and
> will ignore all subsequent BRs from that DN until out of safemode -- which it
> may never do because of all the "missing" blocks on the affected DNs.
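
For illustration only, here is one possible reading of the {{(seqNum == 0 || seqNum != lastSeqNum)}} check from the comment, written as a small standalone Java sketch. The class FullReportGate, its methods, and the use of -1 for "nothing seen yet" are all hypothetical names and choices made for this example. They are not part of HDFS or of any attached patch, and the real change would also need the RPC/protobuf work mentioned above.

```java
/**
 * Hypothetical sketch of the per-DN check discussed in the comment.
 * Nothing here is actual HDFS code.
 */
final class FullReportGate {
    /** Sentinel the DN would send after a restart or a re-registration. */
    static final long RESET = 0L;

    /** Last sequence number accepted from this DN; -1 means none yet (assumption). */
    private long lastSeqNum = -1L;

    /** The condition from the comment: (seqNum == 0 || seqNum != lastSeqNum). */
    boolean shouldProcess(long seqNum) {
        return seqNum == RESET || seqNum != lastSeqNum;
    }

    /**
     * Applies the check and, if it passes, remembers the sequence number.
     * Returns true when the report would be processed.
     */
    boolean offer(long seqNum) {
        if (!shouldProcess(seqNum)) {
            return false;
        }
        lastSeqNum = seqNum;
        return true;
    }
}
```

With this sketch, a report carrying the sentinel 0 is always processed, a report repeating the last accepted sequence number is skipped, and any other value is processed. Whether that matches the intended NN behavior is exactly what a follow-up jira would need to settle.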