[ https://issues.apache.org/jira/browse/HDFS-17780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950796#comment-17950796 ]
ASF GitHub Bot commented on HDFS-17780:
---------------------------------------

shangshu-qian opened a new pull request, #7681:
URL: https://github.com/apache/hadoop/pull/7681

   ### Description of PR
   As described in [HDFS-17780](https://issues.apache.org/jira/browse/HDFS-17780), the retry logic in `sendIBRs()` can bypass the configured `dfs.blockreport.incremental.intervalMsec` and cause the IBR to be sent with every heartbeat. The fix updates the IBR timestamp (`lastIBR`) every time the RPC is called.

   ### How was this patch tested?
   No test needed.

   ### For code changes:
   - [X] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?

> The retry logic in IncrementalBlockReport may bypass the configured IBR
> interval, causing contention on NameNode
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-17780
>                 URL: https://issues.apache.org/jira/browse/HDFS-17780
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>    Affects Versions: 2.10.2, 3.4.1
>            Reporter: Shangshu Qian
>            Priority: Major
>
> In the current IncrementalBlockReportManager.sendIBRs(), the IBR is retried if
> the RPC (blockReceivedAndDeleted) to NN fails.
>
> {code:java}
> void sendIBRs(DatanodeProtocol namenode, DatanodeRegistration registration,
>     String bpid) throws IOException {
>   // Generate a list of the pending reports for each storage under the lock
>   final StorageReceivedDeletedBlocks[] reports = generateIBRs();
>   if (reports.length == 0) {
>     // Nothing new to report.
>     return;
>   }
>
>   // Send incremental block reports to the Namenode outside the lock
>   if (LOG.isDebugEnabled()) {
>     LOG.debug("call blockReceivedAndDeleted: " + Arrays.toString(reports));
>   }
>   boolean success = false;
>   final long startTime = monotonicNow();
>   try {
>     namenode.blockReceivedAndDeleted(registration, bpid, reports);
>     success = true;
>   } finally {
>     if (success) {
>       dnMetrics.addIncrementalBlockReport(monotonicNow() - startTime);
>       lastIBR = startTime;
>     } else {
>       // If we didn't succeed in sending the report, put all of the
>       // blocks back onto our queue, but only in the case where we
>       // didn't put something newer in the meantime.
>       putMissing(reports);
>     }
>   }
> }
> {code}
> When the RPC fails, the reports are put back on the queue to be retried, but
> `lastIBR` is not updated. Because `lastIBR` is left unchanged, the retry
> bypasses the configured `dfs.blockreport.incremental.intervalMsec` and happens
> on the next heartbeat.
>
> If `blockReceivedAndDeleted` fails because of high load on the NameNode,
> such retries only make the contention worse, resulting in a feedback loop.
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
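
To make the throttling argument in the issue concrete, here is a minimal, self-contained simulation. It is not Hadoop code. `IbrRetrySketch`, `ReportRpc`, `onHeartbeat`, `countAttempts` and the `stampEveryCall` flag are hypothetical names introduced only for this sketch, and the interval check (`now - lastIBR < intervalMs`) is an assumed simplification of how the DataNode decides whether an IBR is due. It compares updating `lastIBR` only on success with updating it on every call, against an RPC that always fails.

```java
import java.io.IOException;

public class IbrRetrySketch {
  /** Hypothetical stand-in for the NameNode RPC; not the real DatanodeProtocol. */
  interface ReportRpc {
    void send() throws IOException;
  }

  // Plays the role of dfs.blockreport.incremental.intervalMsec in this sketch.
  private final long intervalMs;
  // false: update lastIBR only on success (behaviour described in the issue).
  // true:  update lastIBR on every call (behaviour described in the PR).
  private final boolean stampEveryCall;
  private long lastIBR;
  private int attempts;

  IbrRetrySketch(long intervalMs, boolean stampEveryCall) {
    this.intervalMs = intervalMs;
    this.stampEveryCall = stampEveryCall;
    this.lastIBR = -intervalMs; // lets the first report go out at time 0
  }

  /** Called once per simulated heartbeat; assumed throttle check, not Hadoop's. */
  void onHeartbeat(long now, ReportRpc rpc) {
    if (now - lastIBR < intervalMs) {
      return; // interval not yet elapsed, skip this heartbeat
    }
    attempts++;
    boolean success = false;
    try {
      rpc.send();
      success = true;
    } catch (IOException e) {
      // The real code re-queues the reports; the sketch only counts attempts.
    } finally {
      if (success || stampEveryCall) {
        lastIBR = now;
      }
    }
  }

  /** Counts RPC attempts over a run of heartbeats against an always-failing RPC. */
  static int countAttempts(boolean stampEveryCall, long intervalMs,
      long heartbeatMs, int heartbeats) {
    IbrRetrySketch sketch = new IbrRetrySketch(intervalMs, stampEveryCall);
    ReportRpc alwaysFails = () -> {
      throw new IOException("simulated overloaded NameNode");
    };
    for (int i = 0; i < heartbeats; i++) {
      sketch.onHeartbeat(i * heartbeatMs, alwaysFails);
    }
    return sketch.attempts;
  }

  public static void main(String[] args) {
    // 10 heartbeats, 3000 ms apart, with a 9000 ms IBR interval.
    System.out.println("success-only stamp: " + countAttempts(false, 9000, 3000, 10));
    System.out.println("stamp every call:   " + countAttempts(true, 9000, 3000, 10));
  }
}
```

Under these assumptions, the success-only variant attempts the RPC on all 10 heartbeats, while the stamp-every-call variant attempts it 4 times (at 0, 9000, 18000 and 27000 ms), which is the interval-respecting behaviour the PR aims for.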