[jira] [Commented] (HDFS-15901) Solve the problem of DN repeated block reports occupying too many RPCs during Safemode

Xiaoqiao He (Jira) Fri, 19 Mar 2021 01:32:05 -0700


    [ https://issues.apache.org/jira/browse/HDFS-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304722#comment-17304722 ]


Xiaoqiao He commented on HDFS-15901:
------------------------------------

Thanks for involving me here. IIRC, we have discussed improving FBR (full 
block report) handling many times, and I believe some folks have done good 
work on it internally. For this case, sorry, I did not catch which version you 
have deployed, but on the trunk branch a repeated FBR is already discarded per 
DataNodeStorage during the Safemode phase of a NameNode restart, as the 
following code segment shows 
(org.apache.hadoop.hdfs.server.blockmanagement.BlockManager#processReport).
{code:java}
      if (namesystem.isInStartupSafeMode()
          && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
          && storageInfo.getBlockReportCount() > 0) {
        blockLog.info("BLOCK* processReport 0x{}: "
            + "discarded non-initial block report from {}"
            + " because namenode still in startup phase",
            strBlockReportId, nodeID);
        blockReportLeaseManager.removeLease(node);
        return !node.hasStaleStorages();
      }
{code}
This matches the log quoted in the issue description:
{code:java}
2021-03-14 08:16:25,873 [78438700] - INFO [Block report 
processor:BlockManager@2158] - BLOCK* processReport 0xexxxxxxxx: discarded 
non-initial block report from DatanodeRegistration(xxxxxxxx:port, 
datanodeUuid=xxxxxxxx, infoPort=xxxxxxxx, infoSecurePort=xxxxxxxx, 
ipcPort=xxxxxxxx, storageInfo=lv=xxxxxxxx;nsid=xxxxxxxx;c=0) because namenode 
still in startup phase
2021-03-14 08:16:31,521 [78444348] - INFO [Block report 
processor:BlockManager@2158] - BLOCK* processReport 0xexxxxxxxx: discarded 
non-initial block report from DatanodeRegistration(xxxxxxxx, 
datanodeUuid=xxxxxxxx, infoPort=xxxxxxxx, infoSecurePort=xxxxxxxx, 
ipcPort=xxxxxxxx, storageInfo=lv=xxxxxxxx;nsid=xxxxxxxx;c=0) because namenode 
still in startup phase
{code}
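To make the behavior of that check easier to reason about, here is a minimal standalone sketch of the same condition. It is not Hadoop code: the class SafemodeFbrSketch, its map of per-storage counters, and the leaveSafeMode() method are hypothetical stand-ins for namesystem.isInStartupSafeMode(), storageInfo.getBlockReportCount() and the StorageType.PROVIDED test in the segment above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the startup-safemode discard check; not Hadoop code. */
public class SafemodeFbrSketch {
    // Stand-in for storageInfo.getBlockReportCount(), keyed by storage ID.
    private final Map<String, Integer> reportCount = new HashMap<>();
    // Stand-in for namesystem.isInStartupSafeMode().
    private boolean inStartupSafeMode = true;

    /**
     * Returns true if the report is processed, false if it is discarded
     * as a non-initial report during startup safemode.
     */
    public boolean processReport(String storageId, boolean providedStorage) {
        int seen = reportCount.getOrDefault(storageId, 0);
        if (inStartupSafeMode && !providedStorage && seen > 0) {
            // Same three-part condition as the quoted segment: discard.
            return false;
        }
        reportCount.put(storageId, seen + 1);
        return true;
    }

    public void leaveSafeMode() {
        inStartupSafeMode = false;
    }

    public static void main(String[] args) {
        SafemodeFbrSketch nn = new SafemodeFbrSketch();
        System.out.println(nn.processReport("DS-1", false)); // initial report
        System.out.println(nn.processReport("DS-1", false)); // repeat in safemode
        nn.leaveSafeMode();
        System.out.println(nn.processReport("DS-1", false)); // repeat after safemode
    }
}
```

Note that in this model, as in the quoted segment, the decision is made on the NameNode side after the report has already arrived, so a discarded report has still used an RPC.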
In my experience, on our largest cluster (more than 10K nodes), it takes 
nearly 2 hours in the worst case. As discussed offline with [~jianghuazhu], I 
think it is necessary to dig into what is happening and where the time goes. 
It would be helpful if you could share more logs or other information. Thanks.

{quote}
We have recently gut out the FBR lease feature internally and implemented a new 
block report flow control system. It was designed by Daryn Sharp. It hasn't 
been tested fully yet, so we haven't shared it with the community.
{quote}
[~kihwal] I look forward to your new block report flow control system. Do you 
plan to submit it to the community? Thanks.

> Solve the problem of DN repeated block reports occupying too many RPCs during 
> Safemode
> --------------------------------------------------------------------------------------
>
>                 Key: HDFS-15901
>                 URL: https://issues.apache.org/jira/browse/HDFS-15901
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: JiangHua Zhu
>            Assignee: JiangHua Zhu
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When a cluster with thousands of nodes restarts the NameNode service, all 
> DataNodes send a full block report to the NameNode. During SafeMode, some 
> DataNodes may send their block reports to the NameNode multiple times, which 
> takes up too many RPCs. In fact, this is unnecessary.
> In this case, some block report leases will fail or time out, and in extreme 
> cases, the NameNode will stay in Safe Mode indefinitely.
> 2021-03-14 08:16:25,873 [78438700] - INFO  [Block report 
> processor:BlockManager@2158] - BLOCK* processReport 0xexxxxxxxx: discarded 
> non-initial block report from DatanodeRegistration(xxxxxxxx:port, 
> datanodeUuid=xxxxxxxx, infoPort=xxxxxxxx, infoSecurePort=xxxxxxxx, 
> ipcPort=xxxxxxxx, storageInfo=lv=xxxxxxxx;nsid=xxxxxxxx;c=0) because namenode 
> still in startup phase
> 2021-03-14 08:16:31,521 [78444348] - INFO  [Block report 
> processor:BlockManager@2158] - BLOCK* processReport 0xexxxxxxxx: discarded 
> non-initial block report from DatanodeRegistration(xxxxxxxx, 
> datanodeUuid=xxxxxxxx, infoPort=xxxxxxxx, infoSecurePort=xxxxxxxx, 
> ipcPort=xxxxxxxx, storageInfo=lv=xxxxxxxx;nsid=xxxxxxxx;c=0) because namenode 
> still in startup phase
> 2021-03-13 18:35:38,200 [29191027] - WARN  [Block report 
> processor:BlockReportLeaseManager@311] - BR lease 0xxxxxxxxx is not valid for 
> DN xxxxxxxx, because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN  [Block report 
> processor:BlockReportLeaseManager@311] - BR lease 0xxxxxxxxx is not valid for 
> DN xxxxxxxx, because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN  [Block report 
> processor:BlockReportLeaseManager@317] - BR lease 0xxxxxxxxx is not valid for 
> DN xxxxxxxx, because the lease has expired.
> 2021-03-13 18:36:08,145 [29220972] - WARN  [Block report 
> processor:BlockReportLeaseManager@317] - BR lease 0xxxxxxxxx is not valid for 
> DN xxxxxxxx, because the lease has expired.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

