[ https://issues.apache.org/jira/browse/HDFS-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17655752#comment-17655752 ]
Xing Lin commented on HDFS-15901:
---------------------------------
Is there any follow-up on this issue?
We are seeing a similar issue at LinkedIn as well. On some of our large
clusters, the standby NN can get stuck in safe mode when it is restarted. Each
time the NN is stuck in safe mode, the number of missing blocks is different.
We are not sure what is causing the issue, but could the following hypothesis
be the case?
In safe mode, the standby NN receives the first full block report (FBR) from
DN1/DN2/DN3. At a later time, blockA is deleted and removed from DN1/DN2/DN3,
and each of them sends a new incremental block report (IBR). However, the NN
does not process these IBRs (for example, because it is paused due to GC). The
NN will not process any non-initial FBR from DN1/DN2/DN3 either, so it never
learns that blockA has already been removed from the cluster. blockA then
becomes a missing block that the NN waits for forever.
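To make the hypothesis concrete, here is a toy simulation of the sequence of
events. It is only a sketch and not HDFS code: ToyNameNode, its methods, and
the processed flag are hypothetical names made up for illustration. The one
behavior taken from the issue is that a non-initial FBR is discarded during
startup, as in the "discarded non-initial block report" log lines quoted below.

```python
class ToyNameNode:
    """Toy stand-in for a NameNode in its startup phase (not HDFS code)."""

    def __init__(self):
        self.seen_initial_fbr = set()  # DNs whose first FBR was accepted
        self.replicas = {}             # block id -> DNs believed to hold it

    def full_block_report(self, dn, blocks):
        # Only the first FBR from each DN is processed during startup;
        # any later FBR from the same DN is discarded.
        if dn in self.seen_initial_fbr:
            return False
        self.seen_initial_fbr.add(dn)
        for block in blocks:
            self.replicas.setdefault(block, set()).add(dn)
        return True

    def incremental_block_report(self, dn, deleted_blocks, processed=True):
        # processed=False models an IBR that the NN never handles
        # (assumption: for example, it is lost around a long GC pause).
        if not processed:
            return False
        for block in deleted_blocks:
            self.replicas.get(block, set()).discard(dn)
        return True


nn = ToyNameNode()
dns = ["DN1", "DN2", "DN3"]
on_disk = {dn: {"blockA", "blockB"} for dn in dns}

# Step 1: each DN sends its initial FBR, which the NN accepts.
for dn in dns:
    nn.full_block_report(dn, on_disk[dn])

# Step 2: blockA is deleted on every DN, but the IBRs are never processed.
for dn in dns:
    on_disk[dn].discard("blockA")
    nn.incremental_block_report(dn, ["blockA"], processed=False)

# Step 3: each DN sends another FBR; the NN discards all of them.
later_fbr_accepted = [nn.full_block_report(dn, on_disk[dn]) for dn in dns]

stale_view = nn.replicas["blockA"]
actual_holders = {dn for dn in dns if "blockA" in on_disk[dn]}
print("NN believes blockA is on:", sorted(stale_view))
print("DNs that really hold blockA:", sorted(actual_holders))
```

In this toy model the NN's view of blockA never converges with what the DNs
hold, because the only reports that could correct it are either lost (the
IBRs) or discarded (the later FBRs).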
> Solve the problem of DN repeated block reports occupying too many RPCs during
> Safemode
> --------------------------------------------------------------------------------------
>
> Key: HDFS-15901
> URL: https://issues.apache.org/jira/browse/HDFS-15901
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: JiangHua Zhu
> Assignee: JiangHua Zhu
> Priority: Major
> Labels: pull-request-available
> Time Spent: 1h
> Remaining Estimate: 0h
>
> When a cluster has thousands of nodes and we restart the NameNode service,
> all DataNodes send a full block report to the NameNode. During SafeMode, some
> DataNodes may send block reports to the NameNode multiple times, which takes
> up too many RPCs. In fact, this is unnecessary.
> In this case, some block report leases fail or time out, and in extreme
> cases the NameNode stays in Safe Mode indefinitely.
> 2021-03-14 08:16:25,873 [78438700] - INFO [Block report
> processor:BlockManager@2158] - BLOCK* processReport 0xexxxxxxxx: discarded
> non-initial block report from DatanodeRegistration(xxxxxxxx:port,
> datanodeUuid=xxxxxxxx, infoPort=xxxxxxxx, infoSecurePort=xxxxxxxx,
> ipcPort=xxxxxxxx, storageInfo=lv=xxxxxxxx;nsid=xxxxxxxx;c=0) because namenode
> still in startup phase
> 2021-03-14 08:16:31,521 [78444348] - INFO [Block report
> processor:BlockManager@2158] - BLOCK* processReport 0xexxxxxxxx: discarded
> non-initial block report from DatanodeRegistration(xxxxxxxx,
> datanodeUuid=xxxxxxxx, infoPort=xxxxxxxx, infoSecurePort=xxxxxxxx,
> ipcPort=xxxxxxxx, storageInfo=lv=xxxxxxxx;nsid=xxxxxxxx;c=0) because namenode
> still in startup phase
> 2021-03-13 18:35:38,200 [29191027] - WARN [Block report
> processor:BlockReportLeaseManager@311] - BR lease 0xxxxxxxxx is not valid for
> DN xxxxxxxx, because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN [Block report
> processor:BlockReportLeaseManager@311] - BR lease 0xxxxxxxxx is not valid for
> DN xxxxxxxx, because the DN is not in the pending set.
> 2021-03-13 18:36:08,143 [29220970] - WARN [Block report
> processor:BlockReportLeaseManager@317] - BR lease 0xxxxxxxxx is not valid for
> DN xxxxxxxx, because the lease has expired.
> 2021-03-13 18:36:08,145 [29220972] - WARN [Block report
> processor:BlockReportLeaseManager@317] - BR lease 0xxxxxxxxx is not valid for
> DN xxxxxxxx, because the lease has expired.
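The two WARN reasons in the quoted logs can be pictured with a toy lease
table. This is a minimal sketch under stated assumptions, not the actual
BlockReportLeaseManager: ToyLeaseManager, its methods, the 300-second lease
length, and the timestamps are all hypothetical, and real lease IDs are left
out. Only the two rejection reasons come from the log lines above.

```python
class ToyLeaseManager:
    """Toy block report lease table (not the HDFS implementation)."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.pending = {}  # DN -> time at which its lease expires

    def grant(self, dn, now):
        # Give the DN a lease that is valid for lease_seconds.
        self.pending[dn] = now + self.lease_seconds

    def check(self, dn, now):
        # Mirrors the two rejection reasons seen in the WARN lines.
        if dn not in self.pending:
            return "not in the pending set"
        if now > self.pending[dn]:
            return "lease has expired"
        return "valid"

    def release(self, dn):
        # Assumption: a lease is consumed once its report is handled.
        self.pending.pop(dn, None)


leases = ToyLeaseManager(lease_seconds=300)  # arbitrary illustrative value
leases.grant("DN1", now=0)
leases.grant("DN2", now=0)

first = leases.check("DN1", now=10)   # report arrives while the lease is live
leases.release("DN1")                 # the lease is used up by that report
repeat = leases.check("DN1", now=20)  # the same DN reports again
late = leases.check("DN2", now=400)   # report delayed past the lease window

print(first, "|", repeat, "|", late)
```

In this model a repeated report from a DN whose lease was already used is
rejected as "not in the pending set", and a report that is delayed too long is
rejected as "lease has expired", so both kinds of report consume an RPC without
making progress.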