[ 
https://issues.apache.org/jira/browse/HDFS-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16843936#comment-16843936
 ] 

He Xiaoqiao commented on HDFS-12914:
------------------------------------

[~smarella] Thanks for your report. The information you provided is complete 
and the cause is clear. 
Block report processing is a very heavy request, and its processing time can be 
longer than that of other RPCs, especially when one DataNode holds a very large 
number of blocks and the report is not the first block report right after 
NameNode startup.
For the issue you reported, the sequence is:
a. t1: the DataNode requests a full block report (FBR) lease from the NameNode 
through a heartbeat.
b. t2: the lease is returned to the DataNode.
c. t3: the DataNode sends the FBR.
d. t4: the FBR enters the NameNode call queue.
e. t5: the NameNode begins to process the FBR one #StorageBlockReport at a 
time and processes the first 3 #StorageBlockReport successfully.
f. t6: the NameNode processes the fourth #StorageBlockReport, finds that the 
lease has expired, logs `the lease has expired`, and removes the lease.
g. t7: the NameNode processes the remaining 8 #StorageBlockReport; the lease 
is already gone, so it logs `the DN is not in the pending set`.
Here t5 - t1 < 5min and t6 - t1 > 5min. 
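
The timeline above can be replayed with a small simulation. This is only a 
simplified sketch, not the actual BlockReportLeaseManager code: the class and 
method names (LeaseTimelineSketch, processPerStorage, sampleTimes) are 
hypothetical, the timestamps are invented for illustration, and t1 is treated 
as the lease start, which is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model; not the real BlockReportLeaseManager.
class LeaseTimelineSketch {
    static final long LEASE_MS = 5 * 60 * 1000L; // 5min lease, as in the comment

    // Applies the lease check to every storage report of one FBR.
    // processTimesMs[i] is when the i-th storage report is processed,
    // measured from t1 (treated here as the lease start, an assumption).
    static List<String> processPerStorage(long[] processTimesMs) {
        List<String> outcomes = new ArrayList<>();
        boolean leasePending = true;
        for (long elapsed : processTimesMs) {
            if (!leasePending) {
                outcomes.add("rejected: the DN is not in the pending set");
            } else if (elapsed > LEASE_MS) {
                leasePending = false; // an expired lease is removed when detected
                outcomes.add("rejected: the lease has expired");
            } else {
                outcomes.add("accepted");
            }
        }
        return outcomes;
    }

    // Invented timestamps: 12 storage reports, 4s apart, starting at 4min50s.
    static long[] sampleTimes() {
        long[] times = new long[12];
        for (int i = 0; i < times.length; i++) {
            times[i] = 290_000L + i * 4_000L;
        }
        return times;
    }

    public static void main(String[] args) {
        List<String> outcomes = processPerStorage(sampleTimes());
        for (int i = 0; i < outcomes.size(); i++) {
            System.out.println("storage " + (i + 1) + ": " + outcomes.get(i));
        }
    }
}
```

With these invented timestamps the first 3 storage reports are accepted, the 
fourth hits the expired lease, and the remaining 8 are rejected because the 
lease was already removed, which matches steps e through g.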

I think the NameNode load was very high during that period, and the call queue 
of the service RPC port (or the RPC port, if no service RPC port is configured) 
stayed full for a long time (possibly longer than 5min).
As mentioned above, the root cause is that we check the lease for every 
#StorageBlockReport of a DataNode. So I think the solution is also clear: 
check the lease once per DataNode rather than once per #StorageBlockReport 
of that DataNode.
I would like to follow up on this issue and will submit a patch later.
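
To make the proposal concrete, here is a minimal sketch of checking the lease 
once per DataNode report. It is not the patch for this issue: the class and 
method names (CheckLeaseOnceSketch, leaseValid, processReport) are 
hypothetical, and times are measured from the lease start, which is an 
assumption.

```java
// Hypothetical, simplified model of the proposed check; not the actual patch.
class CheckLeaseOnceSketch {
    static final long LEASE_MS = 5 * 60 * 1000L; // 5min lease, as in the comment

    // True while the lease is still within its lifetime.
    static boolean leaseValid(long elapsedMs) {
        return elapsedMs <= LEASE_MS;
    }

    // Checks the lease once, when processing of the FBR starts, and then
    // processes every storage report without re-checking.
    // Returns the number of storage reports processed.
    static int processReport(long[] processTimesMs) {
        if (processTimesMs.length == 0 || !leaseValid(processTimesMs[0])) {
            return 0; // whole report rejected up front
        }
        int processed = 0;
        for (long ignored : processTimesMs) {
            processed++; // real code would apply the storage report here
        }
        return processed;
    }

    public static void main(String[] args) {
        // Invented timestamps: processing starts at 4min50s and runs past 5min.
        long[] times = {290_000L, 294_000L, 298_000L, 302_000L, 306_000L};
        System.out.println("processed: " + processReport(times));
    }
}
```

In this sketch a report whose processing starts before the lease expires is 
applied in full, even if later storage reports are handled after the 5min 
mark, so the DataNode does not end up with only part of its storages reported.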

> Block report leases cause missing blocks until next report
> ----------------------------------------------------------
>
>                 Key: HDFS-12914
>                 URL: https://issues.apache.org/jira/browse/HDFS-12914
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.8.0
>            Reporter: Daryn Sharp
>            Priority: Critical
>
> {{BlockReportLeaseManager#checkLease}} will reject FBRs from DNs for 
> conditions such as "unknown datanode", "not in pending set", "lease has 
> expired", wrong lease id, etc.  Lease rejection does not throw an exception.  
> It returns false, which bubbles up to {{NameNodeRpcServer#blockReport}} and 
> is interpreted as {{noStaleStorages}}.
> A re-registering node whose FBR is rejected from an invalid lease becomes 
> active with _no blocks_.  A replication storm ensues possibly causing DNs to 
> temporarily go dead (HDFS-12645), leading to more FBR lease rejections on 
> re-registration.  The cluster will have many "missing blocks" until the DN's 
> next FBR is sent and/or forced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
