[ 
https://issues.apache.org/jira/browse/HDFS-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16286285#comment-16286285
 ] 

Daryn Sharp commented on HDFS-12914:
------------------------------------

Had a cluster with a job causing unusually heavy IO.  DNs became moderately
congested with commands.  Eventually one was declared dead.  When it rejoined
a few minutes later, its FBR was rejected with "because the DN is not in the
pending set".  The resulting replication storm, combined with the bad job,
caused nodes to go dead like dominoes.  Some that rejoined had their FBR
rejected with "is not valid for unknown datanode" in addition to "because the
DN is not in the pending set".

On a 2400-node cluster, ~400 nodes were temporarily dead.  304 had their FBRs
rejected when rejoining.  80k blocks were missing.  Had to force FBRs to bring
the blocks back.

I have no clue why a rejected report clears the storage/node's stale state!

> Block report leases cause missing blocks until next report
> ----------------------------------------------------------
>
>                 Key: HDFS-12914
>                 URL: https://issues.apache.org/jira/browse/HDFS-12914
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.8.0
>            Reporter: Daryn Sharp
>            Priority: Critical
>
> {{BlockReportLeaseManager#checkLease}} will reject FBRs from DNs for
> conditions such as "unknown datanode", "not in pending set", "lease has
> expired", wrong lease id, etc.  Lease rejection does not throw an exception.
> It returns false, which bubbles up to {{NameNodeRpcServer#blockReport}} and
> is interpreted as {{noStaleStorages}}.
> A re-registering node whose FBR is rejected because of an invalid lease
> becomes active with _no blocks_.  A replication storm ensues, possibly
> causing DNs to temporarily go dead (HDFS-12645), leading to more FBR lease
> rejections on re-registration.  The cluster will have many "missing blocks"
> until each DN's next FBR is sent and/or forced.
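
To illustrate the failure mode described above, here is a minimal toy model,
not the actual HDFS code.  The class names (LeaseRejectionSketch, LeaseManager,
NameNodeModel) and the simplified blockReport signature are hypothetical
stand-ins.  It only shows how a single boolean return value can conflate "the
lease was rejected" with the legitimate answer to "are there no stale
storages?", so the caller cannot tell that the report was dropped.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model only; these classes are hypothetical, not the real HDFS classes. */
public class LeaseRejectionSketch {

    /** Hypothetical lease manager: a DN may report only while it is pending. */
    static class LeaseManager {
        private final Set<String> pending = new HashSet<>();

        void addPending(String dnId) {
            pending.add(dnId);
        }

        /** Rejection is signalled by returning false, not by throwing. */
        boolean checkLease(String dnId) {
            return pending.remove(dnId);
        }
    }

    /** Hypothetical NameNode-side view of block reports. */
    static class NameNodeModel {
        final LeaseManager leases = new LeaseManager();
        int blocksRecorded = 0;

        /**
         * The return value is meant to answer "no stale storages?", but a
         * rejected lease also returns false, so both cases look the same.
         */
        boolean blockReport(String dnId, int blocksInReport, boolean hasStaleStorage) {
            if (!leases.checkLease(dnId)) {
                return false; // report dropped; no blocks are recorded
            }
            blocksRecorded += blocksInReport;
            return !hasStaleStorage;
        }
    }

    public static void main(String[] args) {
        NameNodeModel nn = new NameNodeModel();

        // A re-registering DN reports without a valid lease: the report is
        // rejected, yet the caller only sees an ordinary 'false'.
        boolean first = nn.blockReport("dn-1", 1000, false);
        System.out.println("result=" + first + " blocksRecorded=" + nn.blocksRecorded);

        // With a pending lease the same report is accepted and recorded.
        nn.leases.addPending("dn-1");
        boolean second = nn.blockReport("dn-1", 1000, false);
        System.out.println("result=" + second + " blocksRecorded=" + nn.blocksRecorded);
    }
}
```

In this sketch the first call leaves the node known to the NameNode with zero
blocks recorded, which corresponds to the "active with no blocks" state in the
description.  A distinct outcome for rejection (an exception or a separate
status value) would let the caller tell the two cases apart; whether that is
the right fix for HDFS is not something this sketch tries to settle.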



