[jira] [Commented] (HDFS-12914) Block report leases cause missing blocks until next report

He Xiaoqiao (JIRA) Sat, 08 Jun 2019 09:53:24 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859264#comment-16859264
 ]


He Xiaoqiao commented on HDFS-12914:
------------------------------------

Thanks [~elgoiri],[~starphin] for your reviews.
To [~elgoiri],
{quote}checkBlockReportLease() could check for context == null at the beginning 
and return true there right away; then the final return would be just the 
checkLease().{quote}
IIUC, the following code logic is same as you said. Actually, I change it based 
on HadoopQA checkstyle suggestions.:)
{code:java}
+    return context == null || blockReportLeaseManager.checkLease(
+        node, startTime, context.getLeaseId());
{code}
{quote}When NameNodeRpcServer catches the UnregisteredNodeException we probably 
want to log that.{quote}
We could meet UnregisteredNodeException when invoke DatanodeManager#getDatanode 
which has logged before throw exception. Maybe we do not need duplicated logs. 
Please help to double check.
{quote}We could use a lambda for runBlockOp().{quote}
+1, update in [^HDFS-12914.007.patch].
{quote}User assertNotNull() instead of assertTrue(datanodeCommand != null); 
actually can we check for the actual command?{quote}
+1, update in [^HDFS-12914.007.patch].

To [~starphin],
Thanks for your unit test attach, I am sorry that I do not understand that 
clearly. 
{quote}Following codes bypass lease expiration checking logic by removing valid 
lease id. Better to keep it as it is in running time.{quote}
In my unit test, I just want to reprod the scenario [~smarella] mentioned. 
remove valid lease id analog to lease expire between process storages of one 
datanode. I think It's OK in unit test when remove it manually. If I am wrong 
please correct me.
{quote}Do we really need to response with a RegisterCommand.REGISTER command to 
client? It's a somewhat heavy command.{quote}
I think `client` you mentioned is Datanode, right? If one datanode not register 
and send some other RPC request to NameNode, it should get 
RegisterCommand.REGISTER and try to re-register. It seems a normal flow.
Thanks [~elgoiri] and [~starphin] again. [^HDFS-12914.007.patch] please take 
another kindly review.

> Block report leases cause missing blocks until next report
> ----------------------------------------------------------
>
>                 Key: HDFS-12914
>                 URL: https://issues.apache.org/jira/browse/HDFS-12914
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.8.0, 2.9.2
>            Reporter: Daryn Sharp
>            Assignee: Santosh Marella
>            Priority: Critical
>         Attachments: HDFS-12914-branch-2.001.patch, 
> HDFS-12914-trunk.00.patch, HDFS-12914-trunk.01.patch, HDFS-12914.005.patch, 
> HDFS-12914.006.patch, HDFS-12914.007.patch
>
>
> {{BlockReportLeaseManager#checkLease}} will reject FBRs from DNs for 
> conditions such as "unknown datanode", "not in pending set", "lease has 
> expired", wrong lease id, etc.  Lease rejection does not throw an exception.  
> It returns false which bubbles up to  {{NameNodeRpcServer#blockReport}} and 
> interpreted as {{noStaleStorages}}.
> A re-registering node whose FBR is rejected from an invalid lease becomes 
> active with _no blocks_.  A replication storm ensues possibly causing DNs to 
> temporarily go dead (HDFS-12645), leading to more FBR lease rejections on 
> re-registration.  The cluster will have many "missing blocks" until the DNs 
> next FBR is sent and/or forced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (HDFS-12914) Block report leases cause missing blocks until next report

Reply via email to