[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001367#comment-15001367
 ] 

Anu Engineer commented on HDFS-9239:
------------------------------------

[~cnauroth] Without alluding to anything that [~daryn] brought up, Thanks for 
the patch. It looks very good, some minor comments / questions below. 

1.  nit : {{BPofferService#Constructor}} 
Perhaps add a Precondition to make this relationship explicit.  something like 
Preconditions.checkState(lifelineNnAddrs.size() == nnAddrs.size()), since we 
access both lists with the same index ? 

2. More of a question for my own understanding :
In  {{BPOfferServiceActor#run}} we retry await operation on being interrupted. 
My question is when would it be safe to retry ? Wanted to understand if there 
were any scenarios where this can happen. Btw, I do see this as a good coding 
pattern.

{code}
while (shouldRun()) {
        try {
          initialRegistrationComplete.await();
          break;
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
{code}


3. nit: DataNodeManager.java - Javadoc - xmitsInProgress replaced by 
maxTransfers

4. In DataNodeManager.java : 
{code}
  synchronized (datanodeMap) {
        DatanodeDescriptor nodeinfo = getDatanode(nodeReg);
{code}
Two comments here :
  a. Just wondering if it makes sense to move synchronized(datanodeMap) into 
getDataNode.
 b. Did you intend to call {{heartbeatManager.updateLifeline}} inside the 
{{synchronized(datanodeMap)}} or just inside {{synchronized 
(heartbeatManager)}}. Do we need to keep the lock on datanodeMap while updating 
stats ? 

5. hdfs-default.xml : nit  : Comment :  since we rely on ratio as the default 
you might want to fix the comment which says default is 1.
  <property>
  <name>dfs.namenode.lifeline.handler.count</name>
  <value></value>
  <description>
    Sets number of RPC server threads the NameNode runs for handling the
    lifeline RPC server.  The default value is 1, because this RPC server
    handles only HA health check requests from ZKFC.  These are lightweight


> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> -----------------------------------------------------------------------------------
>
>                 Key: HDFS-9239
>                 URL: https://issues.apache.org/jira/browse/HDFS-9239
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>         Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to