[ 
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15009762#comment-15009762
 ] 

Chris Nauroth commented on HDFS-9239:
-------------------------------------

Thanks for the great reviews, everyone.  I'll respond to some of the feedback 
now and update the patch later.

bq. What you could do after the stat update is use tryLock for a short time. If 
you can't get the lock, oh well, this heartbeat response doesn't get any 
commands.

That's an interesting idea.  I'll explore this.
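
To make that concrete, here is the kind of thing I'd explore, as a minimal sketch only; the lock, timeout, and method names below are illustrative and not from the current patch:

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch (not the actual patch): if the lock guarding command
// generation cannot be acquired quickly, this heartbeat response simply
// carries no commands, and the next heartbeat tries again.
class HeartbeatCommandSketch {
  private final ReentrantLock commandLock = new ReentrantLock();

  Object[] maybeGenerateCommands() throws InterruptedException {
    if (!commandLock.tryLock(5, TimeUnit.MILLISECONDS)) {
      return null;  // oh well, no commands in this heartbeat response
    }
    try {
      return generateCommands();
    } finally {
      commandLock.unlock();
    }
  }

  private Object[] generateCommands() {
    return new Object[0];  // placeholder for the real command generation
  }
}
{code}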

bq. I'm not sure we need yet another RPC server for this purpose.

Just to make sure everyone is aware, the additional optional RPC server is 
effectively already there due to committing HDFS-9311, which specifically 
targeted the related problem of ZKFC health check messages getting blocked.  I 
agree that yet another RPC server is not ideal in terms of operational 
complexity, but I also saw it as the only viable option achievable in the 
short term within the 2.x line.

bq. In {{BPOfferServiceActor#run}} we retry await operation on being 
interrupted. My question is when would it be safe to retry ?

In practice, I expect it will never retry.  Suppose the thread enters {{await}} 
on the latch, but gets interrupted before the initial registration completes.  
The only thing that triggers thread interruption is shutdown of the whole 
DataNode (either the whole JVM process or a single DataNode inside a 
{{MiniDFSCluster}}).  That means that by the time this thread gets interrupted, 
internal flags have already been updated such that the {{shouldRun()}} call on 
the next iteration will return {{false}}.  {{run()}} would then return with no 
further action taken, and the thread would stop.

However, there is also no harm done if the {{await}} gets retried.  This is a 
daemon thread, so even if it keeps retrying, it won't block a normal JVM exit.
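
For reference, here is a rough, simplified sketch of the pattern I'm describing; the class and method names below are stand-ins, not the actual actor code, and the interval is illustrative:

{code:java}
import java.util.concurrent.CountDownLatch;

// Simplified sketch (not the actual patch): the sender thread waits on a
// latch for initial registration; an interrupt only matters because
// shouldRun() is expected to return false on the next iteration.
class LifelineSenderSketch implements Runnable {
  private final CountDownLatch initialRegistrationComplete = new CountDownLatch(1);
  private volatile boolean shouldRun = true;

  @Override
  public void run() {
    while (shouldRun()) {
      try {
        initialRegistrationComplete.await();  // wait for initial registration
        sendLifeline();
        Thread.sleep(3000);                   // illustrative send interval
      } catch (InterruptedException e) {
        // Interruption only happens at DataNode shutdown, so shouldRun()
        // should return false on the next iteration and the loop exits.
        // Even if the await is retried, this runs as a daemon thread, so
        // it will not block a normal JVM exit.
      }
    }
  }

  private boolean shouldRun() { return shouldRun; }

  private void sendLifeline() { /* placeholder for the lifeline RPC */ }
}
{code}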

bq. Just wondering if it makes sense to move synchronized(datanodeMap) into 
getDataNode.

This might be a good idea, but I'd prefer to treat it as a separate code 
improvement decoupled from the work here.  Right now, there are multiple points 
in the code that depend on specific lock ordering of {{heartbeatManager}} first 
followed by {{datanodeMap}} second to prevent deadlock.  The current code makes 
this ordering explicit with nested {{synchronized (X)}} blocks rather than 
leaving it implicit by declaring particular methods {{synchronized}}.  Also, 
HDFS-8966 is making a lot of changes to the locking here, so I expect changing 
it now would just create merge conflicts for that effort later.
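
To illustrate the ordering convention, here is a simplified sketch with stand-in names; it is not the real code around {{heartbeatManager}} and {{datanodeMap}}, just the shape of the nested blocks:

{code:java}
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the lock-ordering convention (stand-in names, not
// the real code): heartbeatManager is always locked before datanodeMap,
// and the nested synchronized blocks make that ordering visible at every
// call site.
class LockOrderingSketch {
  private final Object heartbeatManager = new Object();
  private final Map<String, Object> datanodeMap = new HashMap<>();

  void handleHeartbeat(String datanodeUuid) {
    synchronized (heartbeatManager) {      // lock #1: always taken first
      synchronized (datanodeMap) {         // lock #2: always taken second
        Object node = datanodeMap.get(datanodeUuid);
        if (node != null) {
          updateStats(node);               // stats updated under both locks
        }
      }
    }
  }

  private void updateStats(Object node) { /* placeholder */ }
}
{code}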

bq. Did you intend to call {{heartbeatManager.updateLifeline}} inside the 
{{synchronized(datanodeMap)}} or just inside {{synchronized 
(heartbeatManager)}}. Do we need to keep the lock on datanodeMap while updating 
stats ?

This locking was intentional.  If we do not hold the lock on {{datanodeMap}} 
during the get+update, then there is a risk that multiple heartbeats in flight 
for the same DataNode could cause a lost update, or even an inconsistency where 
the final state of the {{DatanodeDescriptor}} contains some stats from one 
heartbeat and other stats from another.  I have not observed 
excessive lock contention here during the issues that prompted me to file this 
JIRA, so I didn't try hard to optimize this locking away.  Some of the work in 
HDFS-8966 is likely to reduce the locking here anyway.
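
To make the lost-update scenario concrete, here is a minimal sketch; it is not the patch itself, and the outer {{heartbeatManager}} lock is omitted for brevity:

{code:java}
import java.util.HashMap;
import java.util.Map;

// Minimal illustration of the lost-update risk (not the actual patch): the
// descriptor lookup and the multi-field stats write happen under the same
// datanodeMap lock, so two concurrent reports for one DataNode cannot
// interleave and leave the descriptor with fields from different reports.
class LifelineUpdateSketch {
  static class Descriptor {
    long capacity;
    long remaining;
  }

  private final Map<String, Descriptor> datanodeMap = new HashMap<>();

  void updateStats(String datanodeUuid, long capacity, long remaining) {
    synchronized (datanodeMap) {
      Descriptor node = datanodeMap.get(datanodeUuid);
      if (node != null) {
        // Written together under the lock; without it, capacity could come
        // from one report and remaining from another.
        node.capacity = capacity;
        node.remaining = remaining;
      }
    }
  }
}
{code}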

bq. NN should still enforce a max number of skips and guarantee commands are 
sent in bounded time. Replication or block recovery is done through an 
asynchronous protocol, but oftentimes clients expect them to be done "soon".

Are you saying that beyond some skipping threshold, the heartbeat should still 
be considered a failure, eventually causing the DataNode to be marked stale and 
then dead?  I'm not sure how we'd set such a threshold, given that there is no 
SLA defined on these operations AFAIK.  I'd say there is still value in keeping 
a DataNode alive in these cases, for example so that it can continue serving 
read requests.

bq. It seems the introduction of a new RPC server is to work around the 
existing functionality of RPC which only support QoS based on user names.

Yes, that's partially correct.

I agree that introduction of another RPC server is in some sense a workaround.  
In fact, I would make the same argument for {{dfs.namenode.servicerpc-address}} 
in the first place too.  Lack of sophisticated QoS drove us to isolate 
operations to a separate RPC server.  ({{dfs.namenode.servicerpc-address}} also 
can have some side benefits in multi-homed environments that want to dedicate a 
whole separate network interface to certain operations.)
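
For context, this is roughly how that client/service split is expressed in {{hdfs-site.xml}} today; the host names and ports below are illustrative, not a recommendation:

{code:xml}
<!-- Illustrative example only: separate client-facing and service RPC
     endpoints on the NameNode. -->
<property>
  <name>dfs.namenode.rpc-address</name>
  <value>nn1.example.com:8020</value>
  <description>Client-facing RPC endpoint.</description>
</property>
<property>
  <name>dfs.namenode.servicerpc-address</name>
  <value>nn1.example.com:8021</value>
  <description>RPC endpoint used by DataNodes and other HDFS services.</description>
</property>
{code}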

I think a separate RPC server is a viable short-to-mid-term solution in the 2.x 
line.  Longer term, I'd prefer that we evolve towards more sophisticated QoS 
that prioritizes critical "control plane" activity like heartbeats.  However, 
this goes deeper than just QoS at the RPC layer, because we have observed 
contention on the namesystem lock as a contributing factor.  This implies that 
a full solution running on just one RPC server requires some kind of call 
pre-emption capability, or perhaps a cooperative multi-tasking approach in 
which operations can yield the lock.  This of course would violate 
a bunch of assumptions in the NameNode code, which is why I think it's 
infeasible in 2.x.


> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode 
> liveness
> -----------------------------------------------------------------------------------
>
>                 Key: HDFS-9239
>                 URL: https://issues.apache.org/jira/browse/HDFS-9239
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>         Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline 
> Protocol.  This is an RPC protocol that is responsible for reporting liveness 
> and basic health information about a DataNode to a NameNode.  Compared to the 
> existing heartbeat messages, it is lightweight and not prone to resource 
> contention problems that can harm accurate tracking of DataNode liveness 
> currently.  The attached design document contains more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
