[
https://issues.apache.org/jira/browse/HDFS-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15009762#comment-15009762
]
Chris Nauroth commented on HDFS-9239:
-------------------------------------
Thanks for the great reviews, everyone. I'll respond to some of the feedback
now and update the patch later.
bq. What you could do after the stat update is use tryLock for a short time. If
you can't get the lock, oh well, this heartbeat response doesn't get any
commands.
That's an interesting idea. I'll explore this.
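Just to confirm I've understood the suggestion, here is a rough sketch of the
shape I have in mind. The names below are hypothetical, not the actual
NameNode/FSNamesystem APIs:
{code:java}
// Rough sketch only, with hypothetical names.  Apply the stats
// unconditionally, then try briefly to take the lock needed for command
// generation.  If the lock is busy, this heartbeat response carries no
// commands.
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class HeartbeatTryLockSketch {
  // Stand-in for the real DatanodeCommand type.
  static class DatanodeCommand {}

  private static final DatanodeCommand[] NO_COMMANDS = new DatanodeCommand[0];

  private final ReentrantLock writeLock = new ReentrantLock();

  DatanodeCommand[] handleHeartbeat() throws InterruptedException {
    updateStats();  // the stat update always happens
    if (!writeLock.tryLock(5, TimeUnit.MILLISECONDS)) {
      return NO_COMMANDS;  // oh well, no commands in this response
    }
    try {
      return computeCommands();
    } finally {
      writeLock.unlock();
    }
  }

  private void updateStats() { /* update DataNode stats */ }

  private DatanodeCommand[] computeCommands() { return NO_COMMANDS; }
}
{code}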
bq. I'm not sure we need yet another RPC server for this purpose.
Just to make sure everyone is aware, the additional optional RPC server is
effectively already there due to committing HDFS-9311, which specifically
targeted the related problem of ZKFC health check messages getting blocked. I
agree that yet another RPC server is not ideal in terms of operational
complexity, but I also saw it as the only viable short-term option achievable
in the 2.x line.
bq. In {{BPOfferServiceActor#run}} we retry the await operation on being
interrupted. My question is when would it be safe to retry?
In practice, I expect it never will retry. Suppose the thread enters {{await}}
on the latch, but gets interrupted before the initial registration completes.
The only thing that triggers thread interruption is shutdown of the whole
DataNode (either the whole JVM process or a single DataNode inside a
{{MiniDFSCluster}}). That means that by the time this thread gets interrupted,
internal flags have already been updated such that the {{shouldRun()}} call on
the next iteration will return {{false}}. {{run()}} would then return with no
further action taken, and the thread would stop.
However, there is also no harm done if the {{await}} gets retried. This is a
daemon thread, so even if it keeps retrying, it won't block a normal JVM exit.
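A stripped-down sketch of the structure (not the actual DataNode code) may
help illustrate why the retry is benign:
{code:java}
// Simplified sketch, not the actual DataNode code: shutdown flips the
// running flag before interrupting the thread, so a retried await is
// harmless because shouldRun() returns false on the next loop iteration.
import java.util.concurrent.CountDownLatch;

class LifelineThreadSketch implements Runnable {
  private final CountDownLatch initialRegistrationComplete = new CountDownLatch(1);
  private volatile boolean running = true;

  private boolean shouldRun() {
    return running;
  }

  void shutdown(Thread worker) {
    running = false;     // internal flag is updated first...
    worker.interrupt();  // ...then the thread is interrupted
  }

  @Override
  public void run() {
    while (shouldRun()) {
      try {
        initialRegistrationComplete.await();  // wait for initial registration
        break;
      } catch (InterruptedException e) {
        // Retry the await; if this was a shutdown, shouldRun() is now false
        // and the loop exits instead.
      }
    }
    // ... main loop, also guarded by shouldRun() ...
  }
}
{code}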
bq. Just wondering if it makes sense to move synchronized(datanodeMap) into
getDataNode.
This might be a good idea, but I'd prefer to treat it as a separate code
improvement decoupled from the work here. Right now, there are multiple points
in the code that depend on specific lock ordering of {{heartbeatManager}} first
followed by {{datanodeMap}} second to prevent deadlock. The current code makes
this explicit with nested {{synchronized (X)}} blocks instead of implicit by
declaring particular methods {{synchronized}}. Also, HDFS-8966 is making a lot
of changes to the locking here, so I expect changing it now would just create
merge conflicts for that effort later.
bq. Did you intend to call {{heartbeatManager.updateLifeline}} inside the
{{synchronized(datanodeMap)}} or just inside {{synchronized
(heartbeatManager)}}? Do we need to keep the lock on datanodeMap while updating
stats?
This locking was intentional. If we do not hold the lock on {{datanodeMap}}
during the get+update, then there is a risk that multiple heartbeats in flight
for the same DataNode could cause a lost update, or even an inconsistency where
the final state of the {{DatanodeDescriptor}} ends up containing some stats
from one heartbeat and other stats from another. I have not observed
excessive lock contention here during the issues that prompted me to file this
JIRA, so I didn't try hard to optimize this locking away. Some of the work in
HDFS-8966 is likely to reduce the locking here anyway.
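To illustrate the intent, here is a stripped-down sketch with stub types (not
the real DatanodeManager code), showing the get and the update held under both
locks, taken in the established order:
{code:java}
// Stripped-down sketch with stub types.  heartbeatManager is always locked
// before datanodeMap (the established ordering), and the get + update stay
// inside the datanodeMap lock so two concurrent lifelines for the same
// DataNode cannot interleave and lose or mix their stats.
import java.util.HashMap;
import java.util.Map;

class LifelineUpdateSketch {
  static class DatanodeDescriptor { long lastUpdate; long capacity; }

  static class HeartbeatManager {
    void updateLifeline(DatanodeDescriptor node, long capacity, long now) {
      node.capacity = capacity;
      node.lastUpdate = now;
    }
  }

  private final HeartbeatManager heartbeatManager = new HeartbeatManager();
  private final Map<String, DatanodeDescriptor> datanodeMap = new HashMap<>();

  void handleLifeline(String datanodeUuid, long capacity) {
    synchronized (heartbeatManager) {   // lock ordering: heartbeatManager first
      synchronized (datanodeMap) {      // ...then datanodeMap
        DatanodeDescriptor node = datanodeMap.get(datanodeUuid);
        if (node == null) {
          return;  // unknown or unregistered DataNode; nothing to update
        }
        heartbeatManager.updateLifeline(node, capacity,
            System.currentTimeMillis());
      }
    }
  }
}
{code}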
bq. NN should still enforce a max number of skips and guarantee commands are
sent in bounded time. Replication or block recovery is done through an
asynchronous protocol, but oftentimes clients expect them to be done "soon".
Are you saying that beyond some skipping threshold, the heartbeat should still
be considered a failure, eventually causing the DataNode to be marked stale and
then dead? I'm not sure how we'd set such a threshold, given that there is no
SLA defined on these operations AFAIK. I'd say there is still value in keeping
a DataNode alive in these cases, such as for serving reader activity.
bq. It seems the introduction of a new RPC server is to work around the
existing functionality of RPC which only support QoS based on user names.
Yes, that's partially correct.
I agree that introduction of another RPC server is in some sense a workaround.
In fact, I would make the same argument for {{dfs.namenode.servicerpc-address}}
in the first place too. Lack of sophisticated QoS drove us to isolate service
operations onto a separate RPC server. ({{dfs.namenode.servicerpc-address}} also
can have some side benefits in multi-homed environments that want to dedicate a
whole separate network interface to certain operations.)
I think a separate RPC server is a viable short-to-mid-term solution in the 2.x
line. Longer term, I'd prefer that we evolve towards more sophisticated QoS
that prioritizes critical "control plane" activity like heartbeats. However,
this goes deeper than just QoS at the RPC layer, because we have observed
contention on the namesystem lock as a contributing factor. This implies that
a full solution, running on just one RPC server, requires some kind of call
pre-emption capability, or maybe a cooperative multi-tasking approach with
operations having the ability to yield the lock. This of course would violate
a bunch of assumptions in the NameNode code, which is why I think it's
infeasible in 2.x.
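To give a sense of what I mean by cooperative multi-tasking, here is a purely
hypothetical sketch; nothing like this exists in the NameNode today:
{code:java}
// Purely hypothetical sketch of cooperative lock yielding.  A long-running
// operation processes work in batches and releases the lock between batches
// so that latency-sensitive calls such as heartbeats can get through.
import java.util.Iterator;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class YieldingOperationSketch {
  // Fair lock so waiting heartbeat handlers actually win after a yield.
  private final ReentrantReadWriteLock namesystemLock =
      new ReentrantReadWriteLock(true);

  void runLongOperation(Iterator<Runnable> workItems) {
    final int itemsPerLockHold = 1000;  // assumed tuning knob, not a real config
    while (workItems.hasNext()) {
      namesystemLock.writeLock().lock();
      try {
        for (int i = 0; i < itemsPerLockHold && workItems.hasNext(); i++) {
          workItems.next().run();
        }
      } finally {
        // Yield: drop the lock so other waiters can run before the next batch.
        namesystemLock.writeLock().unlock();
      }
    }
  }
}
{code}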
> DataNode Lifeline Protocol: an alternative protocol for reporting DataNode
> liveness
> -----------------------------------------------------------------------------------
>
> Key: HDFS-9239
> URL: https://issues.apache.org/jira/browse/HDFS-9239
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode, namenode
> Reporter: Chris Nauroth
> Assignee: Chris Nauroth
> Attachments: DataNode-Lifeline-Protocol.pdf, HDFS-9239.001.patch
>
>
> This issue proposes introduction of a new feature: the DataNode Lifeline
> Protocol. This is an RPC protocol that is responsible for reporting liveness
> and basic health information about a DataNode to a NameNode. Compared to the
> existing heartbeat messages, it is lightweight and not prone to resource
> contention problems that can harm accurate tracking of DataNode liveness
> currently. The attached design document contains more details.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)