Uma Maheswara Rao G created HDFS-17255:
------------------------------------------
Summary: There should be mechanism between client and NN to
eliminate stale nodes from current pipeline sooner.
Key: HDFS-17255
URL: https://issues.apache.org/jira/browse/HDFS-17255
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Uma Maheswara Rao G
In one user's cluster, they hit an issue similar to HDFS-2891. The client
always sees the first node as failed, even though the 2nd node is the
problematic one (it times out because it was pulled off the network). When a
pipeline failure happens, the client asks for a new node and uses it to replace
the node it suspects. But the actual bad node is still in the pipeline, because
the client flagged the wrong node (actually a good one) as bad. So the pipeline
failure continues until random shuffling happens to expose the real bad node.
The NN had actually detected the real bad node as stale. But pipeline
reconstruction only considers the node the client reported as failed, and only
that node is replaced with a new one.
I don't have a great solution in hand, but we can discuss. I think it may be a
good idea for the client to pass all current pipeline nodes to the NN for a
recheck on the first pipeline failure. The NN can then hint back to the client
which other nodes are not good and provide additional backup replacement nodes
in a single call. It looks like over-designing to me, but I don't have any
better ideas in mind. Changing the protocol API is painful due to compatibility
problems and the testing needed.
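
To make the idea above concrete for discussion, here is a minimal sketch of
what a recheck reply could compute. It is not existing HDFS code: the class
PipelineRecheckSketch, the RecheckReply type, the recheck method and the
datanode names are all hypothetical, and datanodes are modelled as plain
strings. The policy it assumes (trust the NN's stale view when it flags a
pipeline node, otherwise fall back to the client's suspect) is only one
possible choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class PipelineRecheckSketch {

    /** Hypothetical reply: nodes to drop, plus replacements found for them. */
    static final class RecheckReply {
        final List<String> nodesToDrop;
        final List<String> replacements;

        RecheckReply(List<String> nodesToDrop, List<String> replacements) {
            this.nodesToDrop = nodesToDrop;
            this.replacements = replacements;
        }
    }

    /**
     * Hypothetical NN-side check. The client sends the whole pipeline along
     * with the node it suspects; the NN answers in a single call.
     */
    static RecheckReply recheck(List<String> pipeline, String clientSuspect,
                                Set<String> staleOnNameNode, List<String> spares) {
        // Collect every pipeline node that the NN already considers stale.
        Set<String> drop = new LinkedHashSet<>();
        for (String node : pipeline) {
            if (staleOnNameNode.contains(node)) {
                drop.add(node);
            }
        }
        // Assumed policy: with no stale hint, keep today's behaviour and
        // replace only the node the client reported.
        if (drop.isEmpty() && pipeline.contains(clientSuspect)) {
            drop.add(clientSuspect);
        }
        // Hand out up to one healthy spare per dropped node.
        List<String> replacements = new ArrayList<>();
        Iterator<String> it = spares.iterator();
        while (replacements.size() < drop.size() && it.hasNext()) {
            String candidate = it.next();
            if (!pipeline.contains(candidate) && !staleOnNameNode.contains(candidate)) {
                replacements.add(candidate);
            }
        }
        return new RecheckReply(new ArrayList<>(drop), replacements);
    }

    public static void main(String[] args) {
        // The client blames dn1, but the NN has marked dn2 as stale.
        RecheckReply reply = recheck(
                Arrays.asList("dn1", "dn2", "dn3"), "dn1",
                Collections.singleton("dn2"), Arrays.asList("dn4", "dn5"));
        System.out.println("drop=" + reply.nodesToDrop
                + " replace=" + reply.replacements);
    }
}
```

With the inputs in main, the sketch prints drop=[dn2] replace=[dn4]: the good
node dn1 stays in the pipeline and the stale node is replaced on the first
failure. A real change would still need a new or extended ClientProtocol call,
which is where the compatibility cost mentioned above comes in.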