Marton Elek created HDDS-2679:
---------------------------------
Summary: Ratis ring creation might be failed with async pipeline
creation
Key: HDDS-2679
URL: https://issues.apache.org/jira/browse/HDDS-2679
Project: Hadoop Distributed Data Store
Issue Type: Improvement
Components: Ozone Datanode, SCM
Reporter: Marton Elek
The problem was introduced with async pipeline creation:
# Let's say SCM gets registrations from three datanodes.
# A Ratis/THREE pipeline is created on SCM.
# With its next heartbeat (HB), Datanode1 (DN1) receives the CreatePipeline
command.
# DN1 starts the Ratis server, which tries to get votes from DN2 and DN3.
# If DN2 has not yet received the CreatePipeline command (which is quite
likely with a 30 sec HB interval), it refuses to vote for DN1.
# DN1 requests a pipeline close from SCM, as there are no votes from DN2 and
DN3.
# The pipeline is closed on the SCM side, but in the meantime DN2 (finally)
receives the pipeline creation command and tries to get votes, while DN1
already has a newer group/pipeline id.
# And so on.
If we are lucky, after a while all DNs will receive the creation command at
more or less the same time; if not, SCM cannot create an Open Ratis pipeline.
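The race in the steps above can be illustrated with a small model. This is only a sketch under stated assumptions, not Ozone or Ratis code: the function name, the `vote_timeout_s` parameter, and the rule that a peer can vote only after its own heartbeat has delivered CreatePipeline are all hypothetical simplifications of the behaviour described in this issue.

```python
def first_attempt_succeeds(command_arrival_s, vote_timeout_s):
    """Return True if the first datanode to receive CreatePipeline can
    collect a majority of votes before it gives up.

    command_arrival_s: seconds (after pipeline creation on SCM) at which
        each datanode's next heartbeat delivers the CreatePipeline command.
    vote_timeout_s: hypothetical time the first datanode waits for votes
        before asking SCM to close the pipeline.
    """
    arrivals = sorted(command_arrival_s)
    deadline = arrivals[0] + vote_timeout_s

    # The first datanode votes for itself. A peer is assumed to vote only
    # if it has already received CreatePipeline by the deadline.
    votes = 1 + sum(1 for t in arrivals[1:] if t <= deadline)

    majority = len(arrivals) // 2 + 1
    return votes >= majority


# Heartbeats spread over a 30 sec interval: DN1 gives up before any peer
# knows about the pipeline.
spread_out = first_attempt_succeeds([1.0, 12.0, 25.0], vote_timeout_s=5.0)

# Two heartbeats happen to land close together: the attempt can succeed.
close_together = first_attempt_succeeds([1.0, 3.0, 25.0], vote_timeout_s=5.0)
```

In this model the outcome depends only on how the heartbeat times of the three datanodes happen to line up relative to the wait time, which is the "if we are lucky" condition described above.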
Possible solutions:
* At the very beginning, a datanode can trust its peers and learn the group id
from them (but this doesn't cover the case when one pipeline has been closed on
DN1 *and* a new pipeline has been created while DN2 still has the old
pipeline).
* We can use bidirectional gRPC streaming for datanode-SCM communication
(which is a good idea anyway, as it makes the communication faster). The
problem is still there, however, if there is a network blip between SCM and
DN1.
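The first option and its limitation can be sketched as follows. This is a hypothetical toy model, not the Ozone datanode implementation: the `Datanode` class, its methods, and the `trust_peers` flag are invented for illustration, and each datanode is assumed to track a single group id.

```python
class Datanode:
    """Toy datanode that knows at most one Ratis group id (assumption)."""

    def __init__(self, name):
        self.name = name
        self.group_id = None  # no pipeline known yet

    def on_create_pipeline(self, group_id):
        # Called when the CreatePipeline command arrives with a heartbeat.
        self.group_id = group_id

    def on_vote_request(self, group_id, trust_peers):
        """Return True if this datanode grants its vote for group_id."""
        if self.group_id == group_id:
            return True
        if trust_peers and self.group_id is None:
            # Learn the group id from the peer instead of refusing.
            self.group_id = group_id
            return True
        # Unknown group, or this node still holds an older pipeline.
        return False


fresh = Datanode("DN2")
refused = fresh.on_vote_request("group-1", trust_peers=False)
granted = fresh.on_vote_request("group-1", trust_peers=True)

# The uncovered case: DN2 still has the old pipeline when the new one arrives.
stale = Datanode("DN2")
stale.on_create_pipeline("old-group")
still_refused = stale.on_vote_request("new-group", trust_peers=True)
```

The last case shows why trusting peers only helps at the very beginning: a node that still holds an old group id has no safe way, in this model, to tell a genuinely new pipeline from a stale request.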