[
https://issues.apache.org/jira/browse/HDDS-5525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18059966#comment-18059966
]
Ilya commented on HDDS-5525:
----------------------------
I would like to ask a question about this issue, clarify my understanding,
and get feedback in case I am making a mistake somewhere.
1) As far as I understand, the statement "In HDDS-5513, a race condition
occurred because snapshot installation was occurring before the main
DatanodeStateMachine loop" is incorrect. ContainerStateMachine#loadSnapshot is
only a side effect of an incorrect initialization chain in Ratis 2.1.0, in
which the creation of OzoneContainer in the DatanodeStateMachine constructor
led to XceiverServerRatis#notifyGroupAdd() and, as a result, to
ContainerStateMachine#initialize (which is where loadSnapshot is performed). In
other words, the exception from HDDS-5513 arose precisely because of Ratis,
whose code was later fixed in RATIS-1465 (commit 53a3eaa).
The exception from HDDS-5513 is easy to reproduce if you:
   a) register an action for at least one LayoutFeature, so that execution
      enters BasicUpgradeFinalizer#runFirstUpgradeAction;
   b) pause the thread inside runFirstUpgradeAction;
   c) add a small delay to ContainerStateMachine#initialize before
      XceiverServerRatis#notifyGroupAdd(), so that the DatanodeStateMachine
      thread has time to reach the pause point.
(This is no longer possible after updating Ratis to include the fix referenced
above.)
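To make the interleaving in the reproduction steps concrete, here is a minimal Java sketch. It is a toy model under my own assumptions, not Ozone or Ratis code: the class PreFinalizeRaceSketch, the method snapshotOverlapsAction, and the flags and latches are all hypothetical stand-ins. One thread plays the DatanodeStateMachine thread paused inside the first upgrade action, and the calling thread plays the initialization path where loadSnapshot runs. The "ordered" branch is included only for contrast and does not describe how RATIS-1465 changed Ratis.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model only: none of these names are Ozone or Ratis APIs.
final class PreFinalizeRaceSketch {

  /**
   * Returns true if the simulated snapshot load ran while the simulated
   * pre-finalize action was still in progress.
   */
  static boolean snapshotOverlapsAction(boolean unorderedInit) {
    AtomicBoolean actionRunning = new AtomicBoolean(false);
    AtomicBoolean overlapSeen = new AtomicBoolean(false);
    CountDownLatch actionEntered = new CountDownLatch(1);
    CountDownLatch releaseAction = new CountDownLatch(1);

    // Stand-in for the datanode thread inside the first upgrade action.
    Thread datanodeThread = new Thread(() -> {
      actionRunning.set(true);
      actionEntered.countDown();   // the thread is now "paused" in the action
      awaitLatch(releaseAction);
      actionRunning.set(false);
    });

    // Stand-in for the initialization step that loads the snapshot.
    Runnable loadSnapshot = () -> {
      if (actionRunning.get()) {
        overlapSeen.set(true);
      }
    };

    datanodeThread.start();
    // Plays the role of the "small delay": wait until the action is entered.
    awaitLatch(actionEntered);

    if (unorderedInit) {
      // Initialization is not ordered with respect to the action.
      loadSnapshot.run();
      releaseAction.countDown();
      joinThread(datanodeThread);
    } else {
      // Hypothetical ordering: the action completes before initialization.
      releaseAction.countDown();
      joinThread(datanodeThread);
      loadSnapshot.run();
    }
    return overlapSeen.get();
  }

  private static void awaitLatch(CountDownLatch latch) {
    try {
      if (!latch.await(5, TimeUnit.SECONDS)) {
        throw new IllegalStateException("timed out waiting for latch");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  private static void joinThread(Thread thread) {
    try {
      thread.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("unordered init overlaps: " + snapshotOverlapsAction(true));
    System.out.println("ordered init overlaps:   " + snapshotOverlapsAction(false));
  }
}
```

In this model the overlap is observed only in the unordered case. The latches make the interleaving deterministic, much as the pause point and the artificial delay do in the manual reproduction.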
2) On the other hand, if you remove the HDDS-5513 changes (commit d405ebf),
then, viewed not in terms of the consistency of the state machine but in terms
of the public DatanodeStateMachine#triggerHeartbeat method, this exception can
still be reproduced in a test environment, even with the fixed Ratis. This
means that the HDDS-5513 fix (commit d405ebf) is still warranted.
3) Regarding "but this will pose a problem if we need to run pre-finalize
actions involving container data in the future": do I understand correctly
that interaction with container data means the same interaction via
triggerHeartbeat()?
If so, then the problem is clear but the solution is not, because
pre-finalize actions run before the main DatanodeStateMachine loop and the
execution of context.execute() (as mentioned earlier).
If we are talking about loadSnapshot, then, since it has been shown not to be
involved in the data race (I repeat that the cause was in Ratis), the
problem becomes even more difficult to imagine.
I would appreciate an example, because at the moment it is not entirely clear
what exactly is meant, given that loadSnapshot is unrelated to the data race,
as I explained in point (1).
If I'm wrong about something, correct me.
> Datanode snapshot can be installed while pre-finalize actions are running
> -------------------------------------------------------------------------
>
> Key: HDDS-5525
> URL: https://issues.apache.org/jira/browse/HDDS-5525
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Ethan Rose
> Priority: Major
>
> In HDDS-5513, a race condition occurred because snapshot installation was
> occurring before the main DatanodeStateMachine loop, and therefore occurring
> while pre-finalize actions could be running. In that Jira a workaround was
> implemented to unblock upgrades, but this will pose a problem if we need to
> run pre-finalize actions involving container data in the future.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)