Stephen O'Donnell created HDDS-5330:
---------------------------------------
Summary: Datanode commands are not always invalidated when the SCM
leader switches
Key: HDDS-5330
URL: https://issues.apache.org/jira/browse/HDDS-5330
Project: Apache Ozone
Issue Type: Bug
Reporter: Stephen O'Donnell
A datanode should only process commands from the SCM which is the leader.
In StateContext.getNextCommand(), there is logic to update the current leader
SCM term for each command seen on the DN. It picks the command, and then updates
the term based on the term stored in the command:
{code}
updateTermOfLeaderSCM(command);
if (command.getTerm() == termOfLeaderSCM.get()) {
  return command;
}
{code}
There are a few problems here:
1) If there are commands in the queue with a newer term, then the term stored
in the DN will not be updated until all the pending commands have been
processed. An SCM leader switch can therefore already have happened while the
stale commands continue to be processed.
2) While there is a single command queue, there are further sub-queues. For
example, DeleteContainerCommandHandler places the commands into an executor
queue. The same applies to ReplicateContainerCommandHandler. These queues could
be quite large, and hence stale commands could be processed from them.
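
Problem (1) can be illustrated with a small model. This is only a sketch, not
the Ozone code: DequeueTermQueue is a hypothetical class, commands are reduced
to their term, and it assumes updateTermOfLeaderSCM() keeps the highest term
seen so far.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical model of a queue that only learns the leader term on dequeue.
final class DequeueTermQueue {
  private long termOfLeaderSCM = 0;
  // Each queued command is represented by its term only.
  private final Queue<Long> commandTerms = new ArrayDeque<>();

  void addCommand(long term) {
    // The term is NOT inspected here, so a newer term goes unnoticed.
    commandTerms.add(term);
  }

  // Returns the term of the next accepted command, or null if the queue is empty.
  Long getNextCommand() {
    Long term;
    while ((term = commandTerms.poll()) != null) {
      // Assumed behaviour of updateTermOfLeaderSCM: keep the highest term seen.
      termOfLeaderSCM = Math.max(termOfLeaderSCM, term);
      if (term == termOfLeaderSCM) {
        return term;
      }
    }
    return null;
  }
}
```

With two term-1 commands queued ahead of a term-2 command, this model hands out
both term-1 commands before the DN ever learns about term 2.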
For (1), I believe the term should be updated when the commands are enqueued,
not dequeued. That would ensure new commands update the term and invalidate the
old commands immediately.
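
A minimal sketch of what updating the term on enqueue could look like. The
names TermedCommand, TermAwareQueue and addCommand are hypothetical and chosen
for illustration; the sketch also assumes the term only ever moves forward.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for an SCM command; only the term matters here.
final class TermedCommand {
  private final String name;
  private final long term;

  TermedCommand(String name, long term) {
    this.name = name;
    this.term = term;
  }

  String getName() { return name; }
  long getTerm() { return term; }
}

// Hypothetical queue that raises the known leader term when a command arrives.
final class TermAwareQueue {
  private final AtomicLong termOfLeaderSCM = new AtomicLong(0);
  private final Queue<TermedCommand> queue = new ArrayDeque<>();

  synchronized void addCommand(TermedCommand command) {
    // Raise the term as soon as a newer one is seen; never lower it.
    termOfLeaderSCM.accumulateAndGet(command.getTerm(), Math::max);
    queue.add(command);
  }

  // Returns the next command from the current leader term, or null if none remain.
  synchronized TermedCommand getNextCommand() {
    TermedCommand command;
    while ((command = queue.poll()) != null) {
      if (command.getTerm() == termOfLeaderSCM.get()) {
        return command;
      }
      // Otherwise the command was issued under an older term and is dropped.
    }
    return null;
  }

  long getTermOfLeaderSCM() { return termOfLeaderSCM.get(); }
}
```

In this sketch, enqueueing a single term-2 command behind two term-1 commands
is enough for getNextCommand() to skip both older commands.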
For (2), we should check the DN term prior to executing the command and drop
the command if it is no longer valid.
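
A possible shape for the check in (2), again only as a sketch.
TermCheckingHandler and its methods are hypothetical and are not the real
handler classes; the executor is injected so that the moment of execution can
be controlled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical handler that re-checks the leader term right before executing.
final class TermCheckingHandler {
  private final AtomicLong termOfLeaderSCM;
  private final Executor executor;
  private final List<String> executed = new ArrayList<>();

  TermCheckingHandler(AtomicLong termOfLeaderSCM, Executor executor) {
    this.termOfLeaderSCM = termOfLeaderSCM;
    this.executor = executor;
  }

  void handle(String commandName, long commandTerm) {
    executor.execute(() -> {
      // The check runs at execution time, not when the task was submitted.
      if (commandTerm != termOfLeaderSCM.get()) {
        return; // Stale command from an older term: drop it.
      }
      synchronized (executed) {
        executed.add(commandName);
      }
    });
  }

  // Returns a copy of the names of the commands that actually ran.
  List<String> getExecuted() {
    synchronized (executed) {
      return new ArrayList<>(executed);
    }
  }
}
```

Because the comparison happens inside the task, a command that was queued under
term 1 is dropped if the shared term has moved to 2 by the time the executor
reaches it.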