sodonnel commented on code in PR #4618:
URL: https://github.com/apache/ozone/pull/4618#discussion_r1177521204
##########
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/StateContext.java:
##########
@@ -795,6 +800,11 @@ public SCMCommand getNextCommand() {
   public void addCommand(SCMCommand command) {
     lock.lock();
     try {
+      if (commandQueue.size() > maxCommandQueueLimit) {
Review Comment:
I'm not sure about this check, since we already have limit checks in other
places too. E.g. we could have 5000 replications queued on the replication
supervisor's own queue and then another 5000 here, so it's a little confusing.
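To make the overlap concrete, here is a rough sketch (not the code in this PR
-- `commandQueue` and `maxCommandQueueLimit` come from the diff above, the
numbers and the supervisor backlog are made up) of how a single cap on the
StateContext queue can under-count the real backlog:
```java
// Sketch only -- illustrates the concern, not the actual change in this PR.
public void addCommand(SCMCommand command) {
  lock.lock();
  try {
    // Suppose the replication supervisor already holds 5000 replications on
    // its own internal queue, and commandQueue holds another 5000 commands.
    // This check only sees the second 5000, so the effective backlog for a
    // single command type can be roughly double whatever limit is set here.
    if (commandQueue.size() > maxCommandQueueLimit) {
      return; // drop the new command (exact handling is up to the PR)
    }
    commandQueue.add(command);
  } finally {
    lock.unlock();
  }
}
```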
The command queue implementation is single threaded and processes each
command in turn. These are the handlers it currently has:
```
    .addHandler(new CloseContainerCommandHandler())               // Blocks
    .addHandler(new DeleteBlocksCommandHandler(getContainer(),    // async
        conf, dnConf.getBlockDeleteThreads(),
        dnConf.getBlockDeleteQueueLimit()))
    .addHandler(new ReplicateContainerCommandHandler(conf, supervisor,  // async
        pullReplicatorWithMetrics, pushReplicatorWithMetrics))
    .addHandler(reconstructECContainersCommandHandler)            // async
    .addHandler(new DeleteContainerCommandHandler(                // async
        dnConf.getContainerDeleteThreads(), clock))
    .addHandler(new ClosePipelineCommandHandler())                // blocks
    .addHandler(new CreatePipelineCommandHandler(conf))           // blocks
    .addHandler(new SetNodeOperationalStateCommandHandler(conf))  // blocks
    .addHandler(new FinalizeNewLayoutVersionCommandHandler())     // blocks
    .addHandler(new RefreshVolumeUsageCommandHandler())           // blocks
```
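For reference, this is roughly the shape of the dispatch loop I mean (a
paraphrased sketch, not copied from `CommandDispatcher` /
`DatanodeStateMachine`; the method name and handler map are assumptions):
```java
// Illustrative sketch only -- assumed shape of the single-threaded dispatch,
// not the actual Ozone code. One thread drains the queue and calls each
// handler inline, so a blocking handle() stalls every command behind it.
private void dispatchQueuedCommands(StateContext context,
    Map<SCMCommandProto.Type, CommandHandler> handlerMap,
    OzoneContainer container, SCMConnectionManager connectionManager) {
  SCMCommand command;
  while ((command = context.getNextCommand()) != null) {
    CommandHandler handler = handlerMap.get(command.getType());
    // The "async" handlers above just enqueue work on their own executor and
    // return quickly; the "blocks" handlers do the work here, on this thread,
    // so a slow one delays everything queued after it.
    handler.handle(command, container, context, connectionManager);
  }
}
```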
So quite a few of the commands block processing of other commands on the queue
while they run. Most of these are probably very fast. The
`RefreshVolumeUsageCommandHandler` worries me though. I am not sure what it
does or how often it gets called, but if it calls `du` it could be slow.
Perhaps it should be async. Could the close and create pipeline commands
potentially be slow too, if there are issues with the Ratis election or with
the pipeline when it is being closed?
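If `RefreshVolumeUsageCommandHandler` does turn out to be slow, one option
would be to make it async like the others. A rough sketch only -- the executor
field and the refresh call are assumptions, not the current implementation:
```java
// Sketch only: hand the (possibly slow) refresh off to a worker thread so
// handle() returns immediately and the dispatch thread is never blocked.

// inside RefreshVolumeUsageCommandHandler:
private final ExecutorService refreshExecutor =
    Executors.newSingleThreadExecutor();   // hypothetical single worker

@Override
public void handle(SCMCommand command, OzoneContainer container,
    StateContext context, SCMConnectionManager connectionManager) {
  refreshExecutor.submit(() -> {
    // assumed call -- whatever the handler currently does inline would move here
    container.getVolumeSet().refreshAllVolumeUsage();
  });
}
```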
When you saw problems with the DNs having a lot of queued commands, where
were the commands queued? On the replication supervisor / delete block queues,
or was the command queue itself also backed up with a lot of messages,
indicating either that its thread could not move the messages onto the
sub-queues fast enough, or that some blocking command was taking a long time
to run?