sodonnel commented on code in PR #4618:
URL: https://github.com/apache/ozone/pull/4618#discussion_r1177521204


##########
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/StateContext.java:
##########
@@ -795,6 +800,11 @@ public SCMCommand getNextCommand() {
   public void addCommand(SCMCommand command) {
     lock.lock();
     try {
+      if (commandQueue.size() > maxCommandQueueLimit) {

Review Comment:
   I'm not sure about this check, when we have limit checks in other places 
too. Eg we could have 5000 replications queued on its queue and then another 
5000 here, so its a little confusing.
   
   The command queue implementation is single threaded, and processes each 
command in turn. These are the list of handlers it currently has:
   
   ```
           .addHandler(new CloseContainerCommandHandler())  // Blocks
           .addHandler(new DeleteBlocksCommandHandler(getContainer(),  // async
               conf, dnConf.getBlockDeleteThreads(),
               dnConf.getBlockDeleteQueueLimit()))
           .addHandler(new ReplicateContainerCommandHandler(conf, supervisor, 
// async
               pullReplicatorWithMetrics, pushReplicatorWithMetrics))
           .addHandler(reconstructECContainersCommandHandler)  // async
           .addHandler(new DeleteContainerCommandHandler(         // async
               dnConf.getContainerDeleteThreads(), clock))
           .addHandler(new ClosePipelineCommandHandler())  // blocks
           .addHandler(new CreatePipelineCommandHandler(conf))  // blocks
           .addHandler(new SetNodeOperationalStateCommandHandler(conf)) // 
blocks
           .addHandler(new FinalizeNewLayoutVersionCommandHandler()) // blocks
           .addHandler(new RefreshVolumeUsageCommandHandler())  // blocks
   ``` 
   
   So quite a few of the commands block processing other commands on the queue 
while they run. Most of these are probably very fast. The 
`RefreshVolumeUsageCommandHandler` worries me though. I am not sure what it 
does or how often it get called, but if it calls `du` it could be slow. Perhaps 
it should be async. Could closed and create pipeline potentially be slow too if 
there are some issues with Ratis election or the pipeline when closing the 
pipeline?
   
   When you saw problems with the DNs having a lot of queued commands, where 
were the commands queued? On the replication supervisor / delete block queues, 
or was the command queue also backed up with a lot of messages, indicating its 
thread was not able to process the messages onto the sub-queues fast enough, or 
some blocking command was taking a long time to run?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to