[jira] [Updated] (HDDS-11856) The StateMachineThread on DataNode should have a higher priority than the CommandHandlerThread

ASF GitHub Bot (Jira) Wed, 09 Apr 2025 01:15:38 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-11856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


ASF GitHub Bot updated HDDS-11856:
----------------------------------
    Labels: pull-request-available  (was: )

> The StateMachineThread on DataNode should have a higher priority than the 
> CommandHandlerThread
> ----------------------------------------------------------------------------------------------
>
>                 Key: HDDS-11856
>                 URL: https://issues.apache.org/jira/browse/HDDS-11856
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Datanode
>    Affects Versions: 1.4.0
>            Reporter: Shangshu Qian
>            Assignee: Ashish Kumar
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, the state machine thread and the command handler thread are 
> created without an explicit priority setting, making them vulnerable to 
> contention with each other.
> StateMachineThread:
> {code:java}
>   public void startDaemon() {
>     Runnable startStateMachineTask = () -> {
>       try {
>         LOG.info("Ozone container server started.");
>         startStateMachineThread();
>       } catch (Exception ex) {
>         LOG.error("Unable to start the DatanodeState Machine", ex);
>       }
>     };
>     stateMachineThread = new ThreadFactoryBuilder()
>         .setDaemon(true)
>         .setNameFormat(datanodeDetails.threadNamePrefix() +
>             "DatanodeStateMachineDaemonThread")
>         .setUncaughtExceptionHandler((Thread t, Throwable ex) -> {
>           String message = "Terminate Datanode, encounter uncaught exception"
>               + " in Datanode State Machine Thread";
>           ExitUtils.terminate(1, message, ex, LOG);
>         })
>         .build().newThread(startStateMachineTask);
>     stateMachineThread.start();
>   } {code}
> Command handler thread:
> {code:java}
>   private void initCommandHandlerThread(ConfigurationSource config) {
>     Runnable processCommandQueue = () -> {
>       long now;
>       while (getContext().getState() != DatanodeStates.SHUTDOWN) {
>         SCMCommand<?> command = getContext().getNextCommand();
>         if (command != null) {
>           commandDispatcher.handle(command);
>           commandsHandled++;
>         } else {
> ...
>         }
>       }
>     };
>     // We will have only one thread for command processing in a datanode.
>     cmdProcessThread = getCommandHandlerThread(processCommandQueue);
>     cmdProcessThread.start();
>   }
>
>   private Thread getCommandHandlerThread(Runnable processCommandQueue) {
>     Thread handlerThread = new Thread(processCommandQueue);
>     handlerThread.setDaemon(true);
>     handlerThread.setName(
>         datanodeDetails.threadNamePrefix() + "CommandProcessorThread");
>     handlerThread.setUncaughtExceptionHandler((Thread t, Throwable e) -> {
>       LOG.error("Critical Error : Command processor thread encountered an " +
>           "error. Thread: {}", t.toString(), e);
>       getCommandHandlerThread(processCommandQueue).start();
>     });
>     return handlerThread;
>   } {code}
> If the command handler is busy with a large number of tasks, the state 
> machine thread can potentially be delayed. Since the state machine thread 
> also sends heartbeats to the StorageContainerManager (SCM), delays caused 
> by the command handler thread may push the system into a feedback loop.
> For example:
>  # The cluster has a large number of write operations, resulting in a large 
> number of pipeline creations.
>  # Some DataNodes become unresponsive due to overload, so their heartbeats 
> to the SCM are delayed and they are marked as dead nodes.
>  # The pipeline creations are retried, and other nodes also need to 
> actively recover the data from the nodes in step 2. This results in more 
> load being pushed to the cluster.
> Setting the state machine thread to a higher priority than the command 
> handler thread would make this problem less likely to happen.
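
As a rough illustration of the proposal above, the sketch below creates two
daemon threads and gives the state machine thread a higher priority than the
command handler thread. It uses only java.lang.Thread. The class name, the
helper newDaemon, and the two priority constants are hypothetical and are not
taken from the Ozone code base or from the pull request; the issue does not
say which priority values should be used. Thread priorities are only a hint
to the scheduler, and how much effect they have depends on the JVM and the
operating system.

```java
// Minimal sketch, not the actual Ozone change.
class ThreadPrioritySketch {
  // Hypothetical values: the issue only asks that the state machine
  // thread rank above the command handler thread.
  static final int STATE_MACHINE_PRIORITY = Thread.NORM_PRIORITY + 1;
  static final int COMMAND_HANDLER_PRIORITY = Thread.NORM_PRIORITY;

  // Hypothetical helper: builds a named daemon thread with a priority.
  static Thread newDaemon(String name, int priority, Runnable task) {
    Thread t = new Thread(task);
    t.setDaemon(true);
    t.setName(name);
    // Must be within Thread.MIN_PRIORITY..Thread.MAX_PRIORITY.
    t.setPriority(priority);
    return t;
  }

  public static void main(String[] args) throws InterruptedException {
    Thread stateMachine = newDaemon("DatanodeStateMachineDaemonThread",
        STATE_MACHINE_PRIORITY, () -> { /* heartbeat loop would go here */ });
    Thread commandHandler = newDaemon("CommandProcessorThread",
        COMMAND_HANDLER_PRIORITY, () -> { /* command loop would go here */ });

    stateMachine.start();
    commandHandler.start();
    stateMachine.join();
    commandHandler.join();

    System.out.println(stateMachine.getName() + " priority="
        + stateMachine.getPriority());
    System.out.println(commandHandler.getName() + " priority="
        + commandHandler.getPriority());
  }
}
```

For the state machine thread, which is built with Guava's
ThreadFactoryBuilder in the snippet quoted above, the same effect could
presumably be achieved with the builder's setPriority(int) method instead of
calling Thread.setPriority directly.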



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@ozone.apache.org
For additional commands, e-mail: issues-h...@ozone.apache.org
