[ https://issues.apache.org/jira/browse/HDDS-11856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HDDS-11856:
----------------------------------
    Labels: pull-request-available  (was: )

> The StateMachineThread on DataNode should have a higher priority than the
> CommandHandlerThread
> ----------------------------------------------------------------------------------------------
>
>                 Key: HDDS-11856
>                 URL: https://issues.apache.org/jira/browse/HDDS-11856
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Datanode
>    Affects Versions: 1.4.0
>            Reporter: Shangshu Qian
>            Assignee: Ashish Kumar
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, the state machine thread and the command handler thread are
> created without an explicit priority, so they can contend with each other
> on equal terms.
> StateMachineThread:
> {code:java}
> public void startDaemon() {
>   Runnable startStateMachineTask = () -> {
>     try {
>       LOG.info("Ozone container server started.");
>       startStateMachineThread();
>     } catch (Exception ex) {
>       LOG.error("Unable to start the DatanodeState Machine", ex);
>     }
>   };
>   stateMachineThread = new ThreadFactoryBuilder()
>       .setDaemon(true)
>       .setNameFormat(datanodeDetails.threadNamePrefix() +
>           "DatanodeStateMachineDaemonThread")
>       .setUncaughtExceptionHandler((Thread t, Throwable ex) -> {
>         String message = "Terminate Datanode, encounter uncaught exception"
>             + " in Datanode State Machine Thread";
>         ExitUtils.terminate(1, message, ex, LOG);
>       })
>       .build().newThread(startStateMachineTask);
>   stateMachineThread.start();
> } {code}
> Command handler thread:
> {code:java}
> private void initCommandHandlerThread(ConfigurationSource config) {
>   /* ... */
>   Runnable processCommandQueue = () -> {
>     long now;
>     while (getContext().getState() != DatanodeStates.SHUTDOWN) {
>       SCMCommand<?> command = getContext().getNextCommand();
>       if (command != null) {
>         commandDispatcher.handle(command);
>         commandsHandled++;
>       } else {
>         ...
>       }
>     }
>   };
>   // We will have only one thread for command processing in a
>   // datanode.
>   cmdProcessThread = getCommandHandlerThread(processCommandQueue);
>   cmdProcessThread.start();
> }
>
> private Thread getCommandHandlerThread(Runnable processCommandQueue) {
>   Thread handlerThread = new Thread(processCommandQueue);
>   handlerThread.setDaemon(true);
>   handlerThread.setName(
>       datanodeDetails.threadNamePrefix() + "CommandProcessorThread");
>   handlerThread.setUncaughtExceptionHandler((Thread t, Throwable e) -> {
>     // Log the failure, then restart command processing on a fresh thread.
>     LOG.error("Critical Error : Command processor thread encountered an " +
>         "error. Thread: {}", t.toString(), e);
>     getCommandHandlerThread(processCommandQueue).start();
>   });
>   return handlerThread;
> } {code}
> If the command handler is busy with a large number of tasks, the state
> machine thread can be delayed. Since the state machine thread also sends
> heartbeats to the StorageContainerManager (SCM), delays caused by the
> command handler thread may push the system into a feedback loop.
> For example:
> # The cluster has a large number of write operations, resulting in a large
> number of pipeline creations.
> # Some DNs become unresponsive due to overload, so their heartbeats (HB) to
> the SCM are delayed and they are marked as dead nodes.
> # The pipeline creations are retried, and other nodes also need to
> actively recover the data from the nodes in step 2. This results in more
> load being pushed onto the cluster.
> Setting the state machine thread to a higher priority than the command
> handler thread would make this problem less likely to happen.
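
A minimal sketch of the proposed direction, using only java.lang.Thread and not the actual Ozone classes. The class ThreadPrioritySketch and the helper newDaemonThread are hypothetical names introduced for illustration, and the specific priority values (MAX_PRIORITY for the state machine thread, NORM_PRIORITY for the command handler thread) are assumptions, not values taken from the issue or from any patch.

```java
public class ThreadPrioritySketch {

  // Hypothetical helper (not part of Ozone): creates a daemon thread
  // with the given name and an explicit scheduling priority.
  public static Thread newDaemonThread(String name, int priority,
      Runnable task) {
    Thread thread = new Thread(task);
    thread.setDaemon(true);
    thread.setName(name);
    // Valid values range from Thread.MIN_PRIORITY (1) to
    // Thread.MAX_PRIORITY (10); the value is a hint to the scheduler.
    thread.setPriority(priority);
    return thread;
  }

  public static void main(String[] args) throws InterruptedException {
    // Assumed choice: the heartbeat-sending thread gets the higher priority.
    Thread stateMachineThread = newDaemonThread(
        "DatanodeStateMachineDaemonThread", Thread.MAX_PRIORITY,
        () -> { /* state machine loop would run here */ });

    // Assumed choice: command processing stays at the default priority.
    Thread cmdProcessThread = newDaemonThread(
        "CommandProcessorThread", Thread.NORM_PRIORITY,
        () -> { /* command queue processing would run here */ });

    stateMachineThread.start();
    cmdProcessThread.start();
    stateMachineThread.join();
    cmdProcessThread.join();

    System.out.println(stateMachineThread.getName() + " priority="
        + stateMachineThread.getPriority());
    System.out.println(cmdProcessThread.getName() + " priority="
        + cmdProcessThread.getPriority());
  }
}
```

How much a Java thread priority actually influences scheduling depends on the JVM and the operating system, so this should be read as a hint that makes the feedback loop less likely, not as a guarantee. If the ThreadFactoryBuilder used in startDaemon() is Guava's, its setPriority(int) method could serve the same purpose there.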