[ https://issues.apache.org/jira/browse/HDDS-11856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ivan Andika updated HDDS-11856:
-------------------------------
Target Version/s: 2.1.0, 1.4.2 (was: 2.1.0)
> The StateMachineThread on DataNode should have a higher priority than the
> CommandHandlerThread
> ----------------------------------------------------------------------------------------------
>
> Key: HDDS-11856
> URL: https://issues.apache.org/jira/browse/HDDS-11856
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Datanode
> Affects Versions: 1.4.0
> Reporter: Shangshu Qian
> Assignee: Ashish Kumar
> Priority: Major
> Labels: pull-request-available
>
> Currently, the state machine thread and the command handler thread are
> created without an explicit priority setting, so they can contend with
> each other for CPU time.
> StateMachineThread:
> {code:java}
> public void startDaemon() {
>   Runnable startStateMachineTask = () -> {
>     try {
>       LOG.info("Ozone container server started.");
>       startStateMachineThread();
>     } catch (Exception ex) {
>       LOG.error("Unable to start the DatanodeState Machine", ex);
>     }
>   };
>   stateMachineThread = new ThreadFactoryBuilder()
>       .setDaemon(true)
>       .setNameFormat(datanodeDetails.threadNamePrefix() +
>           "DatanodeStateMachineDaemonThread")
>       .setUncaughtExceptionHandler((Thread t, Throwable ex) -> {
>         String message = "Terminate Datanode, encounter uncaught exception"
>             + " in Datanode State Machine Thread";
>         ExitUtils.terminate(1, message, ex, LOG);
>       })
>       .build().newThread(startStateMachineTask);
>   stateMachineThread.start();
> }
> {code}
> Command handler thread:
> {code:java}
> private void initCommandHandlerThread(ConfigurationSource config) {
>   Runnable processCommandQueue = () -> {
>     long now;
>     while (getContext().getState() != DatanodeStates.SHUTDOWN) {
>       SCMCommand<?> command = getContext().getNextCommand();
>       if (command != null) {
>         commandDispatcher.handle(command);
>         commandsHandled++;
>       } else {
>         ...
>       }
>     }
>   };
>
>   // We will have only one thread for command processing in a datanode.
>   cmdProcessThread = getCommandHandlerThread(processCommandQueue);
>   cmdProcessThread.start();
> }
>
> private Thread getCommandHandlerThread(Runnable processCommandQueue) {
>   Thread handlerThread = new Thread(processCommandQueue);
>   handlerThread.setDaemon(true);
>   handlerThread.setName(
>       datanodeDetails.threadNamePrefix() + "CommandProcessorThread");
>   handlerThread.setUncaughtExceptionHandler((Thread t, Throwable e) -> {
>     // Log the failure, then start a replacement command handler thread.
>     LOG.error("Critical Error : Command processor thread encountered an " +
>         "error. Thread: {}", t.toString(), e);
>     getCommandHandlerThread(processCommandQueue).start();
>   });
>   return handlerThread;
> }
> {code}
> If the command handler is busy with a large number of tasks, the state
> machine thread can be delayed. Since the state machine thread also sends
> heartbeats to the StorageContainerManager (SCM), delays caused by the
> command handler thread may push the system into a feedback loop.
> For example:
> # The cluster has a large number of write operations, resulting in a large
> number of pipeline creations.
> # Some DNs become unresponsive due to overload, so their heartbeats (HB) to
> the SCM are delayed and they are marked as dead nodes.
> # The pipeline creations are retried, and other nodes also need to
> actively recover the data from the nodes in step 2. This results in more
> load being pushed to the cluster.
> Setting the state machine thread to a higher priority than the command
> handler thread would make this problem less likely to happen.
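
The proposal above could be sketched roughly as follows. This is a minimal, standalone illustration using only java.lang.Thread, not the actual Ozone change: the class name ThreadPrioritySketch, the helper newDaemonThread, and the two priority constants are hypothetical, and the concrete priority values are assumptions. The same idea should carry over to the ThreadFactoryBuilder used for the state machine thread, since Guava's builder also offers setPriority(int).

```java
/**
 * Sketch only: creates two daemon threads with different priorities,
 * mirroring the state machine thread and the command handler thread.
 * Names and priority values are illustrative assumptions.
 */
final class ThreadPrioritySketch {

  // Hypothetical values; the issue does not specify concrete priorities.
  static final int STATE_MACHINE_PRIORITY = Thread.MAX_PRIORITY;
  static final int COMMAND_HANDLER_PRIORITY = Thread.NORM_PRIORITY;

  private ThreadPrioritySketch() {
  }

  /** Builds a daemon thread with an explicit name and priority. */
  static Thread newDaemonThread(String name, int priority, Runnable task) {
    Thread t = new Thread(task);
    t.setDaemon(true);
    t.setName(name);
    // The priority is a scheduling hint; how strongly it is honored
    // depends on the JVM and the operating system.
    t.setPriority(priority);
    return t;
  }

  public static void main(String[] args) throws InterruptedException {
    Thread stateMachine = newDaemonThread(
        "DatanodeStateMachineDaemonThread", STATE_MACHINE_PRIORITY, () -> { });
    Thread commandHandler = newDaemonThread(
        "CommandProcessorThread", COMMAND_HANDLER_PRIORITY, () -> { });

    stateMachine.start();
    commandHandler.start();
    stateMachine.join();
    commandHandler.join();

    System.out.println(stateMachine.getName()
        + " priority=" + stateMachine.getPriority());
    System.out.println(commandHandler.getName()
        + " priority=" + commandHandler.getPriority());
  }
}
```

Because Java thread priorities are only hints, a change like this would make heartbeat starvation less likely rather than rule it out, which matches the wording of the proposal.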