[
https://issues.apache.org/jira/browse/HDDS-11856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shangshu Qian updated HDDS-11856:
---------------------------------
Description:
Currently, the state machine thread and the command handler thread are created
without an explicit priority, which leaves them vulnerable to contention with
each other.
StateMachineThread:
{code:java}
public void startDaemon() {
  Runnable startStateMachineTask = () -> {
    try {
      LOG.info("Ozone container server started.");
      startStateMachineThread();
    } catch (Exception ex) {
      LOG.error("Unable to start the DatanodeState Machine", ex);
    }
  };
  stateMachineThread = new ThreadFactoryBuilder()
      .setDaemon(true)
      .setNameFormat(datanodeDetails.threadNamePrefix() +
          "DatanodeStateMachineDaemonThread")
      .setUncaughtExceptionHandler((Thread t, Throwable ex) -> {
        String message = "Terminate Datanode, encounter uncaught exception"
            + " in Datanode State Machine Thread";
        ExitUtils.terminate(1, message, ex, LOG);
      })
      .build().newThread(startStateMachineTask);
  stateMachineThread.start();
}
{code}
Command handler thread:
{code:java}
private void initCommandHandlerThread(ConfigurationSource config) {
  Runnable processCommandQueue = () -> {
    long now;
    while (getContext().getState() != DatanodeStates.SHUTDOWN) {
      SCMCommand<?> command = getContext().getNextCommand();
      if (command != null) {
        commandDispatcher.handle(command);
        commandsHandled++;
      } else {
        ...
      }
    }
  };
  // We will have only one thread for command processing in a datanode.
  cmdProcessThread = getCommandHandlerThread(processCommandQueue);
  cmdProcessThread.start();
}

private Thread getCommandHandlerThread(Runnable processCommandQueue) {
  Thread handlerThread = new Thread(processCommandQueue);
  handlerThread.setDaemon(true);
  handlerThread.setName(
      datanodeDetails.threadNamePrefix() + "CommandProcessorThread");
  handlerThread.setUncaughtExceptionHandler((Thread t, Throwable e) -> {
    LOG.error("Critical Error : Command processor thread encountered an " +
        "error. Thread: {}", t.toString(), e);
    getCommandHandlerThread(processCommandQueue).start();
  });
  return handlerThread;
}
{code}
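To illustrate what "no explicit priority" means in practice, here is a minimal standalone sketch using only the JDK. A thread created with new Thread(...) starts with the priority of the thread that created it, while Executors.defaultThreadFactory() resets it to Thread.NORM_PRIORITY. As far as I can tell, Guava's ThreadFactoryBuilder falls back to that default factory when no backing factory is set, but that is an assumption about the library, not something taken from the Ozone code. The class and method names below are hypothetical.

```java
import java.util.concurrent.Executors;

final class DefaultPriorityDemo {
  /** Mirrors getCommandHandlerThread: plain new Thread, no setPriority call. */
  static Thread plainThread(Runnable task) {
    Thread t = new Thread(task);
    t.setDaemon(true);
    return t;
  }

  /** Uses the JDK default factory, which sets the priority to NORM_PRIORITY. */
  static Thread factoryThread(Runnable task) {
    return Executors.defaultThreadFactory().newThread(task);
  }

  public static void main(String[] args) {
    Runnable noop = () -> { };
    // The threads are never started; we only inspect their initial priority.
    System.out.println("creator: " + Thread.currentThread().getPriority());
    System.out.println("plain:   " + plainThread(noop).getPriority());
    System.out.println("factory: " + factoryThread(noop).getPriority());
  }
}
```

Either way, unless the creating thread runs at a non-default priority, both threads end up at the same level, so the scheduler has no reason to prefer one over the other.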
If the command handler is busy with a large number of tasks, the state machine
thread can be delayed. Because the state machine thread also sends heartbeats
to the StorageContainerManager (SCM), delays caused by the command handler
thread may push the system into a feedback loop.
For example:
# The cluster has a large number of write operations, resulting in a large
number of pipeline creations.
# Some DNs become unresponsive due to overload, so their heartbeats (HB) to
the SCM are delayed and they are marked as dead nodes.
# The pipeline creations are retried, and other nodes also need to actively
recover the data from the nodes in step 2. This results in more load being
pushed to the cluster.
Setting the state machine thread to a higher priority than the command handler
thread would make this problem less likely to happen.
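A rough sketch of the proposed change, using only java.lang.Thread so it can run on its own. The priority values, the class name, and the daemonFactory helper are all assumptions for illustration; the issue does not specify which priorities to use or how the patch should be structured.

```java
import java.util.concurrent.ThreadFactory;

final class PrioritizedThreadsSketch {
  // Assumed values for illustration; the issue does not name specific priorities.
  static final int STATE_MACHINE_PRIORITY = Thread.NORM_PRIORITY + 1;
  static final int COMMAND_HANDLER_PRIORITY = Thread.NORM_PRIORITY;

  /** Hypothetical helper: builds named daemon threads with an explicit priority. */
  static ThreadFactory daemonFactory(String name, int priority) {
    return task -> {
      Thread t = new Thread(task);
      t.setDaemon(true);
      t.setName(name);
      t.setPriority(priority);
      return t;
    };
  }

  public static void main(String[] args) throws InterruptedException {
    Thread stateMachine =
        daemonFactory("DatanodeStateMachineDaemonThread", STATE_MACHINE_PRIORITY)
            .newThread(() -> System.out.println("state machine / heartbeat work"));
    Thread commandHandler =
        daemonFactory("CommandProcessorThread", COMMAND_HANDLER_PRIORITY)
            .newThread(() -> System.out.println("command handling work"));

    stateMachine.start();
    commandHandler.start();
    stateMachine.join();
    commandHandler.join();

    System.out.println(stateMachine.getName() + " priority=" + stateMachine.getPriority());
    System.out.println(commandHandler.getName() + " priority=" + commandHandler.getPriority());
  }
}
```

In the real code the same effect could presumably be reached by calling setPriority on the ThreadFactoryBuilder chain in startDaemon and on handlerThread in getCommandHandlerThread; setting it in the latter would also cover the replacement thread started from the uncaught exception handler. Note that Java thread priorities are only a hint to the scheduler, and their effect depends on the platform and JVM settings, so this would reduce the likelihood of delayed heartbeats rather than rule them out.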
was:
Currently, the state machine thread and the command handler thread are created
without an explicit priority, which leaves them vulnerable to contention with
each other.
StateMachineThread:
{code:java}
public void startDaemon() {
  Runnable startStateMachineTask = () -> {
    try {
      LOG.info("Ozone container server started.");
      startStateMachineThread();
    } catch (Exception ex) {
      LOG.error("Unable to start the DatanodeState Machine", ex);
    }
  };
  stateMachineThread = new ThreadFactoryBuilder()
      .setDaemon(true)
      .setNameFormat(datanodeDetails.threadNamePrefix() +
          "DatanodeStateMachineDaemonThread")
      .setUncaughtExceptionHandler((Thread t, Throwable ex) -> {
        String message = "Terminate Datanode, encounter uncaught exception"
            + " in Datanode State Machine Thread";
        ExitUtils.terminate(1, message, ex, LOG);
      })
      .build().newThread(startStateMachineTask);
  stateMachineThread.start();
}
{code}
Command handler thread:
{code:java}
private void initCommandHandlerThread(ConfigurationSource config) {
  /*
   * Task that periodically checks if we have any outstanding commands.
   * It is assumed that commands can be processed slowly and in order.
   * This assumption might change in future. Right now due to this assumption
   * we have single command queue process thread.
   */
  Runnable processCommandQueue = () -> {
    long now;
    while (getContext().getState() != DatanodeStates.SHUTDOWN) {
      SCMCommand<?> command = getContext().getNextCommand();
      if (command != null) {
        commandDispatcher.handle(command);
        commandsHandled++;
      } else {
        try {
          // Sleep till the next HB + 1 second.
          now = Time.monotonicNow();
          if (nextHB.get() > now) {
            Thread.sleep((nextHB.get() - now) + 1000L);
          }
        } catch (InterruptedException e) {
          // Ignore this exception.
          Thread.currentThread().interrupt();
        }
      }
    }
  };
  // We will have only one thread for command processing in a datanode.
  cmdProcessThread = getCommandHandlerThread(processCommandQueue);
  cmdProcessThread.start();
}

private Thread getCommandHandlerThread(Runnable processCommandQueue) {
  Thread handlerThread = new Thread(processCommandQueue);
  handlerThread.setDaemon(true);
  handlerThread.setName(
      datanodeDetails.threadNamePrefix() + "CommandProcessorThread");
  handlerThread.setUncaughtExceptionHandler((Thread t, Throwable e) -> {
    // Let us just restart this thread after logging a critical error.
    // if this thread is not running we cannot handle commands from SCM.
    LOG.error("Critical Error : Command processor thread encountered an " +
        "error. Thread: {}", t.toString(), e);
    getCommandHandlerThread(processCommandQueue).start();
  });
  return handlerThread;
}
{code}
If the command handler is busy with a large number of tasks, the state machine
thread can be delayed. Because the state machine thread also sends heartbeats
to the StorageContainerManager (SCM), delays caused by the command handler
thread may push the system into a feedback loop.
For example:
# The cluster has a large number of write operations, resulting in a large
number of pipeline creations.
# Some DNs become unresponsive due to overload, so their heartbeats (HB) to
the SCM are delayed and they are marked as dead nodes.
# The pipeline creations are retried, and other nodes also need to actively
recover the data from the nodes in step 2. This results in more load being
pushed to the cluster.
Setting the state machine thread to a higher priority than the command handler
thread would make this problem less likely to happen.
> The StateMachineThread on DataNode should have a higher priority than the
> CommandHandlerThread
> ----------------------------------------------------------------------------------------------
>
> Key: HDDS-11856
> URL: https://issues.apache.org/jira/browse/HDDS-11856
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Datanode
> Affects Versions: 1.4.0
> Reporter: Shangshu Qian
> Priority: Major
>