Alexey Serbin created KUDU-3500:
-----------------------------------
Summary: Don't start write operations timed out in the tablet's
prepare queue
Key: KUDU-3500
URL: https://issues.apache.org/jira/browse/KUDU-3500
Project: Kudu
Issue Type: Improvement
Components: tserver
Reporter: Alexey Serbin
While troubleshooting a performance issue where the prepare queue of a tablet
was very long, I noticed that tablet servers start write operations that
correspond to RPCs that have already timed out. Most likely, the client that
sent the RPC has already detected the timeout and treats the write as failed,
so there is little point in starting such operations at all.
As a simple optimization, tablet servers shouldn't even start the PREPARE phase
for such operations, but respond with a TimedOut error status right away when
such an operation is dispatched to the prepare thread. Doing so would help
drain the queue sooner and let requests that have not yet timed out be
processed faster, increasing the overall robustness of a tablet server when the
load is high and the node's CPU and disk IO bandwidth are saturated.
A new metric should be introduced to track the number of WriteRequestPB RPCs
that timed out in the prepare queue and were answered with a TimedOut error
status before the PREPARE phase started for the corresponding operations.
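To illustrate the idea, here is a minimal, self-contained C++ sketch of the
proposed check. It is not Kudu code: PendingWrite, Outcome, PrepareQueueStats
and DrainPrepareQueue are hypothetical names made up for this example, and the
real prepare pool, RPC context and metric types in the tablet server look
different. The sketch only shows the shape of the logic: compare each queued
operation's deadline against the current time at dispatch, reject expired
operations without preparing them, and count the rejections.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <deque>
#include <string>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-in for a write operation waiting in the prepare queue.
struct PendingWrite {
  std::string id;
  Clock::time_point deadline;  // assumed to be derived from the RPC timeout
};

enum class Outcome { kPrepared, kTimedOut };

// Hypothetical counter standing in for the proposed metric.
struct PrepareQueueStats {
  std::atomic<uint64_t> timed_out_in_queue{0};
};

// Drains the queue in FIFO order. An operation whose deadline has already
// passed at dispatch time is rejected right away and counted; all other
// operations proceed to the (omitted) PREPARE phase.
std::vector<std::pair<std::string, Outcome>> DrainPrepareQueue(
    std::deque<PendingWrite>& queue,
    Clock::time_point now,
    PrepareQueueStats& stats) {
  std::vector<std::pair<std::string, Outcome>> results;
  while (!queue.empty()) {
    PendingWrite op = std::move(queue.front());
    queue.pop_front();
    if (now >= op.deadline) {
      // The client has most likely given up on this RPC already, so skip
      // PREPARE and report a timeout instead.
      stats.timed_out_in_queue.fetch_add(1, std::memory_order_relaxed);
      results.emplace_back(std::move(op.id), Outcome::kTimedOut);
      continue;
    }
    // A real server would start the PREPARE phase here.
    results.emplace_back(std::move(op.id), Outcome::kPrepared);
  }
  return results;
}
```

In this sketch the deadline check happens once, at the moment the operation is
taken off the queue, which matches the "when dispatched to the prepare thread"
wording above. Passing the current time in as a parameter is only a choice made
here to keep the example deterministic.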
--
This message was sent by Atlassian Jira
(v8.20.10#820010)