[ 
https://issues.apache.org/jira/browse/KUDU-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Serbin updated KUDU-3500:
--------------------------------
    Description: 
While troubleshooting one performance issue where the prepare queue of a tablet 
was very long, I noticed that tablet servers start write operations that 
correspond to RPCs that have already timed out.  Most likely, the client that 
sent the RPC had already detected the timeout and expected that the write would 
have failed already, so there is little point in starting such operations anyway.

As a simple optimization, tablet servers shouldn't even start the PREPARE phase 
for such operations, but should respond with a TimedOut error status right away 
when such an operation is dispatched to the prepare thread.  Doing so would help 
clear the prepare queue and process not-yet-timed-out requests from the queue 
faster, increasing the overall robustness of a tablet server when the load is 
high and the node's CPU and disk IO bandwidth are saturated.

A new metric should be introduced to track the number of WriteRequestPB RPCs 
that timed out in the prepare queue and were answered with a TimedOut error 
status before the PREPARE phase started for the corresponding operations.
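
The proposed check and counter can be illustrated with a toy, single-threaded 
model.  This is only a sketch under stated assumptions, not Kudu's C++ 
implementation: the names WriteOp, PrepareQueue, run_once and 
timed_out_in_queue are hypothetical, and the PREPARE work is reduced to a 
status change.

```python
import time
from collections import deque
from dataclasses import dataclass


@dataclass
class WriteOp:
    """Hypothetical stand-in for a write operation waiting in the queue."""
    op_id: int
    deadline: float          # absolute RPC deadline on the monotonic clock (s)
    status: str = "PENDING"


class PrepareQueue:
    """Toy model of a tablet's prepare queue (not Kudu's actual classes)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._queue = deque()
        # Hypothetical counter standing in for the proposed metric.
        self.timed_out_in_queue = 0

    def submit(self, op):
        self._queue.append(op)

    def run_once(self):
        """Dispatch the next queued operation, if any, and return it."""
        if not self._queue:
            return None
        op = self._queue.popleft()
        if self._clock() >= op.deadline:
            # The RPC deadline has passed: skip PREPARE, reply TimedOut.
            op.status = "TIMED_OUT"
            self.timed_out_in_queue += 1
        else:
            # Placeholder for the real PREPARE phase.
            op.status = "PREPARED"
        return op


# Usage with a fake clock so the outcome does not depend on wall time.
now = [100.0]
queue = PrepareQueue(clock=lambda: now[0])
queue.submit(WriteOp(op_id=1, deadline=100.5))   # expires while queued
queue.submit(WriteOp(op_id=2, deadline=105.0))   # still within its deadline
now[0] = 101.0
first, second = queue.run_once(), queue.run_once()
print(first.status, second.status, queue.timed_out_in_queue)
```

In this model the deadline is checked at the moment an operation is taken off 
the queue, so an expired request costs one comparison instead of a full 
PREPARE phase, and the counter increases only for requests rejected that way.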

  was:
While troubleshooting one performance issue where the prepare queue of a tablet 
was very long, I noticed that tablet servers start write operations that 
corresponds to RPCs that have already timed out.  Most likely, the client that 
sent the RPC has already detected the timeout and expects that the write had 
failed already, so there isn't much sense to start such operations anyway.

As a simple optimization, tablet servers shouldn't even start the PREPARE phase 
for such operations, but respond with TimedOut error status right away when 
such an operation is dispatched to the prepare thread.  Doing so would help 
with clearing the queue and processing not-yet-timed-out requests from the 
queue faster, increasing the overall robustness of a tablet server when the 
load is high and the node's CPU and disk IO bandwidth are saturated.

A new metric should be introduced to track the number of WriteRequestPB RPCs 
timed out in the prepare queue and responded with TimedOut error status before 
starting the PREPARE phase for the corresponding operations.


> Don't start write operations timed out in the tablet's prepare queue
> --------------------------------------------------------------------
>
>                 Key: KUDU-3500
>                 URL: https://issues.apache.org/jira/browse/KUDU-3500
>             Project: Kudu
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Alexey Serbin
>            Assignee: Alexey Serbin
>            Priority: Major
>
> While troubleshooting one performance issue where the prepare queue of a 
> tablet was very long, I noticed that tablet servers start write operations 
> that correspond to RPCs that have already timed out.  Most likely, the client 
> that sent the RPC had already detected the timeout and expected that the 
> write would have failed already, so there is little point in starting such 
> operations anyway.
> As a simple optimization, tablet servers shouldn't even start the PREPARE 
> phase for such operations, but should respond with a TimedOut error status 
> right away when such an operation is dispatched to the prepare thread.  Doing 
> so would help clear the prepare queue and process not-yet-timed-out requests 
> from the queue faster, increasing the overall robustness of a tablet server 
> when the load is high and the node's CPU and disk IO bandwidth are saturated.
> A new metric should be introduced to track the number of WriteRequestPB RPCs 
> that timed out in the prepare queue and were answered with a TimedOut error 
> status before the PREPARE phase started for the corresponding operations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
