This is an automated email from the ASF dual-hosted git repository.

alexey pushed a commit to branch branch-1.18.x
in repository https://gitbox.apache.org/repos/asf/kudu.git


The following commit(s) were added to refs/heads/branch-1.18.x by this push:
     new f857b4778 [rpc-test] fix flakiness in TimedOutOnResponseMetricServiceQueue
f857b4778 is described below

commit f857b4778078f05035bde5a30f22fabea12d1bbb
Author: Alexey Serbin <[email protected]>
AuthorDate: Sat Apr 5 11:14:25 2025 -0700

    [rpc-test] fix flakiness in TimedOutOnResponseMetricServiceQueue
    
    Before this patch, the TestRpc.TimedOutOnResponseMetricServiceQueue
    scenario would fail from time to time (about once in every 100 runs)
    when running on a VM in the AWS cloud, with an error message like the
    one below:
    
      src/kudu/rpc/rpc-test.cc:1499: Failure
      Value of: s.IsTimedOut()
        Actual: false
      Expected: true
      Remote error: Got some error
    
    I haven't seen this failure when running on nodes backed by dedicated
    hardware, so I guess it's due to scheduler anomalies induced by the
    shared virtual environment.  However, I didn't look deeper and just
    increased the sleep time 10x and set the RPC timeout to half of the
    sleep time, giving a 2x margin.  With this patch, there hasn't been a
    single failure in more than 10K runs of the test scenario.  The scenario
    still runs quite fast on modern hardware: less than 500ms.
    
    Change-Id: I70008836a38def70e097bc4547ac2a66e7203e35
    Reviewed-on: http://gerrit.cloudera.org:8080/22747
    Tested-by: Alexey Serbin <[email protected]>
    Reviewed-by: Abhishek Chennaka <[email protected]>
    (cherry picked from commit e65914e386339a11e61d5a8cb24d66bf606fd41a)
    Reviewed-on: http://gerrit.cloudera.org:8080/22797
---
 src/kudu/rpc/rpc-test.cc | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/kudu/rpc/rpc-test.cc b/src/kudu/rpc/rpc-test.cc
index 119807269..80b09f7d2 100644
--- a/src/kudu/rpc/rpc-test.cc
+++ b/src/kudu/rpc/rpc-test.cc
@@ -1410,7 +1410,7 @@ TEST_P(TestRpc, TimedOutOnResponseMetric) {
 // A special scenario for the per-RPC 'timed_out_on_response' metric when an
 // RPC times out while waiting in the queue, so it's not actually processed.
 TEST_P(TestRpc, TimedOutOnResponseMetricServiceQueue) {
-  constexpr uint64_t kSleepMicros = 20 * 1000;
+  constexpr uint64_t kSleepMicros = 200 * 1000;
   const string kMethodName = "Sleep";
 
   // Set RPC connection negotiation timeout to be very high to avoid flakiness
@@ -1473,7 +1473,9 @@ TEST_P(TestRpc, TimedOutOnResponseMetricServiceQueue) {
   req1.set_return_app_error(true);
   SleepResponsePB resp1;
   RpcController ctl1;
-  ctl1.set_timeout(MonoDelta::FromMicroseconds(kSleepMicros));
+  // Add an extra margin for the timeout setting to avoid flakiness
+  // due to scheduler anomalies and off-by-one differences in timestamps.
+  ctl1.set_timeout(MonoDelta::FromMicroseconds(kSleepMicros / 2));
   p.AsyncRequest(kMethodName, req1, &resp1, &ctl1,
                  [&latch]() { latch.CountDown(); });
 
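The timing relationship the patch relies on can be sketched in isolation. The snippet below is a minimal illustration using only the C++ standard library, not Kudu code: the helper name WaiterTimesOut is hypothetical, and a std::future stands in for a request waiting behind a busy service thread. With the wait bounded at half of the busy period, the waiter should give up first, even if the scheduler delays either side somewhat.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <future>
#include <thread>

// Same value as in the patched test; everything else here is an assumption
// made for illustration and does not use the Kudu RPC API.
constexpr uint64_t kSleepMicros = 200 * 1000;

// Hypothetical helper: returns true if a waiter bounded by 'timeout_micros'
// gives up before a worker that stays busy for 'busy_micros' finishes.
bool WaiterTimesOut(uint64_t busy_micros, uint64_t timeout_micros) {
  // The worker stands in for a service thread occupied by an earlier request.
  std::future<void> busy = std::async(std::launch::async, [busy_micros]() {
    std::this_thread::sleep_for(std::chrono::microseconds(busy_micros));
  });
  assert(busy.valid());

  // The waiter stands in for a queued request with its own timeout.
  const bool timed_out =
      busy.wait_for(std::chrono::microseconds(timeout_micros)) ==
      std::future_status::timeout;

  // Join the worker before returning so no thread outlives the call.
  busy.get();
  return timed_out;
}
```

Calling WaiterTimesOut(kSleepMicros, kSleepMicros / 2) mirrors the patched settings. The larger the absolute gap between the two durations, the less a few milliseconds of scheduling jitter can change the outcome, which is the reasoning behind raising kSleepMicros as well as halving the timeout.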
