This is an automated email from the ASF dual-hosted git repository.
alexey pushed a commit to branch branch-1.18.x
in repository https://gitbox.apache.org/repos/asf/kudu.git
The following commit(s) were added to refs/heads/branch-1.18.x by this push:
new f857b4778 [rpc-test] fix flakiness in TimedOutOnResponseMetricServiceQueue
f857b4778 is described below
commit f857b4778078f05035bde5a30f22fabea12d1bbb
Author: Alexey Serbin <[email protected]>
AuthorDate: Sat Apr 5 11:14:25 2025 -0700
[rpc-test] fix flakiness in TimedOutOnResponseMetricServiceQueue
Before this patch, the TestRpc.TimedOutOnResponseMetricServiceQueue
scenario would fail intermittently (about once in every 100 runs)
when running on a VM in the AWS cloud, with an error message like this:
src/kudu/rpc/rpc-test.cc:1499: Failure
Value of: s.IsTimedOut()
Actual: false
Expected: true
Remote error: Got some error
I haven't seen it when running on nodes backed by dedicated hardware,
and I guess that's due to scheduler anomalies induced by the shared
virtual environment. However, I didn't look deeper and just increased
the sleep time 10x and added a 2x margin for the RPC timeout. With this
patch, there hasn't been a single failure in more than 10K runs of the
test scenario. The scenario still runs quite fast on modern hardware:
less than 500ms.
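The timing change can be checked with a little arithmetic. The sketch
below is illustrative only and is not part of the patch: MarginMicros
and the kOld*/kNew* names are hypothetical. It assumes the scenario
needs the queued request to time out before the sleeping call finishes.

```cpp
#include <cstdint>

// Hypothetical helper (not in Kudu): the gap, in microseconds, between the
// client-side RPC timeout and the end of the server-side sleep. A gap near
// zero leaves the test outcome to scheduler jitter.
constexpr int64_t MarginMicros(uint64_t sleep_micros, uint64_t timeout_micros) {
  return static_cast<int64_t>(sleep_micros) -
         static_cast<int64_t>(timeout_micros);
}

// Before the patch: the timeout was equal to the sleep time.
constexpr uint64_t kOldSleepMicros = 20 * 1000;
constexpr uint64_t kOldTimeoutMicros = kOldSleepMicros;

// After the patch: 10x longer sleep, timeout set to half of it.
constexpr uint64_t kNewSleepMicros = 200 * 1000;
constexpr uint64_t kNewTimeoutMicros = kNewSleepMicros / 2;

static_assert(MarginMicros(kOldSleepMicros, kOldTimeoutMicros) == 0,
              "old settings left no margin");
static_assert(MarginMicros(kNewSleepMicros, kNewTimeoutMicros) == 100 * 1000,
              "new settings leave a 100 ms margin");
```

Under that assumption, the old settings left no slack between the two
events, while the new ones leave 100 ms.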
Change-Id: I70008836a38def70e097bc4547ac2a66e7203e35
Reviewed-on: http://gerrit.cloudera.org:8080/22747
Tested-by: Alexey Serbin <[email protected]>
Reviewed-by: Abhishek Chennaka <[email protected]>
(cherry picked from commit e65914e386339a11e61d5a8cb24d66bf606fd41a)
Reviewed-on: http://gerrit.cloudera.org:8080/22797
---
src/kudu/rpc/rpc-test.cc | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/src/kudu/rpc/rpc-test.cc b/src/kudu/rpc/rpc-test.cc
index 119807269..80b09f7d2 100644
--- a/src/kudu/rpc/rpc-test.cc
+++ b/src/kudu/rpc/rpc-test.cc
@@ -1410,7 +1410,7 @@ TEST_P(TestRpc, TimedOutOnResponseMetric) {
// A special scenario for the per-RPC 'timed_out_on_response' metric when an
// RPC times out while waiting in the queue, so it's not actually processed.
TEST_P(TestRpc, TimedOutOnResponseMetricServiceQueue) {
- constexpr uint64_t kSleepMicros = 20 * 1000;
+ constexpr uint64_t kSleepMicros = 200 * 1000;
const string kMethodName = "Sleep";
// Set RPC connection negotiation timeout to be very high to avoid flakiness
@@ -1473,7 +1473,9 @@ TEST_P(TestRpc, TimedOutOnResponseMetricServiceQueue) {
req1.set_return_app_error(true);
SleepResponsePB resp1;
RpcController ctl1;
- ctl1.set_timeout(MonoDelta::FromMicroseconds(kSleepMicros));
+ // Add an extra margin for the timeout setting to avoid flakiness
+ // due to scheduler anomalies and off-by-one differences in timestamps.
+ ctl1.set_timeout(MonoDelta::FromMicroseconds(kSleepMicros / 2));
p.AsyncRequest(kMethodName, req1, &resp1, &ctl1,
[&latch]() { latch.CountDown(); });