[ https://issues.apache.org/jira/browse/HBASE-30265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HBASE-30265:
-----------------------------------
Labels: pull-request-available (was: )
> Fix flaky TestProcDispatcher.testRetryLimitOnConnClosedErrors
> -------------------------------------------------------------
>
> Key: HBASE-30265
> URL: https://issues.apache.org/jira/browse/HBASE-30265
> Project: HBase
> Issue Type: Bug
> Components: test
> Reporter: Xiao Liu
> Assignee: Xiao Liu
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.7.0, 3.0.0-beta-2, 2.5.16, 2.6.7
>
>
> h3. Symptom
> {{TestProcDispatcher.testRetryLimitOnConnClosedErrors}} fails intermittently
> (seen in https://github.com/apache/hbase/actions/runs/28292988640): it times out
> in the first {{waitFor}} with {{Num of SCPs: 0}}.
> h3. Root cause
> The test helper {{RSProcDispatcher}} decides when to inject connection errors
> based on a global static count of {{sendRequest()}} calls (it throws on the
> 8th-13th and 18th-23rd calls). {{remoteDispatch()}} runs for *every* remote
> procedure in the cluster (startup, table creation, flush/compact, chores,
> assignments), so the number of background dispatches that happen before the
> test's region moves is nondeterministic. On a busy run the counter is already
> past the injection window by the time the moves happen, so no error is
> injected, the fail-fast retry limit is never reached, no
> {{ServerCrashProcedure}} is scheduled, and the assertion times out.
> h3. Fix
> Bind error injection to the open/close-region requests of the test's own table
> (driven explicitly by the test) instead of a global counter, so it
> deterministically targets the operations under test regardless of background
> activity.
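
The difference between the two injection policies can be illustrated with a
small standalone sketch. This is not the HBase test code: the class
InjectionSketch, the Request type, the request kinds and the table names are
all hypothetical and exist only for illustration. The sketch assumes the
test's region moves amount to six open/close requests, and it models the
counter policy with the two windows quoted in the description.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Hypothetical illustration only; none of these names are HBase APIs.
final class InjectionSketch {

  /** Simplified stand-in for a remote request. */
  static final class Request {
    final String kind;
    final String table;

    Request(String kind, String table) {
      this.kind = kind;
      this.table = table;
    }
  }

  /** Counter policy: fail the 8th-13th and 18th-23rd calls, whatever they are. */
  static Predicate<Request> byGlobalCount() {
    AtomicInteger calls = new AtomicInteger();
    return r -> {
      int n = calls.incrementAndGet();
      return (n >= 8 && n <= 13) || (n >= 18 && n <= 23);
    };
  }

  /** Bound policy: fail only open/close requests of one table, up to a limit. */
  static Predicate<Request> byTable(String table, int maxFailures) {
    AtomicInteger remaining = new AtomicInteger(maxFailures);
    return r -> {
      boolean targeted = table.equals(r.table)
          && (r.kind.equals("OPEN_REGION") || r.kind.equals("CLOSE_REGION"));
      // The budget is only consumed by targeted requests (short-circuit &&).
      return targeted && remaining.getAndDecrement() > 0;
    };
  }

  /**
   * Replays some background requests, then six open/close requests for the
   * test's table, and returns how many of those six received an injected error.
   */
  static int injectedIntoTest(Predicate<Request> policy, int backgroundCalls) {
    for (int i = 0; i < backgroundCalls; i++) {
      policy.test(new Request("FLUSH", "otherTable"));
    }
    int hits = 0;
    for (int i = 0; i < 6; i++) {
      String kind = (i % 2 == 0) ? "CLOSE_REGION" : "OPEN_REGION";
      if (policy.test(new Request(kind, "testTable"))) {
        hits++;
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    System.out.println("counter, quiet run: " + injectedIntoTest(byGlobalCount(), 7));
    System.out.println("counter, busy run:  " + injectedIntoTest(byGlobalCount(), 30));
    System.out.println("bound, quiet run:   " + injectedIntoTest(byTable("testTable", 6), 7));
    System.out.println("bound, busy run:    " + injectedIntoTest(byTable("testTable", 6), 30));
  }
}
```

With 7 background calls the counter policy hits all six test requests, but
with 30 it hits none, which corresponds to the timeout described under
Symptom. The bound policy hits all six in both cases because background
requests never consume its budget.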