Anton Vinogradov created IGNITE-28834:
-----------------------------------------
Summary: Test load scaling (TEST_SCALE_FACTOR / GridTestUtils.SF)
is inert in RunAll and unevenly applied
Key: IGNITE-28834
URL: https://issues.apache.org/jira/browse/IGNITE-28834
Project: Ignite
Issue Type: Task
Reporter: Anton Vinogradov
Assignee: Anton Vinogradov
h2. Problem
The test-load scaling facility ({{GridTestUtils.SF}} / {{ScaleFactorUtil}}, driven by the {{TEST_SCALE_FACTOR}} system property, range [0.1, 1.0]) is intended to shrink load loops, iteration counts and time-boxed durations on CI. Two issues were found:
# _The factor never reached the test JVMs in RunAll._ The {{IgniteTests24Java8_RunAll}} configuration declared a single {{reverse.dep.*.TEST_SCALE_FACTOR = 0.1}}, but it did not propagate to the leaf test suites (intermediate composite builds break the wildcard reverse-dep). Each suite also carried its own {{TEST_SCALE_FACTOR}} (template default {{1.0}}, plus {{0.1}} on Snapshots×8 and {{0.2}} on SnapshotsWithIndexes). The forked test JVMs in actual RunAll builds received {{-DTEST_SCALE_FACTOR=1.0}} everywhere, i.e. scaling was effectively disabled and all tests ran at full size.
# _Coverage is uneven._ Measured over one RunAll run (build #9159468, 37.5h / ~70k test runs): only ~28.8% of test wall-time is in classes that use {{SF}} anywhere in their hierarchy; ~17% is in non-covered classes that have a clearly scalable load constant; ~54% is topology/fan-out (grid start/stop, parametrization) where {{SF}} does not help.
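The coverage split above can be turned into a rough time budget. The sketch below is only arithmetic on the figures quoted in this ticket and assumes that the 37.5h figure is the total test wall-time the percentages refer to; the class and method names are hypothetical. Under that assumption the three shares come to about 10.8h, 6.4h and 20.3h.

```java
/** Hypothetical helper: converts the quoted wall-time shares into hours. */
public class CoverageBudget {
    /** Hours attributable to a share (in percent) of the total wall-time. */
    static double hours(double totalHours, double sharePct) {
        return totalHours * sharePct / 100.0;
    }

    public static void main(String[] args) {
        // Assumption: 37.5h is the summed test wall-time of the measured run.
        double total = 37.5;

        System.out.printf("classes using SF:        %.2f h%n", hours(total, 28.8));
        System.out.printf("scalable, not covered:   %.2f h%n", hours(total, 17.0));
        System.out.printf("topology/fan-out:        %.2f h%n", hours(total, 54.0));
    }
}
```

Only the first two shares can respond to {{SF}} at all, which is why the remaining share is treated as a separate problem.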
h2. Solution
_Part 1 — CI configuration (done in TeamCity; settings are not versioned in the
repo):_
* Set {{TEST_SCALE_FACTOR = 0.1}} once in the {{Run tests (Java)}} template (single source of truth).
* Removed the per-suite overrides (Snapshots×8, SnapshotsWithIndexes, RollingUpgrade). RollingUpgrade uses a different template, so it keeps its own explicit {{0.1}}.
* Removed the dead {{reverse.dep.*.TEST_SCALE_FACTOR}} from RunAll.
* Result: all 154 param-bearing build configs now resolve to {{0.1}}. Still to verify: that a suite forks with {{-DTEST_SCALE_FACTOR=0.1}} on the next RunAll.
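One way to check what a forked JVM actually received is to print the raw and effective value of the property from inside it. The following is a minimal stand-alone sketch, not the {{ScaleFactorUtil}} source: the class name is hypothetical, and both the fallback to {{1.0}} when the property is unset and the clamping into [0.1, 1.0] are assumptions based on the range quoted above.

```java
/** Hypothetical probe; not the actual ScaleFactorUtil implementation. */
public class ScaleFactorProbe {
    static final String PROP = "TEST_SCALE_FACTOR";

    /**
     * Resolves the effective factor from the raw property value.
     * Assumptions: unset means 1.0 (no scaling); out-of-range values are clamped.
     */
    static double resolve(String raw) {
        double sf = raw == null ? 1.0 : Double.parseDouble(raw.trim());

        return Math.max(0.1, Math.min(1.0, sf));
    }

    public static void main(String[] args) {
        String raw = System.getProperty(PROP);

        // Shows whether -DTEST_SCALE_FACTOR=... reached this JVM at all.
        System.out.println(PROP + " raw=" + raw + ", effective=" + resolve(raw));
    }
}
```

Running it with {{java -DTEST_SCALE_FACTOR=0.1 ScaleFactorProbe}} and without the flag makes the difference between a propagated and a missing setting visible in the build log.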
_Part 2 — Add SF scaling to long-running, non-covered tests (this PR):_
Added {{GridTestUtils.SF.applyLB(...)}} (with safe lower bounds, so the load does not collapse at 0.1) to the heaviest tests whose time is dominated by scalable load loops / time-boxes:
|| Class || Scaled ||
| IgniteTxCacheWriteSynchronizationModesMultithreadedTest | load window 10s |
| TxDeadlockDetectionNoHangsTest / TxDeadlockDetectionTest | 2-min run window |
| IgniteCacheGetRestartTest | TEST_TIME, KEYS |
| CrossCacheTxRandomOperationsTest | 10s window |
| SegmentedRingByteBufferTest | 60s producer/consumer windows |
| TxPartitionCounterStateConsistencyTest | 30s restart windows |
| IgnitePdsTransactionsHangTest | DURATION (kept > warm-up) |
| CacheFreeListSelfTest | grow/shrink load (200k → LB 50k) |
| ConcurrentCheckpointAndUpdateTtlTest | checkpoint loop |
| IgniteCachePutAllRestartTest | 2-min + 60s put windows |
Lower bounds keep each scenario meaningful at 0.1 (e.g. ≥20s for
deadlock-detection,
≥10s for tx/restart windows, ≥50k entries for free-list grow/shrink).
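The effect of a lower bound can be illustrated with a small stand-alone sketch. It is not the {{GridTestUtils.SF}} code: the class is hypothetical, the factor is passed in explicitly instead of being read from {{TEST_SCALE_FACTOR}}, and rounding to the nearest integer is an assumption. The inputs are the values quoted in this ticket.

```java
/** Hypothetical illustration of "scale, but never below a floor". */
public class ApplyLbSketch {
    /**
     * Scales val by sf and keeps the result at or above lowerBound.
     * Assumption: the scaled value is rounded to the nearest integer.
     */
    static int applyLB(int val, int lowerBound, double sf) {
        return Math.max((int)Math.round(val * sf), lowerBound);
    }

    public static void main(String[] args) {
        double sf = 0.1;

        // Free-list grow/shrink: 200k entries would shrink to 20k, the floor keeps 50k.
        System.out.println("entries   = " + applyLB(200_000, 50_000, sf));

        // 2-minute deadlock-detection window in ms: 12s scaled, the floor keeps 20s.
        System.out.println("window ms = " + applyLB(120_000, 20_000, sf));

        // With no scaling the original value is unchanged.
        System.out.println("full size = " + applyLB(200_000, 50_000, 1.0));
    }
}
```

The floor only takes effect when the scaled value would fall below it, so at a factor of {{1.0}} the tests run with their original constants.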
h2. Scope / stopping criterion
Candidate selection stopped at the point where the per-test saving drops below ~1 minute. Beyond the classes above, the next tier yields ~50s/test, and the remainder is topology- or search-bound (no scalable load). Explicitly _not_ changed:
* {{GridCommandHandlerConsistencyCountersTest}}: the {{2_000}} preload is a semantic threshold ("enough for historical rebalance"), not load.
* affinity-key search loops mis-detected as load (e.g. {{CacheContinuousQueryAsyncFilterListenerTest}}, {{IgniteCacheClientNodeChangingTopologyTest}}).
* {{CacheJdbcPojoWriteBehindStoreWithCoalescingTest}}: hang/coalescing regression test where volume matters; scaling could weaken it.
h2. Validation
* {{mvn test-compile -pl modules/core}} green (JDK 17; note: JDK 21+ fails on unrelated {{Thread.suspend/resume}} usages).
* Per-test durations taken from RunAll build #9159468.
h2. Follow-up (separate tickets)
* The remaining ~54% (CDC suites, snapshot-restore) is fan-out/IO-bound. It needs reduced data volume in {{AbstractCdcTest}} / {{AbstractSnapshotSelfTest}} or suite re-balancing, to be evaluated separately for coverage impact.