[jira] [Created] (IGNITE-28834) Test load scaling (TEST_SCALE_FACTOR / GridTestUtils.SF) is inert in RunAll and unevenly applied

Anton Vinogradov (Jira) Mon, 29 Jun 2026 14:02:15 -0700

Anton Vinogradov created IGNITE-28834:
-----------------------------------------


             Summary: Test load scaling (TEST_SCALE_FACTOR / GridTestUtils.SF) 
is inert in RunAll and unevenly applied
                 Key: IGNITE-28834
                 URL: https://issues.apache.org/jira/browse/IGNITE-28834
             Project: Ignite
          Issue Type: Task
            Reporter: Anton Vinogradov
            Assignee: Anton Vinogradov


h2. Problem

The test-load scaling facility ({{GridTestUtils.SF}} / {{ScaleFactorUtil}}, driven by the
{{TEST_SCALE_FACTOR}} system property, range [0.1, 1.0]) is intended to shrink load loops,
iteration counts and time-boxed durations on CI. Two issues were found:
# _The factor never reached the test JVMs in RunAll._ The {{IgniteTests24Java8_RunAll}}
configuration declared a single {{reverse.dep.*.TEST_SCALE_FACTOR = 0.1}}, but it did not
propagate to the leaf test suites (intermediate composite builds break the wildcard
reverse-dep). Each suite also carried its own {{TEST_SCALE_FACTOR}} (template default {{1.0}},
plus {{0.1}} on Snapshots×8 and {{0.2}} on SnapshotsWithIndexes). The forked test JVMs in actual
RunAll builds received {{-DTEST_SCALE_FACTOR=1.0}} everywhere, i.e. scaling was effectively
disabled and all tests ran at full size.
# _Coverage is uneven._ Measured over one RunAll run (build #9159468, 37.5h / ~70k test runs):
only ~28.8% of test wall-time is in classes that use {{SF}} anywhere in their hierarchy. ~17%
is in non-covered classes that have a clearly scalable load constant; ~54% is topology/fan-out
(grid start/stop, parametrization) where {{SF}} does not help.

h2. Solution

_Part 1: CI configuration (done in TeamCity; settings are not versioned in the repo):_
 * Set {{TEST_SCALE_FACTOR = 0.1}} once in the {{Run tests (Java)}} template (single source of truth).
 * Removed the per-suite overrides (Snapshots×8, SnapshotsWithIndexes, RollingUpgrade).
RollingUpgrade uses a different template, so it keeps its own explicit {{0.1}}.
 * Removed the dead {{reverse.dep.*.TEST_SCALE_FACTOR}} from RunAll.
 * Result: all 154 param-bearing build configs now resolve to {{0.1}}. Still to verify on the
next RunAll: a suite forks with {{-DTEST_SCALE_FACTOR=0.1}}.
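
One way to check what a forked test JVM actually sees is to print the property from inside that JVM. The class below is a minimal sketch and not part of Ignite: the name {{ScaleFactorProbe}}, the fallback to 1.0 for a missing or unparsable value, and the clamping of out-of-range values to [0.1, 1.0] are all assumptions made for illustration. The real {{ScaleFactorUtil}} may handle these cases differently.

```java
/** Hypothetical probe (not Ignite code): reports the scale factor visible to the current JVM. */
final class ScaleFactorProbe {
    /** Range bounds taken from the ticket text. */
    static final double MIN = 0.1;
    static final double MAX = 1.0;

    /**
     * Parses a raw property value.
     * Assumption: missing or unparsable values fall back to 1.0 (no scaling),
     * and out-of-range values are clamped to [0.1, 1.0].
     */
    static double parse(String raw) {
        if (raw == null || raw.trim().isEmpty())
            return MAX;

        try {
            double v = Double.parseDouble(raw.trim());

            if (Double.isNaN(v))
                return MAX;

            return Math.max(MIN, Math.min(MAX, v));
        }
        catch (NumberFormatException e) {
            return MAX;
        }
    }

    public static void main(String[] args) {
        // Reads the value passed as -DTEST_SCALE_FACTOR=... on the JVM command line.
        String raw = System.getProperty("TEST_SCALE_FACTOR");

        System.out.println("TEST_SCALE_FACTOR raw=" + raw + ", effective=" + parse(raw));
    }
}
```

Run inside a forked test JVM (or called from a throwaway test), this makes a silently ignored setting visible: a raw value of {{null}} or {{1.0}} means the CI parameter did not arrive.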

_Part 2: add SF scaling to long-running, non-covered tests (this PR):_
Added {{GridTestUtils.SF.applyLB(...)}} (with safe lower bounds, so the load does not collapse at
0.1) to the heaviest tests whose time is dominated by scalable load loops / time-boxes:

|| Class || Scaled ||
| IgniteTxCacheWriteSynchronizationModesMultithreadedTest | load window 10s |
| TxDeadlockDetectionNoHangsTest / TxDeadlockDetectionTest | 2-min run window |
| IgniteCacheGetRestartTest | TEST_TIME, KEYS |
| CrossCacheTxRandomOperationsTest | 10s window |
| SegmentedRingByteBufferTest | 60s producer/consumer windows |
| TxPartitionCounterStateConsistencyTest | 30s restart windows |
| IgnitePdsTransactionsHangTest | DURATION (kept > warm-up) |
| CacheFreeListSelfTest | grow/shrink load (200k → LB 50k) |
| ConcurrentCheckpointAndUpdateTtlTest | checkpoint loop |
| IgniteCachePutAllRestartTest | 2-min + 60s put windows |

Lower bounds keep each scenario meaningful at 0.1 (e.g. ≥20s for 
deadlock-detection,
≥10s for tx/restart windows, ≥50k entries for free-list grow/shrink).
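
The effect of a lower bound can be illustrated with a small standalone sketch. It does not call Ignite: {{LowerBoundScaling}} and its {{applyLB}} signature (explicit factor argument, {{long}} values, round-then-max) are assumptions chosen to mirror the behaviour described above, not the actual {{GridTestUtils.SF}} implementation.

```java
/** Hypothetical stand-in for lower-bounded scaling (not the Ignite implementation). */
final class LowerBoundScaling {
    /**
     * Scales {@code val} by {@code factor}, but never returns less than {@code lowerBound}.
     * Assumption: the scaled value is rounded to the nearest integer before the bound is applied.
     */
    static long applyLB(long val, long lowerBound, double factor) {
        return Math.max(Math.round(val * factor), lowerBound);
    }

    public static void main(String[] args) {
        double factor = 0.1;

        // Free-list grow/shrink load: 200k entries would shrink to 20k, the bound keeps 50k.
        System.out.println("entries = " + applyLB(200_000, 50_000, factor));

        // Deadlock-detection run window: 120s would shrink to 12s, the bound keeps 20s.
        System.out.println("window, s = " + applyLB(120, 20, factor));
    }
}
```

At a factor of 1.0 the same calls return the original 200k entries and 120s, so the bound only takes effect when scaling would otherwise push a value below the meaningful minimum.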

h2. Scope / stopping criterion

Candidate selection stopped at the point where the per-test saving drops below ~1 minute.
Beyond the classes above, the next tier yields ~50s/test, and the remainder is topology- or
search-bound (no scalable load). Explicitly _not_ changed:
 * {{GridCommandHandlerConsistencyCountersTest}}: the {{2_000}} preload is a semantic threshold
("enough for historical rebalance"), not load.
 * affinity-key search loops mis-detected as load (e.g.
{{CacheContinuousQueryAsyncFilterListenerTest}}, {{IgniteCacheClientNodeChangingTopologyTest}}).
 * {{CacheJdbcPojoWriteBehindStoreWithCoalescingTest}}: hang/coalescing regression test where
volume matters; scaling could weaken it.

h2. Validation
 * {{mvn test-compile -pl modules/core}} green (JDK 17; note: JDK 21+ fails on unrelated
{{Thread.suspend/resume}} usages).
 * Per-test durations taken from RunAll build #9159468.

h2. Follow-up (separate tickets)
 * The remaining ~54% (CDC suites, snapshot-restore) is fan-out/IO-bound. It needs reduced data
volume in {{AbstractCdcTest}} / {{AbstractSnapshotSelfTest}} or suite re-balancing, evaluated
for coverage impact separately.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
