[
https://issues.apache.org/jira/browse/IGNITE-28834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18092608#comment-18092608
]
Ignite TC Bot commented on IGNITE-28834:
----------------------------------------
{panel:title=Branch: [pull/13294/head] Base: [master] : No blockers found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#D6F7C1}{panel}
{panel:title=Branch: [pull/13294/head] Base: [master] : No new tests found!|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1}{panel}
[TeamCity *--> Run :: All* Results|https://ci2.ignite.apache.org/viewLog.html?buildId=9161408&buildTypeId=IgniteTests24Java8_RunAll]
{color:#ffffff}tcbot-analysis-comment chainBuildId=9161408
rerunBuildIds=none{color}
> Test load scaling (TEST_SCALE_FACTOR / GridTestUtils.SF) is inert in RunAll
> and unevenly applied
> ------------------------------------------------------------------------------------------------
>
> Key: IGNITE-28834
> URL: https://issues.apache.org/jira/browse/IGNITE-28834
> Project: Ignite
> Issue Type: Task
> Reporter: Anton Vinogradov
> Assignee: Anton Vinogradov
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> h2. Problem
> The test-load scaling facility ({{GridTestUtils.SF}} / {{ScaleFactorUtil}},
> driven by the {{TEST_SCALE_FACTOR}} system property, range [0.1, 1.0]) is
> intended to shrink load loops, iteration counts and time-boxed durations on
> CI. Two issues were found:
> # _The factor never reached the test JVMs in RunAll._ The
> {{IgniteTests24Java8_RunAll}} configuration declared a single
> {{reverse.dep.*.TEST_SCALE_FACTOR = 0.1}}, but it did not propagate to the
> leaf test suites (intermediate composite builds break the wildcard
> reverse-dep). Each suite also carried its own {{TEST_SCALE_FACTOR}}
> (template default {{1.0}}, plus {{0.1}} on Snapshots×8 and {{0.2}} on
> SnapshotsWithIndexes). The forked test JVMs in actual RunAll builds
> received {{-DTEST_SCALE_FACTOR=1.0}} everywhere, i.e. scaling was
> effectively disabled and all tests ran at full size.
> # _Coverage is uneven._ Measured over one RunAll run (build #9159468, 37.5h
> / ~70k test runs): only ~28.8% of test wall-time is in classes that use
> {{SF}} anywhere in their hierarchy. ~17% is in non-covered classes that
> have a clearly scalable load constant; ~54% is topology/fan-out (grid
> start/stop, parametrization) where {{SF}} does not help.
> h2. Solution
> _Part 1 — CI configuration (done in TeamCity; settings are not versioned in
> the repo):_
> * Set {{TEST_SCALE_FACTOR = 0.1}} once in the {{Run tests (Java)}}
> template (single source of truth).
> * Removed the per-suite overrides (Snapshots×8, SnapshotsWithIndexes,
> RollingUpgrade).
> RollingUpgrade uses a different template, so it keeps its own explicit
> {{0.1}}.
> * Removed the dead {{reverse.dep.*.TEST_SCALE_FACTOR}} from RunAll.
> * Result: all 154 param-bearing build configs now resolve to {{0.1}}.
> To verify, check that a suite forks with {{-DTEST_SCALE_FACTOR=0.1}} on
> the next RunAll.
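For readers unfamiliar with how the property reaches a test, here is a minimal standalone sketch of resolving {{TEST_SCALE_FACTOR}} inside a forked JVM. It is not Ignite's {{ScaleFactorUtil}}: the class name {{ScaleFactorReader}}, the fallback to 1.0 when the property is unset or unparsable, and the clamping of out-of-range values to [0.1, 1.0] are all assumptions made for illustration.

```java
/** Hypothetical helper for illustration only; not part of Ignite. */
final class ScaleFactorReader {
    /** System property name taken from the ticket. */
    static final String PROP = "TEST_SCALE_FACTOR";

    /** Range stated in the ticket: [0.1, 1.0]. */
    static final double MIN = 0.1;
    static final double MAX = 1.0;

    /** Parses a raw property value; falls back to MAX (no scaling) if unusable (assumption). */
    static double parse(String raw) {
        if (raw == null || raw.trim().isEmpty())
            return MAX;

        double v;

        try {
            v = Double.parseDouble(raw.trim());
        }
        catch (NumberFormatException e) {
            return MAX;
        }

        if (Double.isNaN(v))
            return MAX;

        // Clamp into the documented range (assumed behaviour for out-of-range input).
        return Math.max(MIN, Math.min(MAX, v));
    }

    /** Reads the factor the way a JVM started with -DTEST_SCALE_FACTOR=... would see it. */
    static double fromSystemProperty() {
        return parse(System.getProperty(PROP));
    }

    public static void main(String[] args) {
        System.out.println(PROP + " resolves to " + fromSystemProperty());
    }
}
```

Running this with {{-DTEST_SCALE_FACTOR=0.1}} on the command line should report 0.1; running it without the flag falls back to 1.0 under the assumptions above, which mirrors the "scaling effectively disabled" symptom described in the ticket.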
> _Part 2 — Add SF scaling to long-running, non-covered tests (this PR):_
> Added {{GridTestUtils.SF.applyLB(...)}} (with safe lower bounds, so the load
> does not collapse at 0.1) to the heaviest tests whose time is dominated by
> scalable load loops / time-boxes:
> || Class || Scaled ||
> | IgniteTxCacheWriteSynchronizationModesMultithreadedTest | load window 10s |
> | TxDeadlockDetectionNoHangsTest / TxDeadlockDetectionTest | 2-min run window |
> | IgniteCacheGetRestartTest | TEST_TIME, KEYS |
> | CrossCacheTxRandomOperationsTest | 10s window |
> | SegmentedRingByteBufferTest | 60s producer/consumer windows |
> | TxPartitionCounterStateConsistencyTest | 30s restart windows |
> | IgnitePdsTransactionsHangTest | DURATION (kept > warm-up) |
> | CacheFreeListSelfTest | grow/shrink load (200k → LB 50k) |
> | ConcurrentCheckpointAndUpdateTtlTest | checkpoint loop |
> | IgniteCachePutAllRestartTest | 2-min + 60s put windows |
> Lower bounds keep each scenario meaningful at 0.1 (e.g. ≥20s for
> deadlock-detection,
> ≥10s for tx/restart windows, ≥50k entries for free-list grow/shrink).
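The effect of the lower bounds can be illustrated with a small standalone sketch. It is not Ignite's implementation of {{GridTestUtils.SF.applyLB(...)}}: the class {{LowerBoundScaling}}, the explicit {{factor}} parameter and the round-to-nearest behaviour are assumptions; only the input values (200k entries, 2-minute and 30s windows) and the bounds (50k, 20s, 10s) come from the description above.

```java
/** Hypothetical sketch of scaling with a lower bound; not Ignite's ScaleFactorUtil. */
final class LowerBoundScaling {
    /** Scales val by factor, rounding to the nearest int (rounding mode is an assumption). */
    static int apply(int val, double factor) {
        return (int) Math.round(val * factor);
    }

    /** Scales val by factor but never returns less than lowerBound. */
    static int applyLB(int val, int lowerBound, double factor) {
        return Math.max(apply(val, factor), lowerBound);
    }

    public static void main(String[] args) {
        double sf = 0.1; // The factor set in the CI template.

        // Free-list grow/shrink load: 200k entries, lower bound 50k.
        System.out.println("entries = " + applyLB(200_000, 50_000, sf));

        // Deadlock-detection run window: 2 minutes, lower bound 20s.
        System.out.println("deadlock window, s = " + applyLB(120, 20, sf));

        // Restart window: 30s, lower bound 10s.
        System.out.println("restart window, s = " + applyLB(30, 10, sf));
    }
}
```

Without the bounds, a factor of 0.1 would shrink these to 20k entries, 12s and 3s; with them the sketch yields 50k entries, 20s and 10s, which is the sense in which the load "does not collapse at 0.1". At a factor of 1.0 the bounds are inactive and the original values are returned.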
> h2. Scope / stopping criterion
> Candidate selection stopped once the per-test saving dropped below ~1
> minute.
> Beyond the classes above, the next tier yields ~50s/test, and the remainder
> is topology- or search-bound (no scalable load). Explicitly _not_ changed:
> * {{GridCommandHandlerConsistencyCountersTest}}: the {{2_000}} preload is
> a semantic threshold ("enough for historical rebalance"), not load.
> * Affinity-key search loops mis-detected as load (e.g.
> {{CacheContinuousQueryAsyncFilterListenerTest}},
> {{IgniteCacheClientNodeChangingTopologyTest}}).
> * {{CacheJdbcPojoWriteBehindStoreWithCoalescingTest}}: a hang/coalescing
> regression test where volume matters; scaling could weaken it.
> h2. Validation
> * {{mvn test-compile -pl modules/core}} green (JDK 17; note: JDK 21+ fails
> on unrelated {{Thread.suspend/resume}} usages).
> * Per-test durations taken from RunAll build #9159468.
> h2. Follow-up (separate tickets)
> * The remaining ~54% (CDC suites, snapshot-restore) is fan-out/IO-bound. It
> needs reduced data volume in {{AbstractCdcTest}} /
> {{AbstractSnapshotSelfTest}} or suite re-balancing, to be evaluated
> separately for coverage impact.