Hi all,
I just opened a PR to speed up the Multi-Cluster IT pipeline. The background, approach, and measured results are below; reviews welcome:
https://github.com/apache/iotdb/pull/17695

== Background ==

`Multi-Cluster IT` runs 11 parallel jobs per PR. Five of them are dual-cluster jobs in HighPerformanceMode (2 clusters × 4 nodes = 8 nodes per test). The longest one took ~63 min and almost single-handedly dictated the workflow's wall clock. The other 6 jobs all finished in ~5-8 min.

== Approach ==

This reuses the sharding pattern that PR #17692 introduced in cluster-it-1c1d.yml: each of the 5 dual jobs gets a `shard: [0, 1, 2]` matrix dimension, the IT class list is split by hash-mod into 3 parallel shards, each shard's list is written to `$RUNNER_TEMP/it-shard.txt`, and that file is passed to failsafe via `-Dfailsafe.includesFile`.

Only `.github/workflows/pipe-it.yml` was touched; no test code changed (+110 / -5 lines).

== Measured results ==

Multi-Cluster IT workflow wall clock: ~63 min → ~33 min (~1.9× speedup, ~30 min saved per PR).
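Before the per-job numbers, a rough illustration of the hash-mod split described under Approach. This is a minimal sketch, not the code in the PR: the choice of CRC32 as the hash, the `**/Name.java` pattern format, the per-shard file names, and every class name below are assumptions made for illustration.

```python
import os
import tempfile
import zlib

# Hypothetical IT class names, invented for this sketch.
IT_CLASSES = [
    "IoTDBPipeExampleAlphaIT",
    "IoTDBPipeExampleBetaIT",
    "IoTDBPipeExampleGammaIT",
    "IoTDBPipeExampleDeltaIT",
    "IoTDBPipeExampleEpsilonIT",
    "IoTDBPipeExampleZetaIT",
]


def shard_of(class_name, num_shards=3):
    """Stable shard index: CRC32 of the class name modulo the shard count."""
    return zlib.crc32(class_name.encode("utf-8")) % num_shards


def classes_for_shard(class_names, shard, num_shards=3):
    """Return the classes assigned to one shard, in alphabetical order."""
    return [c for c in sorted(class_names) if shard_of(c, num_shards) == shard]


def write_includes_file(class_names, shard, path, num_shards=3):
    """Write one include pattern per line for the given shard."""
    selected = classes_for_shard(class_names, shard, num_shards)
    with open(path, "w", encoding="utf-8") as f:
        for name in selected:
            f.write(f"**/{name}.java\n")
    return selected


if __name__ == "__main__":
    # On CI each matrix job would write only its own shard; here all three
    # are written to separate files so the split can be inspected locally.
    out_dir = os.environ.get("RUNNER_TEMP", tempfile.gettempdir())
    for shard in range(3):
        path = os.path.join(out_dir, f"it-shard-{shard}.txt")
        chosen = write_includes_file(IT_CLASSES, shard, path)
        print(f"shard {shard}: {len(chosen)} classes -> {path}")
```

Because the assignment depends only on the class name, every matrix job computes the same partition without coordination, which is also why the split cannot take per-class runtime into account.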
Per-job measurements:

Job                         Before   After    Speedup
--------------------------  -------  -------  -------
dual-table-manual-basic     ~63 min  ~33 min  1.9×
dual-table-manual-enhanced  ~62 min  ~31 min  2.0×
dual-tree-auto-enhanced     ~51 min  ~33 min  1.5×
dual-tree-auto-basic        ~42 min  ~25 min  1.7×
dual-tree-manual            ~27 min  ~15 min  1.8×

All 15 shards across the 5 jobs passed. RAT raised no "Files with unapproved licenses" warning (the shard file lives under `$RUNNER_TEMP`, outside the repo). Per-shard class counts on CI matched the local preview exactly (4/4/4, 3/3/3, 3/4/4, 4/5/4, 3/4/4).

== Note on actual vs. theoretical speedup ==

A 3-shard split should ideally give a ~3× speedup, but we measured ~1.9×. The reason is that hash-mod on class names (taken in alphabetical order) ignores per-class cost, so some shard always ends up with the heavyweight classes. For example, `dual-tree-auto-basic` shard 0 took 25 min while shard 2 took only 9 min, and the wall clock is bounded by the slowest shard.

To squeeze further, follow-ups could include:
- cost-weighted bin-packing by historical per-class duration (instead of a name-based hash)
- more shards per job (4 or 5) to reduce variance
- cluster reuse within a shard (cf. AINodeSharedClusterIT from PR #17687)

But each of those either significantly increases runner usage or requires test-code changes, so this PR takes the lowest-risk path for a ~2× win.

Reviews welcome.

Best regards,
----------------
Yuan Tian
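
P.S. To make the first follow-up (cost-weighted bin-packing) a bit more concrete, here is a rough sketch using the greedy longest-processing-time-first heuristic. It is not part of the PR: `pack_by_cost` is a hypothetical helper, and the class names and per-class durations are invented, not measured.

```python
import heapq

# Invented per-class durations in minutes, for illustration only.
EXAMPLE_DURATIONS = {
    "HeavyAIT": 12,
    "HeavyBIT": 10,
    "MediumIT": 6,
    "LightAIT": 3,
    "LightBIT": 2,
    "LightCIT": 1,
}


def pack_by_cost(durations, num_shards=3):
    """Assign classes to shards, longest first, always to the lightest shard.

    Returns (shards, loads): the class names per shard and each shard's
    total estimated duration.
    """
    # Min-heap of (current load, shard index); ties go to the lower index.
    heap = [(0.0, i) for i in range(num_shards)]
    heapq.heapify(heap)
    shards = [[] for _ in range(num_shards)]

    # Sort by descending duration, then by name so the result is deterministic.
    ordered = sorted(durations.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, minutes in ordered:
        load, idx = heapq.heappop(heap)
        shards[idx].append(name)
        heapq.heappush(heap, (load + minutes, idx))

    loads = [0.0] * num_shards
    for load, idx in heap:
        loads[idx] = load
    return shards, loads


if __name__ == "__main__":
    shards, loads = pack_by_cost(EXAMPLE_DURATIONS)
    for i, (names, load) in enumerate(zip(shards, loads)):
        print(f"shard {i}: {load:.0f} min  {names}")
    print(f"estimated wall clock: {max(loads):.0f} min")
```

The heuristic is not optimal in general, but it only needs a per-class duration table, which would have to be collected from past CI runs and kept up to date.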
