Hi all,

Following up on the AINode CI speedup shared earlier this month, I have another round of CI improvements ready in PR #17692 (https://github.com/apache/iotdb/pull/17692).
---

Goal

Two PR-check pipelines, Cluster IT - 1C1D and Table Cluster IT - 1C1D, were dragged down by their Windows runners, which take 67–77% longer than Ubuntu for the same workload. Pipeline wall clock is max(Ubuntu, Windows), so even though Ubuntu was already fast (~49 / ~39 min), Windows pulled the totals up to ~87 / ~65 min.

---

Approach

Split each Windows job into 3 parallel matrix shards. Each shard runs ~1/3 of the IT classes selected by category annotation (LocalStandaloneIT / TableLocalStandaloneIT), distributed by hash-mod. Ubuntu stays as a single job: it wasn't the bottleneck, so sharding it would just add matrix scheduling overhead.

The shard list is generated at runtime into $RUNNER_TEMP/it-shard.txt and consumed via Maven's -Dfailsafe.includesFile. Passing a file instead of a long list of class names avoids Windows' command-line length limit, so the approach needs no changes as the test suite grows.

---

Results

┌─────────────────────────┬─────────┬─────────┬────────┐
│ Pipeline                │ Before  │ After   │ Change │
├─────────────────────────┼─────────┼─────────┼────────┤
│ Cluster IT - 1C1D       │ ~87 min │ ~48 min │ −45%   │
├─────────────────────────┼─────────┼─────────┼────────┤
│ Table Cluster IT - 1C1D │ ~65 min │ ~40 min │ −38%   │
└─────────────────────────┴─────────┴─────────┴────────┘

Both pipelines are now capped by Ubuntu: the Windows shards finish 10–16 min ahead.
3-way sharding is the sweet spot; going to 4 or 5 shards would only add matrix scheduling cost without reducing wall clock, unless the Ubuntu side is optimized next.

---

Pitfalls worth sharing

For anyone applying a similar approach elsewhere, two non-obvious bugs came up during this work:

1. find ... | xargs -0 grep -l exits 123 on Windows Git Bash.
   Windows has a much smaller ARG_MAX than Linux, so xargs splits the file list into several batches. Any batch with zero matches makes grep return 1 → xargs returns 123 → set -o pipefail fails the step.
   Fix: use a single grep -rl --include=... call instead.

2. Apache RAT flags any generated file inside the repo.
   Our first attempt wrote integration-test/it-shard.txt, which RAT reported as "unapproved license". target/ is RAT-excluded, but mvn clean would wipe the file before it's read.
   Fix: write to $RUNNER_TEMP/it-shard.txt, the runner's temp directory, which lives outside the repo entirely.

---

What's next

A preview of where I'd like to look in future rounds:

- Unit-Test on Windows (~47 min, ~23% slower than Ubuntu): the gap is smaller than for the IT pipelines, but if anyone knows why Surefire runs slower on Windows, happy to chat.
- Multi-Cluster IT: not yet profiled; I suspect cluster startup overhead.
- AINode cold build still costs ~7 min when the PyInstaller cache misses; it is worth trying to raise the cache hit rate across PRs/branches.

Reviews on the PR are very welcome. Happy to walk through any of the details online if useful.

Best regards,
Yuan Tian
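
P.S. For anyone who wants to try the same idea elsewhere, here is a minimal bash sketch of the shard-selection step. It is not the script from the PR. The function names, the cksum-based hash, the assumption that IT classes reference the category as `LocalStandaloneIT.class` inside a @Category annotation, and the source path in the usage comment are all illustrative assumptions. It does show the two fixes described above: a single recursive grep instead of find | xargs, and an output file that the caller can place under $RUNNER_TEMP.

```shell
#!/usr/bin/env bash
set -euo pipefail

# shard_of CLASS_NAME SHARD_COUNT
# Prints a stable shard index in [0, SHARD_COUNT) for a class name.
# Assumption: cksum's CRC is used as the hash; any stable hash would do.
shard_of() {
  local name="$1" count="$2" sum
  sum=$(printf '%s' "$name" | cksum | cut -d ' ' -f 1)
  echo $(( sum % count ))
}

# list_it_classes SRC_ROOT CATEGORY
# Prints the Java files that mention "<CATEGORY>.class", one per line.
# A single recursive grep avoids xargs batching. -w keeps LocalStandaloneIT
# from also matching TableLocalStandaloneIT. Exit code 1 (no match) is
# tolerated so that pipefail does not fail the step; real errors still do.
list_it_classes() {
  local root="$1" category="$2"
  { grep -rlwF --include='*.java' -- "${category}.class" "$root" \
      || [ "$?" -eq 1 ]; } | LC_ALL=C sort
}

# write_shard_file SRC_ROOT CATEGORY SHARD_INDEX SHARD_COUNT OUT_FILE
# Writes one include pattern per selected class, in the form accepted by
# Maven's -Dfailsafe.includesFile.
write_shard_file() {
  local root="$1" category="$2" index="$3" count="$4" out="$5"
  local file class
  : > "$out"
  while IFS= read -r file; do
    class=$(basename "$file" .java)
    if [ "$(shard_of "$class" "$count")" -eq "$index" ]; then
      printf '**/%s.java\n' "$class" >> "$out"
    fi
  done < <(list_it_classes "$root" "$category")
}

# Hypothetical usage inside a CI step (SHARD_INDEX would come from the matrix,
# and the source path is a guess at the module layout):
#   write_shard_file integration-test/src/test/java LocalStandaloneIT \
#       "$SHARD_INDEX" 3 "$RUNNER_TEMP/it-shard.txt"
#   mvn verify -Dfailsafe.includesFile="$RUNNER_TEMP/it-shard.txt"
```

Because the index depends only on the class name, every class lands in exactly one shard, and adding a new IT class does not move existing classes between shards. Hash-mod does not balance by runtime, though, so shard durations can still differ.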
