Hi all,

I'd like to share a recent optimization to the Cluster IT - 1C1D1A (AINode)
pipeline that cut its runtime from ~52 minutes to ~27 minutes (a 48%
reduction).[1]

---
Goal

The AINode CI was the bottleneck of the overall PR check pipeline, often
running 50+ minutes while other jobs finished in 30 minutes or less. Log
analysis showed that cluster startup accounted for 74% of the IT phase
(about 25 of the 34 minutes spent on startup plus test execution). Measured
against the whole ~52-minute run, PyInstaller packaging of the AINode
Python binary took about 22% and actual test execution only about 17%.

---
Approach

Two targeted changes (PR #17687):

1. Test consolidation: shared cluster across test classes

Previously, each of the 7 AINode IT test classes started its own 1C+1D+1A
cluster, for 8 cluster startups per run in total (AINodeClusterConfigIT
used @Before/@After, so it started one cluster per @Test method). I merged
the 5 compatible classes (DeviceManage, ModelManage, CallInference,
Forecast, InstanceManagement) into a single AINodeSharedClusterIT that uses
@BeforeClass/@AfterClass, so all 15 of their test methods share one
cluster. AINodeClusterConfigIT was also converted to a class-level
lifecycle. AINodeConcurrentForecastIT stays separate because it needs a
different data setup and generates heavy concurrent load.

Cluster startups: 8 → 3.
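
To illustrate why the lifecycle change matters, here is a minimal sketch
using Python's unittest as an analogy; it is not the JUnit code from the
PR. FakeCluster and both test classes are hypothetical stand-ins:
setUp/tearDown play the role of @Before/@After, and
setUpClass/tearDownClass the role of @BeforeClass/@AfterClass.

```python
import io
import unittest


class FakeCluster:
    """Hypothetical stand-in for a 1C+1D+1A cluster; it only counts startups."""

    startups = 0

    def __init__(self):
        FakeCluster.startups += 1

    def close(self):
        pass


class PerMethodIT(unittest.TestCase):
    """One cluster per test method (analogous to @Before/@After)."""

    def setUp(self):
        self.cluster = FakeCluster()

    def tearDown(self):
        self.cluster.close()

    def test_a(self):
        self.assertIsNotNone(self.cluster)

    def test_b(self):
        self.assertIsNotNone(self.cluster)


class SharedClusterIT(unittest.TestCase):
    """One cluster per class (analogous to @BeforeClass/@AfterClass)."""

    @classmethod
    def setUpClass(cls):
        cls.cluster = FakeCluster()

    @classmethod
    def tearDownClass(cls):
        cls.cluster.close()

    def test_a(self):
        self.assertIsNotNone(self.cluster)

    def test_b(self):
        self.assertIsNotNone(self.cluster)


def count_startups(test_class):
    """Run one test class and return how many clusters it started."""
    FakeCluster.startups = 0
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_class)
    # Send the runner's report to a buffer to keep the output quiet.
    result = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)
    assert result.wasSuccessful()
    return FakeCluster.startups


per_method = count_startups(PerMethodIT)   # one startup per test method
shared = count_startups(SharedClusterIT)   # one startup for the whole class
```

The trade-off is that tests sharing a cluster must tolerate whatever state
earlier tests left behind, which is why only the compatible classes were
merged.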

2. PyInstaller dist caching: skip rebuild when source unchanged

PyInstaller's Analysis phase scans thousands of hidden imports from
torch/transformers/numpy and takes ~10 minutes per run, even though the
AINode source rarely changes between PRs. build_binary.py now computes a
SHA256 hash over all relevant inputs (Python sources, the .spec file,
pyproject.toml, poetry.lock, the copied client-py sources, and the Python
interpreter version) and caches the dist/ output under
~/.cache/iotdb-ainode-build/dist-cache/. On a cache hit, the dist is
restored directly and PyInstaller is skipped entirely.

This works because the pipeline runs on a self-hosted runner (ci-182) where
/root/.cache/ persists across runs, so the cache can be reused between CI
invocations.
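
For readers who want the idea in code, the content-hash cache could be
sketched roughly as follows. This is not the actual build_binary.py: the
function names, the use of sys.version for the interpreter version, and
the build callback are all assumptions made for illustration.

```python
import hashlib
import shutil
import sys
from pathlib import Path


def compute_cache_key(input_files, root):
    """Hash the interpreter version plus each file's relative path and content."""
    digest = hashlib.sha256()
    # Assumption: sys.version stands in for "the Python interpreter version".
    digest.update(sys.version.encode("utf-8"))
    for path in sorted(Path(p) for p in input_files):
        # Include the relative path so that renames change the key too.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def restore_or_build(key, dist_dir, cache_root, build):
    """Restore dist_dir from the cache on a hit; otherwise build and store it.

    Returns True on a cache hit and False on a miss.
    """
    cached = Path(cache_root) / key
    dist_dir = Path(dist_dir)
    if cached.is_dir():
        shutil.rmtree(dist_dir, ignore_errors=True)
        shutil.copytree(cached, dist_dir)
        return True
    build(dist_dir)  # hypothetical callback that would run PyInstaller
    shutil.copytree(dist_dir, cached)
    return False
```

Keying the cache directory by the hash means a stale entry is never
restored: any change to a hashed input simply produces a new key and
therefore a cache miss.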

---
Results

┌──────────────────────────┬──────────┬────────────────────┐
│ Phase                    │ Original │ Optimized          │
├──────────────────────────┼──────────┼────────────────────┤
│ Cluster startups (total) │ 25 min   │ 11 min             │
├──────────────────────────┼──────────┼────────────────────┤
│ PyInstaller packaging    │ 11 min   │ <1 min (cache hit) │
├──────────────────────────┼──────────┼────────────────────┤
│ Actual test execution    │ 9 min    │ 7 min              │
├──────────────────────────┼──────────┼────────────────────┤
│ Maven build / overhead   │ 7 min    │ 8 min              │
├──────────────────────────┼──────────┼────────────────────┤
│ Total                    │ ~52 min  │ ~27 min            │
└──────────────────────────┴──────────┴────────────────────┘

When the AINode source changes (cache miss), the run still benefits from
test consolidation alone and lands at around 37 min (-29%).
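
The totals and percentages can be re-derived from the per-phase numbers.
The short check below uses the table's values, treats "<1 min" as 1 minute,
and assumes that a cache miss costs the original 11 minutes of packaging
again.

```python
# Phase durations in minutes, taken from the results table.
original = {"cluster": 25, "pyinstaller": 11, "tests": 9, "maven": 7}
# "<1 min" on a cache hit is rounded up to 1 minute here.
optimized_hit = {"cluster": 11, "pyinstaller": 1, "tests": 7, "maven": 8}
# Assumption: on a cache miss PyInstaller costs the original 11 minutes again.
optimized_miss = dict(optimized_hit, pyinstaller=original["pyinstaller"])


def reduction_percent(before, after):
    """Percentage by which the runtime shrank, rounded to a whole percent."""
    return round(100 * (before - after) / before)


total_before = sum(original.values())
total_hit = sum(optimized_hit.values())
total_miss = sum(optimized_miss.values())

hit_saving = reduction_percent(total_before, total_hit)
miss_saving = reduction_percent(total_before, total_miss)
```

This gives 52, 27, and 37 minutes, and reductions of 48% and 29%, matching
the figures quoted above.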

Notes:
- No test was deleted; only the class organization changed. All 18 original
test methods still run.
- The cache key includes the Python interpreter version, so interpreter
upgrades invalidate the cache automatically.

PR: https://github.com/apache/iotdb/pull/17687

---
What's Next

With 1C1D1A optimized, the remaining CI bottlenecks (based on recent runs)
are:

┌─────────────────────────┬───────────────┬─────────────────────────────────────────────────────────┐
│ Workflow                │ Avg. Duration │ Main Bottleneck                                         │
├─────────────────────────┼───────────────┼─────────────────────────────────────────────────────────┤
│ Cluster IT - 1C1D       │ ~89 min       │ Windows job (106 min) is 2× slower than Ubuntu (49 min) │
├─────────────────────────┼───────────────┼─────────────────────────────────────────────────────────┤
│ Multi-Cluster IT        │ ~69 min       │ dual-table-manual-basic/enhanced jobs (~64 min each)    │
├─────────────────────────┼───────────────┼─────────────────────────────────────────────────────────┤
│ Table Cluster IT - 1C1D │ ~63 min       │ Same Windows slowness as 1C1D                           │
├─────────────────────────┼───────────────┼─────────────────────────────────────────────────────────┤
│ Sonar-Codecov           │ ~62 min       │ codecov job 64 min vs sonar only 8 min                  │
├─────────────────────────┼───────────────┼─────────────────────────────────────────────────────────┤
│ Unit-Test               │ ~51 min       │ Windows datanode job (53 min)                           │
└─────────────────────────┴───────────────┴─────────────────────────────────────────────────────────┘

Possible directions I'm considering:

- Sharding the Windows IT jobs via a matrix to parallelize the slowest
Windows runs across multiple GitHub-hosted VMs.
- Splitting the codecov job or enabling incremental coverage to bring
Sonar-Codecov down to ~10 min.
- Further consolidation in Multi-Cluster IT to shorten the two long-running
dual-table jobs.
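
As a rough illustration of the sharding idea, each matrix job could pick
its share of the test classes deterministically from its shard index. The
function below is only a sketch under that assumption; select_shard and its
parameters are hypothetical and not part of the current CI.

```python
def select_shard(test_classes, shard_index, shard_count):
    """Return the test classes that the given shard should run."""
    if not 0 <= shard_index < shard_count:
        raise ValueError("shard_index must be in [0, shard_count)")
    # Sort first so that every VM computes the same assignment.
    ordered = sorted(test_classes)
    # Round-robin: shard i takes items i, i + n, i + 2n, ...
    return ordered[shard_index::shard_count]
```

Round-robin assignment balances the number of classes per shard, not their
runtime, so a real setup would probably need to weight classes by their
measured duration.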

Suggestions, edge cases, and counter-arguments are all very welcome.


[1] https://github.com/apache/iotdb/pull/17687

Best regards,

Yuan Tian
