This is an automated email from the ASF dual-hosted git repository.

JackieTien97 pushed a commit to branch shard-subscription-consumer-it
in repository https://gitbox.apache.org/repos/asf/iotdb.git

commit b996d09e21656fcd801ea0f8f04f5b332c556b06
Author: JackieTien97 <[email protected]>
AuthorDate: Sun May 17 09:51:15 2026 +0800

Shard subscription-tree-regression-consumer to speed up Multi-Cluster IT

The Multi-Cluster IT pipeline (pipe-it.yml) runs 11 parallel jobs on
every PR. Of those, subscription-tree-regression-consumer is the longest
pole: 72 IT classes annotated with
@Category(MultiClusterIT2SubscriptionTreeRegressionConsumer.class), each
restarting two ScalableSingleNodeMode clusters in setUp(), executed
serially in a single forkCount=1 JVM. Estimated wall clock ~30-45 min,
while every other job in the workflow finishes in ~10-20 min.

Split this job into 3 parallel matrix shards using the same hash-mod
pattern that cluster-it-1c1d.yml introduced (commits 89748f1ff6,
a343cf50e3, 02ef20af29). Each shard runs ~24 of the 72 classes and is
expected to finish in ~12-18 min, removing this job as the workflow's
bottleneck. The shard list is written to $RUNNER_TEMP/it-shard.txt for
the same RAT-avoidance reason as 1C1D.

Two deviations from the 1C1D pattern:

1. The shard list emits paths relative to src/test/java/ (e.g.,
   org/apache/iotdb/.../IoTDBFooIT.java) instead of bare class names.
   This suite has 6 pairs of duplicate simple names across
   pushconsumer/multi/ and pullconsumer/multi/ (e.g.,
   IoTDBOneConsumerMultiTopicsTsfileIT exists in both). Bare names would
   cause failsafe to match both files for each entry, running those 6
   classes twice across shards.

2. The other subscription / dual-cluster jobs in this workflow are not
   sharded. subscription-tree-regression-misc (13 classes) is
   borderline; arch-verification jobs (1-4 classes each) and
   dual-tree/dual-table jobs (9-13 classes) are well under the new shard
   wall clock and would not benefit. Revisit if any of them becomes the
   new long pole.

Local counts on macOS:
- Total classes matching the annotation: 72
- Per-shard distribution after hash-mod: 24/24/24
- Unique paths after sed normalization: 72 (no collisions)

---
 .github/workflows/pipe-it.yml | 35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/.github/workflows/pipe-it.yml b/.github/workflows/pipe-it.yml
index 0968e7739a0..29bbf365e42 100644
--- a/.github/workflows/pipe-it.yml
+++ b/.github/workflows/pipe-it.yml
@@ -548,6 +548,8 @@ jobs:
           name: cluster-log-subscription-table-arch-verification-java${{ matrix.java }}-${{ runner.os }}-${{ matrix.cluster1 }}-${{ matrix.cluster2 }}
           path: integration-test/target/cluster-logs
           retention-days: 30
+  # 72 IT classes split across 3 parallel shards to cut the longest-pole job
+  # from ~30-45 min to ~12-18 min. See cluster-it-1c1d.yml for the prior art.
   subscription-tree-regression-consumer:
     strategy:
       fail-fast: false
@@ -558,6 +560,7 @@ jobs:
         cluster1: [ScalableSingleNodeMode]
         cluster2: [ScalableSingleNodeMode]
         os: [ubuntu-latest]
+        shard: [0, 1, 2]
     runs-on: ${{ matrix.os }}
     steps:
       - uses: actions/checkout@v5
@@ -577,6 +580,29 @@ jobs:
       - name: Sleep for a random duration between 0 and 10000 milliseconds
         run: |
           sleep $(( $(( RANDOM % 10000 + 1 )) / 1000))
+      - name: Build IT shard list
+        shell: bash
+        # Distribute MultiClusterIT2SubscriptionTreeRegressionConsumer test classes
+        # across 3 shards using hash-mod assignment. The list is written under
+        # $RUNNER_TEMP (outside the repo) so Apache RAT's license check does not
+        # flag it - see cluster-it-1c1d.yml, which uses the same path for the
+        # same reason. Each runner has its own $RUNNER_TEMP, so this workflow
+        # and the 1C1D one writing to the same filename never collide.
+        # We emit paths relative to src/test/java/ (not bare class names like
+        # cluster-it-1c1d.yml does) because this suite has 6 pairs of duplicate
+        # simple names across pushconsumer/multi/ and pullconsumer/multi/ - bare
+        # names would cause those classes to run twice across shards.
+        run: |
+          set -euo pipefail
+          SHARD=${{ matrix.shard }}
+          TOTAL=3
+          grep -rlE --include='*IT.java' '\bMultiClusterIT2SubscriptionTreeRegressionConsumer\b' integration-test/src/test/java \
+            | sed 's|.*/src/test/java/||' \
+            | sort \
+            | awk -v s=$SHARD -v t=$TOTAL 'NR%t==s' \
+            > "$RUNNER_TEMP/it-shard.txt"
+          echo "Shard $SHARD/$TOTAL contains $(wc -l < "$RUNNER_TEMP/it-shard.txt") test classes"
+          head -5 "$RUNNER_TEMP/it-shard.txt"
       - name: IT Test
         shell: bash
         # we do not compile client-cpp for saving time, it is tested in client.yml
@@ -594,12 +620,15 @@ jobs:
               -DskipUTs \
               -DintegrationTest.forkCount=1 -DConfigNodeMaxHeapSize=256 -DDataNodeMaxHeapSize=1024 -DDataNodeMaxDirectMemorySize=768 \
               -DClusterConfigurations=${{ matrix.cluster1 }},${{ matrix.cluster2 }} \
+              -Dfailsafe.includesFile="$RUNNER_TEMP/it-shard.txt" \
+              -DfailIfNoTests=false \
+              -Dfailsafe.failIfNoSpecifiedTests=false \
               -pl integration-test \
               -am -PMultiClusterIT2SubscriptionTreeRegressionConsumer \
               -ntp >> ~/run-tests-$attempt.log && return 0
-              test_output=$(cat ~/run-tests-$attempt.log)
+              test_output=$(cat ~/run-tests-$attempt.log)

-              echo "==================== BEGIN: ~/run-tests-$attempt.log ===================="
+              echo "==================== BEGIN: ~/run-tests-$attempt.log ===================="
               echo "$test_output"
               echo "==================== END: ~/run-tests-$attempt.log ======================"

@@ -631,7 +660,7 @@ jobs:
         if: failure()
         uses: actions/upload-artifact@v6
         with:
-          name: cluster-log-subscription-tree-regression-consumer-java${{ matrix.java }}-${{ runner.os }}-${{ matrix.cluster1 }}-${{ matrix.cluster2 }}
+          name: cluster-log-subscription-tree-regression-consumer-shard${{ matrix.shard }}-java${{ matrix.java }}-${{ runner.os }}-${{ matrix.cluster1 }}-${{ matrix.cluster2 }}
           path: integration-test/target/cluster-logs
           retention-days: 30
   subscription-tree-regression-misc:
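
The shard assignment in the "Build IT shard list" step can be tried locally without the repository. The sketch below is only an illustration under stated assumptions: the `org/example/...` paths and the `shard_list` helper are hypothetical and do not exist in IoTDB. It applies the same `sort | awk 'NR%t==s'` pipeline as the workflow (which, as written, assigns each entry by its sorted line number modulo the shard count), checks that the three shards together cover every path exactly once, and shows why bare simple names would be ambiguous when two files share a class name.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical class paths, relative to src/test/java/. The two "multi"
# entries share a simple name, mirroring the duplicate pairs described
# in the commit message.
paths='org/example/pullconsumer/multi/IoTDBDupIT.java
org/example/pushconsumer/multi/IoTDBDupIT.java
org/example/pullconsumer/IoTDBAlphaIT.java
org/example/pushconsumer/IoTDBBetaIT.java
org/example/pushconsumer/IoTDBGammaIT.java
org/example/pullconsumer/IoTDBDeltaIT.java'

TOTAL=3

# Print the entries assigned to shard $1: sort the list, then keep the
# lines whose 1-based line number modulo TOTAL equals the shard index.
shard_list() {
  printf '%s\n' "$paths" | sort | awk -v s="$1" -v t="$TOTAL" 'NR % t == s'
}

# Concatenate all shards and compare against the full sorted list; they
# match only if every path lands in exactly one shard.
all=$(for s in 0 1 2; do shard_list "$s"; done | sort)
expected=$(printf '%s\n' "$paths" | sort)
if [ "$all" = "$expected" ]; then
  echo "partition ok"
else
  echo "partition broken"
fi

# Stripping directories collapses the duplicate pair into one name, so a
# list of bare names could not tell the two files apart.
unique_paths=$(printf '%s\n' "$paths" | sort -u | wc -l | tr -d ' ')
unique_names=$(printf '%s\n' "$paths" | sed 's|.*/||' | sort -u | wc -l | tr -d ' ')
echo "unique paths: $unique_paths, unique simple names: $unique_names"
```

With these six sample paths each shard receives two entries, and the last line reports 6 unique paths but only 5 unique simple names, which is the ambiguity the relative-path format avoids.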
