(iotdb) 01/01: Shard subscription-tree-regression-consumer to speed up Multi-Cluster IT

jackietien Sat, 16 May 2026 18:51:44 -0700

This is an automated email from the ASF dual-hosted git repository.

JackieTien97 pushed a commit to branch shard-subscription-consumer-it
in repository https://gitbox.apache.org/repos/asf/iotdb.git


commit b996d09e21656fcd801ea0f8f04f5b332c556b06
Author: JackieTien97 <[email protected]>
AuthorDate: Sun May 17 09:51:15 2026 +0800

    Shard subscription-tree-regression-consumer to speed up Multi-Cluster IT
    
    The Multi-Cluster IT pipeline (pipe-it.yml) runs 11 parallel jobs on every
    PR. Of those, subscription-tree-regression-consumer is the longest pole:
    72 IT classes annotated with
    @Category(MultiClusterIT2SubscriptionTreeRegressionConsumer.class), each
    restarting two ScalableSingleNodeMode clusters in setUp(), executed
    serially in a single forkCount=1 JVM. Estimated wall clock ~30-45 min,
    while every other job in the workflow finishes in ~10-20 min.
    
    Split this job into 3 parallel matrix shards using the same hash-mod
    pattern that cluster-it-1c1d.yml introduced (commits 89748f1ff6,
    a343cf50e3, 02ef20af29). Each shard runs ~24 of the 72 classes and is
    expected to finish in ~12-18 min, removing this job as the workflow's
    bottleneck. The shard list is written to $RUNNER_TEMP/it-shard.txt for
    the same RAT-avoidance reason as 1C1D.
    
    Two deviations from the 1C1D pattern:
    
    1. The shard list emits paths relative to src/test/java/ (e.g.,
       org/apache/iotdb/.../IoTDBFooIT.java) instead of bare class names.
       This suite has 6 pairs of duplicate simple names across
       pushconsumer/multi/ and pullconsumer/multi/ (e.g.,
       IoTDBOneConsumerMultiTopicsTsfileIT exists in both). Bare names would
       cause failsafe to match both files for each entry, running those 6
       classes twice across shards.
    
    2. The other subscription / dual-cluster jobs in this workflow are not
       sharded. subscription-tree-regression-misc (13 classes) is borderline;
       arch-verification jobs (1-4 classes each) and dual-tree/dual-table
       jobs (9-13 classes) are well under the new shard wall clock and would
       not benefit. Revisit if any of them becomes the new long pole.
    
    Local counts on macOS:
    - Total classes matching the annotation: 72
    - Per-shard distribution after hash-mod: 24/24/24
    - Unique paths after sed normalization: 72 (no collisions)
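
The first deviation (relative paths instead of bare class names) can be checked locally with a small sketch like the one below. It is only an illustration: the `org/example/...` paths are hypothetical stand-ins, not the real package layout, and `paths` and `duplicate_simple_names` are helper names invented for this example. It strips the directories from each path and reports any simple file name that occurs more than once, which is the situation that would make a bare-name include list ambiguous.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical relative paths. Only the pushconsumer/pullconsumer split and the
# class name IoTDBOneConsumerMultiTopicsTsfileIT come from the commit message;
# the org/example prefix is an assumption made for illustration.
paths() {
  cat <<'EOF'
org/example/pushconsumer/multi/IoTDBOneConsumerMultiTopicsTsfileIT.java
org/example/pullconsumer/multi/IoTDBOneConsumerMultiTopicsTsfileIT.java
org/example/pushconsumer/multi/IoTDBFooIT.java
EOF
}

# Drop everything up to the last slash, then print each simple name that
# appears more than once (uniq -d prints one line per duplicated value).
duplicate_simple_names() {
  paths | sed 's|.*/||' | sort | uniq -d
}

duplicate_simple_names
```

In this toy input the full paths are all distinct while one simple name is shared, so a list of full relative paths stays unambiguous where a list of bare names would not.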
---
 .github/workflows/pipe-it.yml | 35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/.github/workflows/pipe-it.yml b/.github/workflows/pipe-it.yml
index 0968e7739a0..29bbf365e42 100644
--- a/.github/workflows/pipe-it.yml
+++ b/.github/workflows/pipe-it.yml
@@ -548,6 +548,8 @@ jobs:
           name: cluster-log-subscription-table-arch-verification-java${{ matrix.java }}-${{ runner.os }}-${{ matrix.cluster1 }}-${{ matrix.cluster2 }}
           path: integration-test/target/cluster-logs
           retention-days: 30
+  # 72 IT classes split across 3 parallel shards to cut the longest-pole job
+  # from ~30-45 min to ~12-18 min. See cluster-it-1c1d.yml for the prior art.
   subscription-tree-regression-consumer:
     strategy:
       fail-fast: false
@@ -558,6 +560,7 @@ jobs:
         cluster1: [ScalableSingleNodeMode]
         cluster2: [ScalableSingleNodeMode]
         os: [ubuntu-latest]
+        shard: [0, 1, 2]
     runs-on: ${{ matrix.os }}
     steps:
       - uses: actions/checkout@v5
@@ -577,6 +580,29 @@ jobs:
       - name: Sleep for a random duration between 0 and 10000 milliseconds
         run: |
           sleep  $(( $(( RANDOM % 10000 + 1 )) / 1000))
+      - name: Build IT shard list
+        shell: bash
+        # Distribute MultiClusterIT2SubscriptionTreeRegressionConsumer test classes
+        # across 3 shards using hash-mod assignment. The list is written under
+        # $RUNNER_TEMP (outside the repo) so Apache RAT's license check does not
+        # flag it - see cluster-it-1c1d.yml, which uses the same path for the
+        # same reason. Each runner has its own $RUNNER_TEMP, so this workflow
+        # and the 1C1D one writing to the same filename never collide.
+        # We emit paths relative to src/test/java/ (not bare class names like
+        # cluster-it-1c1d.yml does) because this suite has 6 pairs of duplicate
+        # simple names across pushconsumer/multi/ and pullconsumer/multi/ - bare
+        # names would cause those classes to run twice across shards.
+        run: |
+          set -euo pipefail
+          SHARD=${{ matrix.shard }}
+          TOTAL=3
+          grep -rlE --include='*IT.java' '\bMultiClusterIT2SubscriptionTreeRegressionConsumer\b' integration-test/src/test/java \
+            | sed 's|.*/src/test/java/||' \
+            | sort \
+            | awk -v s=$SHARD -v t=$TOTAL 'NR%t==s' \
+            > "$RUNNER_TEMP/it-shard.txt"
+          echo "Shard $SHARD/$TOTAL contains $(wc -l < "$RUNNER_TEMP/it-shard.txt") test classes"
+          head -5 "$RUNNER_TEMP/it-shard.txt"
       - name: IT Test
         shell: bash
         # we do not compile client-cpp for saving time, it is tested in client.yml
@@ -594,12 +620,15 @@ jobs:
               -DskipUTs \
               -DintegrationTest.forkCount=1 -DConfigNodeMaxHeapSize=256 -DDataNodeMaxHeapSize=1024 -DDataNodeMaxDirectMemorySize=768 \
               -DClusterConfigurations=${{ matrix.cluster1 }},${{ matrix.cluster2 }} \
+              -Dfailsafe.includesFile="$RUNNER_TEMP/it-shard.txt" \
+              -DfailIfNoTests=false \
+              -Dfailsafe.failIfNoSpecifiedTests=false \
               -pl integration-test \
               -am -PMultiClusterIT2SubscriptionTreeRegressionConsumer \
               -ntp >> ~/run-tests-$attempt.log && return 0
-              test_output=$(cat ~/run-tests-$attempt.log) 
+              test_output=$(cat ~/run-tests-$attempt.log)
 
-              echo "==================== BEGIN: ~/run-tests-$attempt.log ===================="          
+              echo "==================== BEGIN: ~/run-tests-$attempt.log ===================="
               echo "$test_output"
               echo "==================== END: ~/run-tests-$attempt.log ======================"
 
@@ -631,7 +660,7 @@ jobs:
         if: failure()
         uses: actions/upload-artifact@v6
         with:
-          name: cluster-log-subscription-tree-regression-consumer-java${{ matrix.java }}-${{ runner.os }}-${{ matrix.cluster1 }}-${{ matrix.cluster2 }}
+          name: cluster-log-subscription-tree-regression-consumer-shard${{ matrix.shard }}-java${{ matrix.java }}-${{ runner.os }}-${{ matrix.cluster1 }}-${{ matrix.cluster2 }}
           path: integration-test/target/cluster-logs
           retention-days: 30
   subscription-tree-regression-misc:
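
The selection rule used by the `Build IT shard list` step can be tried outside CI with a sketch like the following. Note that, as written in the diff, the step assigns by line position in the sorted list (`NR % t == s`) rather than by hashing the name. The file names below are hypothetical, and `list_classes` and `shard` are helper names invented for this example; only the `sort | awk` pipeline mirrors the workflow.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the grep/sed output in the workflow step;
# the seven names are illustrative only.
list_classes() {
  printf '%s\n' a/AIT.java b/BIT.java c/CIT.java d/DIT.java \
                e/EIT.java f/FIT.java g/GIT.java
}

# Same rule as the workflow: keep the lines whose 1-based line number,
# taken modulo the shard count, equals the shard index.
shard() {
  local s=$1 t=$2
  list_classes | sort | awk -v s="$s" -v t="$t" 'NR % t == s'
}

TOTAL=3
for s in 0 1 2; do
  echo "shard $s/$TOTAL: $(shard "$s" "$TOTAL" | tr '\n' ' ')"
done
```

Because every line number falls into exactly one residue class, the shards are disjoint and together cover the whole list; with a count that is not a multiple of 3 (7 here) the shard sizes differ by at most one.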
