CRZbulabula opened a new pull request, #17924: URL: https://github.com/apache/iotdb/pull/17924
## Background

The nightly **Daily IT** run failed in both the `Simple (8)` and `Simple (17)` jobs. Root-cause analysis pointed to three flaky tests. All three are timing issues of the form *"result visible before the state the test depends on is actually ready"*, not product bugs.

## Fixes

### 1. `IoTConsensusV2 3C3D` `testDeleteTimeSeriesReplicaConsistency` (Stream & Batch)

In Step 7 the restarted DataNode was only awaited via `isAlive()` (OS process up), not until it could actually serve queries. The next loop iteration treated it as a surviving node and connected to it, intermittently hitting `Connection refused` (RPC port not yet open / not re-registered). Now we wait until the node is genuinely queryable (connect + a trivial query succeeds) before proceeding.

### 2. Region-migration kill-point framework (`IoTDBRegionOperationReliabilityITFramework`)

Kill points are detected by a background thread tailing the node log. In the **success** path the test asserted `checkKillPointsAllTriggered` immediately after `awaitUntilSuccess`, so the migration result could become visible before the tailer had processed the kill-point line of the last phase (e.g. `RemoveRegionLocationCache`). The test then failed spuriously with `Some kill points was not triggered`. The rollback path already awaited; the success path now does too.

### 3. `IoTDBCustomizedClusterIT.testRepeatedlyRestartWholeClusterWithWrite`

`SELECT last s1 FROM root.**` is fanned out to every DataNode and the results are compared positionally across replicas. Right after a full-cluster restart the last cache is reloaded lazily, so the row order could differ across coordinators (`InconsistentDataException`). Added `ORDER BY TIMESERIES` to make the ordering deterministic, which addresses the root cause, and wrapped the comparison in a retry to tolerate the brief convergence window. The retry is bounded, so a genuine, persistent inconsistency still fails the test.
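All three fixes share one pattern: poll for the state the test actually depends on, with a bound, instead of asserting right after a weaker signal. The sketch below illustrates that pattern in plain Java. It is not the code from this PR: `ReadinessSketch`, `awaitUntil`, and `eventuallyEqual` are hypothetical names, and the `probe` argument merely stands in for "connect + a trivial query succeeds" (fix 1) or "all kill points have been seen by the log tailer" (fix 2).

```java
import java.time.Duration;
import java.util.List;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

/** Hypothetical helpers illustrating the "wait for real readiness" pattern. */
final class ReadinessSketch {

  private ReadinessSketch() {}

  /**
   * Polls {@code probe} until it returns true or {@code timeout} elapses.
   * A RuntimeException thrown by the probe (for example a wrapped
   * "Connection refused") is treated as "not ready yet", not as a failure.
   */
  static boolean awaitUntil(BooleanSupplier probe, Duration timeout, Duration pollInterval) {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      try {
        if (probe.getAsBoolean()) {
          return true;
        }
      } catch (RuntimeException notReadyYet) {
        // Swallow and retry: the node may be alive but not serving yet.
      }
      if (System.nanoTime() - deadline >= 0 || !sleep(pollInterval)) {
        return false;
      }
    }
  }

  /**
   * Re-fetches both results up to {@code maxAttempts} times and returns true
   * as soon as they are positionally equal. A mismatch that persists across
   * all attempts returns false, so a real inconsistency is not masked.
   */
  static <T> boolean eventuallyEqual(
      Supplier<List<T>> first, Supplier<List<T>> second, int maxAttempts, Duration backoff) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      if (first.get().equals(second.get())) {
        return true;
      }
      if (attempt < maxAttempts && !sleep(backoff)) {
        return false;
      }
    }
    return false;
  }

  /** Sleeps for the given duration; returns false if the thread was interrupted. */
  private static boolean sleep(Duration duration) {
    try {
      Thread.sleep(duration.toMillis());
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

In a test, the result of either helper would be asserted directly, so that a node that never becomes queryable, or replicas that never agree, still fail the test after the bound is reached. The timeout and attempt counts are assumptions to be tuned per environment.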
## Re-enable IoTV2 region-migration tests

The IoTV2 (batch & stream) region-migration ITs had a number of cases commented out with *"reopen this CI after discussion"*. This PR re-enables them and moves the **whole IoTV2 region-migration suite** (ConfigNodeCrash / ClusterCrash / DataNodeCrash) out of the `DailyIT` category so it runs in the **normal PR pipeline**, in order to observe its stability directly here.

- ConfigNodeCrash (batch+stream): enable `testCnCrashDuringDoAddPeer`, `cnCrashDuringRemoveRegionLocationCacheTest`, `cnCrashTest` (the `@Ignore`d PreCheck case is left untouched).
- ClusterCrash: enable `clusterCrash2`/`7` (batch) and `clusterCrash1`/`2`/`7` (stream).
- Generic `DataNodeCrash` (batch+stream): enable all 5 cases.
- Remove `@Category(DailyIT.class)` from the IoTV2 region-migration classes so they run per-PR.

> Note: this is intentionally pushed to the PR pipeline to gauge stability; if any newly-enabled case proves flaky we can revisit before merge.
