CRZbulabula opened a new pull request, #17924:
URL: https://github.com/apache/iotdb/pull/17924

   ## Background
   
   The nightly **Daily IT** run failed in both the `Simple (8)` and `Simple (17)` 
jobs. Root-cause analysis traced the failures to three flaky tests. All three 
are timing issues of the form *"result visible before the state the test 
depends on is actually ready"*, not product bugs.
   
   ## Fixes
   
   ### 1. `IoTConsensusV2 3C3D` `testDeleteTimeSeriesReplicaConsistency` (Stream & Batch)
   In Step 7 the test waited for the restarted DataNode only via `isAlive()`, 
which confirms that the OS process is up, not that the node can serve queries. 
The next loop iteration treated it as a surviving node and connected to it, 
intermittently hitting `Connection refused` (RPC port not yet open / not 
re-registered). Now we wait until the node is genuinely queryable (a connection 
plus a trivial query succeeds) before proceeding.
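
   A minimal sketch of such a readiness wait, using only the JDK. The names 
`NodeReadiness`, `Probe`, and `awaitQueryable` are hypothetical and are not 
taken from the IoTDB test framework; in the real test the probe would open a 
connection to the restarted DataNode and run a trivial query.

```java
import java.time.Duration;

// Hypothetical helper, not part of the IoTDB test framework.
class NodeReadiness {
  /** A readiness check that throws while the node cannot serve queries yet. */
  interface Probe {
    void run() throws Exception;
  }

  /**
   * Polls the probe until it completes without throwing or the timeout elapses.
   * Returns true if the probe succeeded, false on timeout.
   */
  static boolean awaitQueryable(Probe probe, Duration timeout, Duration interval)
      throws InterruptedException {
    long deadline = System.nanoTime() + timeout.toNanos();
    while (true) {
      try {
        probe.run(); // e.g. open a connection and run a trivial query
        return true;
      } catch (InterruptedException e) {
        throw e; // do not swallow interruption
      } catch (Exception e) {
        // Not ready yet (for example "Connection refused"); retry below.
      }
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      Thread.sleep(interval.toMillis());
    }
  }
}
```

   The point of the sketch is that readiness is defined by the operation the 
test performs next (connect and query), not by process liveness.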
   
   ### 2. Region-migration kill-point framework (`IoTDBRegionOperationReliabilityITFramework`)
   Kill points are detected by a background thread tailing the node log. In the 
**success** path the test asserted `checkKillPointsAllTriggered` immediately 
after `awaitUntilSuccess`, so the migration result could become visible before 
the tailer had processed the kill-point line of the last phase (e.g. 
`RemoveRegionLocationCache`). The assertion then failed spuriously with `Some 
kill points was not triggered`. The rollback path already awaited; the success 
path now does too.
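
   The race can be illustrated with a small JDK-only sketch. `KillPointAwait` 
and `awaitAllTriggered` are hypothetical names, and the shared set stands in 
for whatever structure the log tailer fills; the framework's actual 
implementation may differ.

```java
import java.util.Set;

// Hypothetical helper, not part of the IoTDB test framework.
class KillPointAwait {
  /**
   * Waits until every expected kill point has been recorded in the
   * (thread-safe) set that the log-tailing thread writes to.
   * Returns false if the timeout elapses first.
   */
  static boolean awaitAllTriggered(Set<String> triggered, Set<String> expected, long timeoutMs)
      throws InterruptedException {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (!triggered.containsAll(expected)) {
      if (System.nanoTime() - deadline >= 0) {
        return false; // the tailer never saw some kill point
      }
      Thread.sleep(10); // give the tailer time to process the next log line
    }
    return true;
  }
}
```

   Asserting only after a bounded wait like this keeps a genuinely missing kill 
point a failure, while tolerating the tailer lagging behind the migration 
result.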
   
   ### 3. `IoTDBCustomizedClusterIT.testRepeatedlyRestartWholeClusterWithWrite`
   `SELECT last s1 FROM root.**` is fanned out to every DataNode and the results 
are compared positionally across replicas. Right after a full-cluster restart 
the last cache is reloaded lazily, so the row order could differ across 
coordinators (`InconsistentDataException`). Added `ORDER BY TIMESERIES` to make 
the ordering deterministic, which addresses the root cause, and wrapped the 
comparison in a retry to tolerate the brief convergence window without masking 
a genuine, persistent inconsistency.
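
   A bounded retry of this kind might look like the sketch below. 
`ConsistencyRetry` and `retryUntilConsistent` are hypothetical names; the 
supplier stands in for running the (now ordered) query against every DataNode, 
and the list it returns is assumed to be non-empty.

```java
import java.util.List;
import java.util.Objects;
import java.util.function.Supplier;

// Hypothetical helper, not part of the IoTDB test framework.
class ConsistencyRetry {
  /**
   * Fetches one result per node and retries until all results are equal.
   * Gives up after maxAttempts so a persistent inconsistency still fails.
   */
  static <T> T retryUntilConsistent(
      Supplier<List<T>> fetchFromAllNodes, int maxAttempts, long backoffMs)
      throws InterruptedException {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    for (int attempt = 1; ; attempt++) {
      List<T> results = fetchFromAllNodes.get();
      T first = results.get(0);
      boolean consistent = results.stream().allMatch(r -> Objects.equals(first, r));
      if (consistent) {
        return first;
      }
      if (attempt >= maxAttempts) {
        // Surface the disagreement instead of hiding it behind endless retries.
        throw new IllegalStateException(
            "replicas still inconsistent after " + maxAttempts + " attempts: " + results);
      }
      Thread.sleep(backoffMs);
    }
  }
}
```

   The retry only absorbs the short convergence window; the deterministic 
ordering is what makes a positional comparison meaningful in the first place.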
   
   ## Re-enable IoTV2 region-migration tests
   
   The IoTV2 (batch & stream) region-migration ITs had a number of cases 
commented out with *"reopen this CI after discussion"*. This PR re-enables them 
and moves the **whole IoTV2 region-migration suite** (ConfigNodeCrash / 
ClusterCrash / DataNodeCrash) out of the `DailyIT` category so it runs in the 
**normal PR pipeline**, in order to observe its stability directly here.
   
   - ConfigNodeCrash (batch+stream): enable `testCnCrashDuringDoAddPeer`, 
`cnCrashDuringRemoveRegionLocationCacheTest`, `cnCrashTest` (the `@Ignore`d 
PreCheck case is left untouched).
   - ClusterCrash: enable `clusterCrash2`/`7` (batch) and 
`clusterCrash1`/`2`/`7` (stream).
   - Generic `DataNodeCrash` (batch+stream): enable all 5 cases.
   - Remove `@Category(DailyIT.class)` from the IoTV2 region-migration classes 
so they run per-PR.
   
   > Note: this is intentionally pushed to the PR pipeline to gauge stability; 
if any newly-enabled case proves flaky we can revisit before merge.

