Shiyang-Zhao opened a new pull request, #18621:
URL: https://github.com/apache/druid/pull/18621

   This PR fixes nondeterministic behavior in the following flaky tests:
   - 
org.apache.druid.indexing.kafka.supervisor.KafkaSupervisorTest.testIsTaskCurrent
  
   - 
org.apache.druid.indexing.kinesis.supervisor.KinesisSupervisorTest.testIsTaskCurrent
  
   
   ### Description
   
   These tests occasionally failed due to nondeterministic behavior in how the 
supervisor evaluated whether a task was “current.”  
   The issue originated from unstable iteration order inside the `DataSchema` 
serialization process, which affected equality checks between supervisor and 
task configurations.
   
   ```
   KafkaSupervisorTest.testIsTaskCurrent()
     ↓
   KafkaSupervisor.isTaskCurrent()
     ↓
   SeekableStreamSupervisor.isTaskCurrent()
     ↓
   SeekableStreamSupervisor.generateSequenceName()
     ↓
   ObjectMapper.writeValueAsString(DataSchema)
     ↓
   DataSchema → DimensionsSpec.dimensionExclusions used a HashSet 
(non-deterministic iteration)
   ```
   
   Here is a sample of an error from running 
**org.apache.druid.indexing.kafka.supervisor.KafkaSupervisorTest.testIsTaskCurrent**:
   
   ```
   [ERROR] 
org.apache.druid.indexing.kafka.supervisor.KafkaSupervisorTest.testIsTaskCurrent[numThreads
 = 1] -- Time elapsed: 0.79 s <<< FAILURE!
   java.lang.AssertionError
           at org.junit.Assert.fail(Assert.java:87)
           at org.junit.Assert.assertTrue(Assert.java:42)
           at org.junit.Assert.assertTrue(Assert.java:53)
           at 
org.apache.druid.indexing.kafka.supervisor.KafkaSupervisorTest.testIsTaskCurrent(KafkaSupervisorTest.java:4740)
   
   [ERROR] 
org.apache.druid.indexing.kafka.supervisor.KafkaSupervisorTest.testIsTaskCurrent[numThreads
 = 8] -- Time elapsed: 0.04 s <<< FAILURE!
   java.lang.AssertionError
           at org.junit.Assert.fail(Assert.java:87)
           at org.junit.Assert.assertTrue(Assert.java:42)
           at org.junit.Assert.assertTrue(Assert.java:53)
           at 
org.apache.druid.indexing.kafka.supervisor.KafkaSupervisorTest.testIsTaskCurrent(KafkaSupervisorTest.java:4741)
   ```
   
   **Problem:**  
   `SeekableStreamSupervisor.isTaskCurrent()` verifies task consistency by 
recomputing a sequence name hash using `generateSequenceName()`, which 
serializes the `DataSchema` object to JSON. Within `DataSchema`, the 
`DimensionsSpec` class maintained its `dimensionExclusions` field as a 
`HashSet`, which does not guarantee element iteration order. Because JSON 
serialization depends on iteration order, identical `DimensionsSpec` instances 
could produce different serialized strings across runs. This caused the 
computed hashes for otherwise equivalent `DataSchema` objects to differ, 
leading `isTaskCurrent()` to incorrectly treat valid tasks as outdated or 
inconsistent. The issue manifested as nondeterministic test outcomes, since the 
JSON field order—and thus the resulting hash—varied between executions.
   
   **Proposed Changes:**  
   - Replaced `HashSet` with `LinkedHashSet` in `DimensionsSpec` constructor to 
ensure deterministic iteration order.  
   - Ensured `DataSchema` serialization and equality checks are now stable and 
repeatable.
   
   Here is a sample of an error from running 
**org.apache.druid.indexing.kinesis.supervisor.KinesisSupervisorTest.testIsTaskCurrent**:
   
   ```
   [ERROR] 
org.apache.druid.indexing.kinesis.supervisor.KinesisSupervisorTest.testIsTaskCurrent[numThreads
 = 1] -- Time elapsed: 0.84 s <<< FAILURE!
   java.lang.AssertionError
           at org.junit.Assert.fail(Assert.java:87)
           at org.junit.Assert.assertTrue(Assert.java:42)
           at org.junit.Assert.assertTrue(Assert.java:53)
           at 
org.apache.druid.indexing.kinesis.supervisor.KinesisSupervisorTest.testIsTaskCurrent(KinesisSupervisorTest.java:4821)
   ```
   
   **Problem:**  
   The same serialization nondeterminism affected Kinesis supervisor tests, 
since `KinesisSupervisor` extends the same  
   `SeekableStreamSupervisor` base class and reuses the same `isTaskCurrent()` 
logic.  
   
   **Proposed Changes:**  
   - No Kinesis-specific changes were required; the fix in `DimensionsSpec` 
resolved both cases. 
   
   ---
   
   This PR has:
   
   - [x] been self-reviewed.  
   - [x] ensured no production logic changes beyond test stabilization.  
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to