[PR] [FLINK-38286][table] Fix: MAP function with duplicate keys produces non-deterministic results [flink]

via GitHub Tue, 26 Aug 2025 03:47:01 -0700


raminqaf opened a new pull request, #26942:
URL: https://github.com/apache/flink/pull/26942


   ## What is the purpose of the change
   
   This pull request fixes non-deterministic behavior in the MAP function when 
duplicate keys are provided. The MAP function was producing inconsistent 
results across different environments and test runs, causing CI failures and 
breaking reproducibility guarantees.
   
   The root cause was in the code generation logic in 
`ScalarOperatorGens.scala`, where `groupBy` was used to deduplicate keys, but 
the subsequent `.keys` and `.values` extraction had non-deterministic iteration 
order, breaking the correspondence between key and value arrays in the 
generated code.
   
   ## Brief change log
   
   - Added `groupByOrdered` utility method in `GenerateUtils` that uses 
`LinkedHashMap` to preserve insertion order during grouping operations
   - Updated MAP function code generation in `ScalarOperatorGens.scala` to use 
deterministic order-preserving deduplication instead of non-deterministic 
`groupBy`
   - Ensured "last value wins" semantics for duplicate keys by taking the last 
occurrence in argument order
   - Fixed key-value array correspondence in generated code to prevent 
mismatched entries
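
The order-preserving, last-value-wins deduplication described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the Scala code from the PR: the class `MapDedupSketch` and the helpers `dedupLastWins`, `keysOf`, and `valuesOf` are hypothetical names, and the arguments are modelled as an alternating list `k1, v1, k2, v2, ...` in the same shape as `MAP[1, 1, 1, 2]`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MapDedupSketch {

    // Hypothetical helper, not Flink's API: deduplicates MAP-style arguments
    // given as an alternating list k1, v1, k2, v2, ...
    static Map<Object, Object> dedupLastWins(List<?> args) {
        if (args.size() % 2 != 0) {
            throw new IllegalArgumentException("expected an even number of arguments");
        }
        // LinkedHashMap iterates in insertion order. Re-putting an existing key
        // overwrites its value (last value wins) but keeps the key at the
        // position of its first occurrence.
        Map<Object, Object> byKey = new LinkedHashMap<>();
        for (int i = 0; i < args.size(); i += 2) {
            byKey.put(args.get(i), args.get(i + 1));
        }
        return byKey;
    }

    // Keys and values are both read from the same ordered map, so index i of
    // the key list always corresponds to index i of the value list.
    static List<Object> keysOf(Map<Object, Object> deduped) {
        return new ArrayList<>(deduped.keySet());
    }

    static List<Object> valuesOf(Map<Object, Object> deduped) {
        return new ArrayList<>(deduped.values());
    }

    public static void main(String[] args) {
        Map<Object, Object> deduped = dedupLastWins(Arrays.asList(1, 1, 1, 2));
        System.out.println(deduped); // prints {1=2}
    }
}
```

The key point of the sketch is that the key list and the value list come from one insertion-ordered structure, so they cannot drift apart the way two independently ordered collections can.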
   
   ## Verifying this change
   
   This change is already covered by existing tests:
   
   - **MapFunctionITCase.test()** contains the failing test case 
`map(f0, f0, f0, f1)`, which was producing non-deterministic results
   - With the fix, the previously flaky case `MAP[1, 1, 1, 2] → {1=2}` passes 
consistently
   - All existing MAP function tests continue to pass with deterministic 
behavior
   - Manual verification shows consistent results across multiple test runs and 
environments
   
   The change specifically addresses test failures that occurred when constant 
folding was disabled, ensuring both code paths (optimized and runtime) produce 
consistent results.
   
   ## Does this pull request potentially affect one of the following parts:
   
   - Dependencies (does it add or upgrade a dependency): **no**
   - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: **no**
   - The serializers: **no**
   - The runtime per-record code paths (performance sensitive): **yes** - 
affects MAP function code generation, but with minimal performance impact (same 
O(n) complexity)
   - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: **no**
   - The S3 file system connector: **no**
   
   ## Documentation
   
   - Does this pull request introduce a new feature? **no**
   - If yes, how is the feature documented? **not applicable** - this is a bug 
fix that


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

