Below0 opened a new pull request, #15740: URL: https://github.com/apache/iceberg/pull/15740
## Problem

`HashKeyGenerator.SelectorKey` was missing `writeParallelism` and `distributionMode` from its `equals()` and `hashCode()` methods. As a result, `computeIfAbsent` always hit the cache after the first record for a given table, silently reusing a stale `KeySelector` even when these values changed.

This contradicts the class-level Javadoc, which states:

> "Caching ensures that a new key selector is also created when … the user-provided metadata changes (e.g. distribution mode, write parallelism)."

## Fix

Add `writeParallelism` and `distributionMode` to `SelectorKey`'s fields, `equals()`, `hashCode()`, and `toString()`. The effective values passed to the cache key match those used in the `computeIfAbsent` lambda: `distributionMode` is normalized via `firstNonNull(..., NONE)` and `writeParallelism` is capped at `maxWriteParallelism`.

## Note

`writeParallelism` and `distributionMode` should remain stable per table during a streaming job. Changing these values mid-stream, especially when equality fields are set, can cause routing changes that break equality delete co-location, because the subtask assignment is not monotonic across different `writeParallelism` values (i.e., the subtask set for parallelism N is not guaranteed to be a subset of the set for parallelism N+1). Making the subtask assignment monotonic (e.g., via a consistent ordering based on `maxWriteParallelism`) could address this limitation in a follow-up.

## Testing

Added two regression tests to `TestHashKeyGenerator`:

- `testCacheMissOnWriteParallelismChange`
- `testCacheMissOnDistributionModeChange`

Closes #15731
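The caching behavior described under Problem and Fix can be illustrated with a minimal, self-contained sketch. It is not the actual Iceberg code: `SelectorKeySketch`, `Mode`, `Key`, and `selectorFor` are hypothetical names, the cached value is a plain `String` standing in for a `KeySelector`, the real `SelectorKey` carries more fields than shown here, and "capped" is assumed to mean `Math.min(writeParallelism, maxWriteParallelism)`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class SelectorKeySketch {
  /** Hypothetical stand-in for the distribution mode enum. */
  public enum Mode { NONE, HASH, RANGE }

  /** Simplified cache key; both user-provided values take part in equality. */
  public static final class Key {
    private final String tableName;
    private final int writeParallelism;
    private final Mode distributionMode;

    Key(String tableName, int writeParallelism, Mode distributionMode) {
      this.tableName = tableName;
      this.writeParallelism = writeParallelism;
      this.distributionMode = distributionMode;
    }

    @Override
    public boolean equals(Object other) {
      if (this == other) {
        return true;
      }
      if (!(other instanceof Key)) {
        return false;
      }
      Key that = (Key) other;
      // Leaving the last two comparisons out reproduces the stale-cache behavior.
      return tableName.equals(that.tableName)
          && writeParallelism == that.writeParallelism
          && distributionMode == that.distributionMode;
    }

    @Override
    public int hashCode() {
      return Objects.hash(tableName, writeParallelism, distributionMode);
    }
  }

  private final Map<Key, String> cache = new HashMap<>();
  private int selectorsBuilt = 0;

  /** Returns a cached "selector", building a new one only on a cache miss. */
  public String selectorFor(String table, Mode mode, int writeParallelism, int maxWriteParallelism) {
    // Normalize before building the key so the key matches what the lambda uses.
    Mode effectiveMode = mode == null ? Mode.NONE : mode;
    int effectiveParallelism = Math.min(writeParallelism, maxWriteParallelism);
    Key key = new Key(table, effectiveParallelism, effectiveMode);
    return cache.computeIfAbsent(key, k -> {
      selectorsBuilt++;
      return effectiveMode + "/" + effectiveParallelism;
    });
  }

  public int selectorsBuilt() {
    return selectorsBuilt;
  }

  public static void main(String[] args) {
    SelectorKeySketch generator = new SelectorKeySketch();
    generator.selectorFor("db.t", Mode.HASH, 4, 8); // miss: first record
    generator.selectorFor("db.t", Mode.HASH, 4, 8); // hit: nothing changed
    generator.selectorFor("db.t", Mode.HASH, 6, 8); // miss: parallelism changed
    generator.selectorFor("db.t", null, 6, 8);      // miss: mode normalized to NONE
    generator.selectorFor("db.t", Mode.NONE, 6, 8); // hit: same effective values
    System.out.println(generator.selectorsBuilt()); // prints 4
  }
}
```

Building the key from the normalized values rather than the raw inputs means that `null` and `NONE`, or a requested parallelism of 16 and 8 with a maximum of 8, map to the same entry instead of producing duplicate selectors.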
