kosiew opened a new pull request, #17286:
URL: https://github.com/apache/datafusion/pull/17286
## Which issue does this PR close?
* Closes #17280.
## Rationale for this change
The accumulator previously collected build-side partition bounds and then
**sorted** them with `sorted_by_key`, which:
* Introduced **extra allocations**, and
* Added **O(n log n)** overhead in the number of completed partitions.
Since partitions already have stable IDs, we can **pre-index** bounds by
partition ID and avoid sorting entirely. This makes dynamic filter construction
**O(n)** with fewer allocations, improves predictability, and eliminates a
source of nondeterminism tied to completion order.
## What changes are included in this PR?
* Replaced `PartitionBounds` + `sorted_by_key` with a **preallocated
`Vec<Option<Vec<ColumnBounds>>>`** indexed by partition ID.
* Eliminated sorting and the dependency on `itertools`, reducing allocations
and algorithmic overhead.
* Updated accumulator logic to:
  * Insert bounds in **O(1)** at the correct index (by partition ID).
  * Validate out-of-range partition IDs and return a clear internal error
    instead of panicking.
  * Build the dynamic filter once **all partitions have reported**, ignoring
    missing partitions.
* Adjusted `create_filter_from_partition_bounds` to iterate the fixed-index
vector and construct predicates without any intermediate sorting/allocation.
* Kept/clarified determinism as a by-product: completion order no longer
affects the resulting predicate.
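
The accumulator changes listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `BoundsAccumulator`, `report`, and `present_bounds` are hypothetical names, `ColumnBounds` is simplified to `i64` min/max values, and the error is a plain `String` rather than a DataFusion internal error. It also assumes each partition reports exactly once.

```rust
/// Hypothetical stand-in for per-column min/max bounds. The real
/// `ColumnBounds` is not reproduced here; `i64` keeps the sketch small.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnBounds {
    pub min: i64,
    pub max: i64,
}

/// Illustrative accumulator with one preallocated slot per build-side
/// partition, indexed by partition ID.
pub struct BoundsAccumulator {
    slots: Vec<Option<Vec<ColumnBounds>>>,
    reported: usize,
}

impl BoundsAccumulator {
    pub fn new(num_partitions: usize) -> Self {
        Self {
            slots: vec![None; num_partitions],
            reported: 0,
        }
    }

    /// Records one partition's completion in O(1). `bounds` is `None` when
    /// the partition has nothing to contribute. Returns `Ok(true)` once every
    /// partition has reported (assuming each reports exactly once), and an
    /// error instead of panicking when the ID is out of range.
    pub fn report(
        &mut self,
        partition: usize,
        bounds: Option<Vec<ColumnBounds>>,
    ) -> Result<bool, String> {
        let len = self.slots.len();
        let slot = self
            .slots
            .get_mut(partition)
            .ok_or_else(|| format!("partition {partition} out of range (expected < {len})"))?;
        *slot = bounds;
        self.reported += 1;
        Ok(self.reported == len)
    }

    /// Iterates the bounds that are present, already in partition-ID order,
    /// so no sorting is needed regardless of completion order.
    pub fn present_bounds(&self) -> impl Iterator<Item = (usize, &Vec<ColumnBounds>)> + '_ {
        self.slots
            .iter()
            .enumerate()
            .filter_map(|(id, b)| b.as_ref().map(|b| (id, b)))
    }
}
```

Because the slot index is the partition ID, walking `slots` front to back yields the same order that sorting by partition ID would have produced.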
## Are these changes tested?
Yes.
* Added an async test `test_hashjoin_dynamic_filter_pushdown_out_of_order`
that intentionally reverses the completion order of build-side partitions across
runs and asserts that the resulting dynamic filter predicate string is identical,
showing that the result does not depend on completion order.
* Existing join and dynamic filter tests continue to pass.
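
The idea behind the out-of-order test can be illustrated with a small standalone sketch. It is not the test added in this PR: `render_predicate`, the column name `a`, the `(min, max)` tuples, and the predicate string format are all assumptions made for illustration, and the real predicate display differs.

```rust
/// Fills one slot per partition in whatever order `reports` arrives, then
/// renders a predicate string by walking the slots in partition-ID order.
/// The column name `a` and the output format are hypothetical.
pub fn render_predicate(num_partitions: usize, reports: &[(usize, (i64, i64))]) -> String {
    let mut slots: Vec<Option<(i64, i64)>> = vec![None; num_partitions];
    for &(partition, bounds) in reports {
        // Out-of-range IDs are silently skipped here for brevity; a real
        // implementation would return an error instead.
        if let Some(slot) = slots.get_mut(partition) {
            *slot = Some(bounds);
        }
    }
    slots
        .iter()
        .flatten() // skip partitions that never reported bounds
        .map(|(min, max)| format!("(a >= {min} AND a <= {max})"))
        .collect::<Vec<_>>()
        .join(" OR ")
}
```

Calling `render_predicate` with the same reports in forward and in reversed order should give the same string, since the output order is fixed by the slot index and not by arrival order.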
## Are there any user-facing changes?
No API-breaking changes.
* Internals of dynamic filter construction were optimized for efficiency and
determinism.
* Query semantics remain unchanged, but performance improves due to reduced
allocations and removal of sorting overhead.
---
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]