mbutrovich opened a new pull request, #21517: URL: https://github.com/apache/datafusion/pull/21517
## Which issue does this PR close?

Partially addresses #20910, might be the last one for now.

## Rationale for this change

In full outer joins with filters, `BufferedBatch` tracks which buffered rows had all filter evaluations fail using a `HashMap<u64, bool>`. This map is read and written per-row in a hot loop during `freeze_streamed_matched`. The HashMap pays ~40-60 bytes per entry (8-byte u64 key + 1-byte bool value + hash table overhead), hashes every key twice per iteration (once for `get`, once for `insert`), and scatters entries across heap allocations with poor cache locality.

## What changes are included in this PR?

Replaces `HashMap<u64, bool>` with `Vec<FilterState>` indexed by absolute row position within the batch. `FilterState` is a `#[repr(u8)]` enum with three variants (`Unvisited`, `AllFailed`, `SomePassed`), so the Vec is 1 byte per row: allocated once, direct-indexed, no hashing. At the default batch size of 8192 rows the Vec is 8 KB (fits in L1 cache). Even at large batch sizes (32K+), 32 KB is still within L1 on most machines, while the HashMap at 32K entries would consume ~1-2 MB of scattered heap memory.

Three states are needed because a simple `Vec<bool>` cannot distinguish "never matched" (handled separately by `null_joined`) from "matched but all filters failed" (must be emitted as null-joined). The enum variant names are self-documenting, unlike `Option<bool>` where `None`/`Some(true)`/`Some(false)` would be opaque.

## Are these changes tested?

Existing tests.

## Are there any user-facing changes?

No.
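
For readers skimming the description, the three-state tracking could look roughly like the sketch below. Only the `FilterState` name, its three variants, and `#[repr(u8)]` come from the description above; the `record` and `all_failed_rows` helpers, the transition rule, and the row indices are assumptions made for illustration and are not taken from the actual diff.

```rust
use std::mem::size_of;

/// Per-row filter outcome for a buffered batch (variant names from the PR description).
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum FilterState {
    Unvisited = 0,
    AllFailed = 1,
    SomePassed = 2,
}

/// Hypothetical helper: record one filter evaluation for buffered row `row`.
fn record(states: &mut [FilterState], row: usize, passed: bool) {
    states[row] = match (states[row], passed) {
        // Once any evaluation passes, the row stays SomePassed.
        (FilterState::SomePassed, _) | (_, true) => FilterState::SomePassed,
        // Otherwise the row was visited and every evaluation so far failed.
        (_, false) => FilterState::AllFailed,
    };
}

/// Hypothetical helper: rows that were visited but failed every filter evaluation.
fn all_failed_rows(states: &[FilterState]) -> Vec<usize> {
    states
        .iter()
        .enumerate()
        .filter(|(_, s)| **s == FilterState::AllFailed)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // One byte per row, allocated once for the whole batch (8192 rows = 8 KB).
    let mut states = vec![FilterState::Unvisited; 8192];

    record(&mut states, 3, false);
    record(&mut states, 3, true); // row 3 passes once, so it is no longer AllFailed
    record(&mut states, 3, false);
    record(&mut states, 7, false); // row 7 only ever fails

    println!("bytes per row: {}", size_of::<FilterState>());
    println!("all-failed rows: {:?}", all_failed_rows(&states));
}
```

Each update is a single indexed load and store, with no hashing and no per-entry allocation, which is the access pattern the description contrasts with the `get` plus `insert` pair on the HashMap.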
