mbutrovich opened a new pull request, #21517:
URL: https://github.com/apache/datafusion/pull/21517

   ## Which issue does this PR close?
   
   Partially addresses #20910. This might be the last one for now.
   
   ## Rationale for this change
   In full outer joins with filters, `BufferedBatch` tracks which buffered rows 
had all filter evaluations fail using a `HashMap<u64, bool>`. This map is read 
and written per-row in a hot loop during `freeze_streamed_matched`. The HashMap 
pays ~40-60 bytes per entry (8-byte u64 key + 1-byte bool value + hash table 
overhead), hashes every key twice per iteration (once for `get`, once for 
`insert`), and scatters entries across heap allocations with poor cache 
locality.
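
The access pattern described above can be sketched roughly as follows. This is an illustration of the `get`-then-`insert` shape only, not the code being replaced: the function name `record_old` and the exact update rule are assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical sketch of the per-row update on the old map.
/// The stored bool means "every filter evaluation so far has failed".
fn record_old(all_failed: &mut HashMap<u64, bool>, row: u64, passed: bool) {
    // First hash of `row`: look up the previous state (absent = nothing passed yet).
    let prev = all_failed.get(&row).copied().unwrap_or(true);
    // Second hash of `row`: write the combined state back.
    all_failed.insert(row, prev && !passed);
}
```

Each call hashes the same key twice, which is the per-row cost the direct-indexed layout avoids.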
   
   ## What changes are included in this PR?
   
   Replaces `HashMap<u64, bool>` with `Vec<FilterState>` indexed by absolute 
row position within the batch. `FilterState` is a `#[repr(u8)]` enum with three 
variants (`Unvisited`, `AllFailed`, `SomePassed`), so the Vec costs 1 byte per 
row: allocated once, direct-indexed, no hashing. At the default batch size of 
8192 rows the Vec is 8 KB, which fits in L1 cache. Even at a large batch size 
of 32K rows, the Vec is 32 KB, still within L1 on most machines, while a 
HashMap with 32K entries would consume ~1-2 MB of scattered heap memory.
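
A minimal sketch of the layout described above, for illustration only. The variant names come from this PR; the derives, the helper names `new_states` and `record`, and the update rule are assumptions rather than the actual implementation.

```rust
/// Per-row filter outcome. Variant names are from the PR description;
/// the derives are assumptions added so the sketch compiles on its own.
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum FilterState {
    Unvisited,
    AllFailed,
    SomePassed,
}

/// Hypothetical helper: one allocation, one byte per buffered row.
fn new_states(num_rows: usize) -> Vec<FilterState> {
    vec![FilterState::Unvisited; num_rows]
}

/// Hypothetical helper: fold one filter evaluation into a row's state
/// by direct indexing, with no hashing involved.
fn record(states: &mut [FilterState], row: usize, passed: bool) {
    let slot = &mut states[row];
    *slot = match (*slot, passed) {
        // Once any evaluation passes, the row stays SomePassed.
        (FilterState::SomePassed, _) | (_, true) => FilterState::SomePassed,
        // Visited, but nothing has passed so far.
        _ => FilterState::AllFailed,
    };
}
```

With `#[repr(u8)]` and three fieldless variants, `std::mem::size_of::<FilterState>()` is 1, so a batch of 8192 rows needs 8192 bytes of state.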
   
   Three states are needed because a simple `Vec<bool>` cannot distinguish 
"never matched" (handled separately by `null_joined`) from "matched but all 
filters failed" (must be emitted as null-joined). The enum variant names are 
self-documenting, unlike `Option<bool>` where `None`/`Some(true)`/`Some(false)` 
would be opaque.
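
To illustrate why the third state matters, here is a hedged sketch of selecting the rows that matched on the join key but failed every filter. The helper name `all_failed_rows` is hypothetical and the enum is redefined so the snippet stands alone.

```rust
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum FilterState {
    Unvisited,
    AllFailed,
    SomePassed,
}

/// Hypothetical helper: indices of rows that were visited but never passed
/// a filter. `Unvisited` rows are skipped here because, per the PR text,
/// never-matched rows are handled separately by `null_joined`.
fn all_failed_rows(states: &[FilterState]) -> Vec<usize> {
    states
        .iter()
        .enumerate()
        .filter(|(_, state)| **state == FilterState::AllFailed)
        .map(|(row, _)| row)
        .collect()
}
```

A `Vec<bool>` would collapse `Unvisited` and `AllFailed` into the same `false`, so this selection could not be made without extra bookkeeping.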
   
   ## Are these changes tested?
   Existing tests.
   
   ## Are there any user-facing changes?
   No. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

