Dandandan opened a new pull request, #22677:
URL: https://github.com/apache/datafusion/pull/22677

   ## Which issue does this PR close?
   
   - N/A (hash join probe-side micro-optimization).
   
   ## Rationale for this change
   
   In the hash join probe path, 
`JoinHashMap::get_matched_indices_with_limit_offset` handles the non-unique-key 
(chained) case by, for each probe row, looking up the head of the collision 
chain (`map.find`) and immediately walking that chain. Because the `map.find` 
result feeds straight into the chain walk that consumes it, the hash-table 
probes are effectively serialized: each row's hash-table cache miss must 
resolve before the next row's lookup begins.
   
   The `map.find` miss is the dominant cost in this path, and these lookups are 
independent across probe rows — an ideal candidate for memory-level parallelism.
   
   ## What changes are included in this PR?
   
   Process the probe rows in windows of 16 with two phases per window:
   
   1. **Lookup phase** — resolve the head-of-chain index (`map.find`) for every 
row in the window into a small stack array. These probes are independent, so 
their cache misses overlap (several outstanding at once) instead of each 
stalling the next row.
   2. **Traversal phase** — walk the chains in probe-row order, exactly as 
before.
   
   `traverse_chain` remains the sole authority for the per-call output limit 
and resume offset, and is still invoked in probe-row order, so the 
**limit/offset resume protocol is unchanged**. Heads looked up for rows past 
the output limit are simply discarded and recomputed on the next call (the 
lookup is a pure function of the hash). The unique-key fast path and the 
mid-chain resume handling are untouched.
   
   ## Are these changes tested?
   
   Yes:
   - New unit test `test_limit_offset_window_boundary_matches_unbounded` 
asserts the windowed lookup yields output identical to a single unbounded call, 
across more than one window (20 probe rows) with chains/singletons/misses, for 
several limits — including ones that split mid-chain and don't divide the 
window size.
   - `cargo test -p datafusion-physical-plan --lib joins::` — 968 tests pass.
   - `cargo test -p datafusion-sqllogictest --test sqllogictests -- joins` — 
passes.
   
   ## Are there any user-facing changes?
   
   No. Internal performance optimization with identical results.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to