[PR] perf: Cache num_output_rows in sort merge join to avoid O(n) recount [datafusion]

via GitHub Sun, 22 Feb 2026 06:54:23 -0800


andygrove opened a new pull request, #20478:
URL: https://github.com/apache/datafusion/pull/20478


   ## Which issue does this PR close?
   
   N/A - performance optimization
   
   ## Rationale for this change
   
   In the sort merge join (SMJ) tight loop (`join_partial`), 
`num_unfrozen_pairs()` was called **twice per iteration**: once in the loop 
guard and once to supply the `num_unfrozen_pairs` argument of 
`append_output_pair`. This method iterates all chunks in `output_indices` and 
sums their lengths, which costs O(num_chunks) per call. Over a full batch of 
`batch_size` iterations, this makes the inner loop O(batch_size * num_chunks) 
instead of O(batch_size).
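
   To make the cost concrete, here is a minimal sketch of the recount pattern 
described above. The `Chunk` type and both functions are hypothetical, 
simplified stand-ins written for illustration; they are not the actual 
DataFusion definitions.

```rust
/// Hypothetical, simplified stand-in for one chunk of output indices.
pub struct Chunk {
    pub streamed_indices: Vec<usize>,
}

/// Sums chunk lengths on every call, so each call costs O(num_chunks).
/// Returns (row count, chunks visited) so the cost is observable.
pub fn recount(chunks: &[Chunk]) -> (usize, usize) {
    let mut rows = 0;
    let mut visited = 0;
    for chunk in chunks {
        rows += chunk.streamed_indices.len();
        visited += 1;
    }
    (rows, visited)
}

/// Simulates a loop that recounts twice per iteration and
/// returns the total number of chunk visits it performs.
pub fn total_chunk_visits(chunks: &[Chunk], iterations: usize) -> usize {
    let mut total = 0;
    for _ in 0..iterations {
        total += recount(chunks).1; // recount for the loop guard
        total += recount(chunks).1; // recount for the append call
    }
    total
}
```

   With 4 chunks and 10 iterations this sketch performs 2 * 10 * 4 = 80 chunk 
visits, which is the O(batch_size * num_chunks) growth the rationale refers to.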
   
   ## What changes are included in this PR?
   
   Add a `num_output_rows` field to `StreamedBatch` that is incremented on each 
append and reset on freeze, replacing the O(num_chunks) summation with an O(1) 
field read.
   
   - Added `num_output_rows: usize` field to `StreamedBatch`, initialized to `0`
   - Increment `num_output_rows` in `append_output_pair()` after each append
   - `num_output_rows()` now returns the cached field directly
   - Reset to `0` in `freeze_streamed()` when `output_indices` is cleared
   - Removed the `num_unfrozen_pairs` parameter from `append_output_pair()` 
since it can now read `self.num_output_rows` directly
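
   The list above can be sketched roughly as follows. This is a simplified 
model, not the real DataFusion code: the `Chunk` type, the `chunk_capacity` 
parameter, the chunking rule, and the `recount` helper are assumptions made for 
illustration, and `freeze_streamed` is modeled here as a method on the batch 
only to keep the sketch self-contained.

```rust
/// Hypothetical, simplified stand-in for a chunk of output index pairs.
#[derive(Default)]
pub struct Chunk {
    pub streamed_indices: Vec<usize>,
}

/// Simplified model of a streamed batch; only the counter logic is shown.
#[derive(Default)]
pub struct StreamedBatch {
    pub output_indices: Vec<Chunk>,
    /// Cached number of unfrozen output rows, kept in sync with `output_indices`.
    num_output_rows: usize,
}

impl StreamedBatch {
    /// O(1): returns the cached counter instead of summing chunk lengths.
    pub fn num_output_rows(&self) -> usize {
        self.num_output_rows
    }

    /// Appends one pair and increments the cached counter.
    /// `chunk_capacity` is an assumed chunking rule for this sketch.
    pub fn append_output_pair(&mut self, streamed_idx: usize, chunk_capacity: usize) {
        let needs_new_chunk = self
            .output_indices
            .last()
            .map_or(true, |c| c.streamed_indices.len() >= chunk_capacity);
        if needs_new_chunk {
            self.output_indices.push(Chunk::default());
        }
        // A chunk is guaranteed to exist here because one was pushed above if needed.
        self.output_indices
            .last_mut()
            .unwrap()
            .streamed_indices
            .push(streamed_idx);
        self.num_output_rows += 1;
    }

    /// Clears the pending pairs and resets the counter in the same place,
    /// so the two can never drift apart.
    pub fn freeze_streamed(&mut self) {
        self.output_indices.clear();
        self.num_output_rows = 0;
    }

    /// The O(num_chunks) summation, kept here only to check the invariant.
    pub fn recount(&self) -> usize {
        self.output_indices
            .iter()
            .map(|c| c.streamed_indices.len())
            .sum()
    }
}
```

   The invariant to preserve is that `num_output_rows()` always equals what 
the summation would return, which is why the increment sits next to the push 
and the reset sits next to the clear.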
   
   ## Are these changes tested?
   
   Yes. All 48 existing `sort_merge_join` tests pass. This is a pure refactor 
of an internal counter with no behavioral change.
   
   ## Are there any user-facing changes?
   
   No.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)



