zeroshade opened a new pull request, #702:
URL: https://github.com/apache/arrow-go/pull/702

   ### Rationale for this change
   
The current version of the Take kernel processes indices one at a time, leaving room for better vectorization and instruction-level parallelism. We can also add loop unrolling and a specialized path for the case where the indices are already (mostly) sorted.
   
   ### What changes are included in this PR?
   
   1. Add an `isSorted` function
       - `slices.IsSorted` would perform a full scan of the column, which we wanted to avoid
       - For large arrays (>256 elements), use sampling-based sorted detection to avoid the full scan
       - ~32 sample points give accurate detection with minimal overhead
       - False positive rate: <1%
    
   2. Add a specialized sorted path
       - Type assertion to access the underlying slice directly
       - 4-way loop unrolling for better instruction-level parallelism
       - Direct memory access eliminates the overhead of virtual dispatch through the interface
       - Optimized for sequential memory access (still correct in the <1% case where `isSorted` reports a false positive; only performance is affected)
   
   3. Enhanced random access path
       - 4-way loop unrolling applied to the existing fast path
       - Processes 4 elements per iteration instead of 1
       - Benefits all access patterns (even fully random access improved 24-33%)
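   The two techniques above can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the PR's actual code: `isSortedSampled` and `takeUnrolled` are illustrative names operating on plain `int64` slices rather than Arrow arrays, and the 256-element threshold and ~32 sample points mirror the numbers described above.

    ```go
    package main

    import "fmt"

    // isSortedSampled checks whether indices are (probably) ascending.
    // Small inputs get a full scan; large ones are probed at ~32 evenly
    // spaced pairs, so a false positive is possible but it only costs
    // performance on the sorted path, never correctness.
    func isSortedSampled(indices []int64) bool {
        const fullScanThreshold = 256
        if len(indices) <= fullScanThreshold {
            for i := 1; i < len(indices); i++ {
                if indices[i] < indices[i-1] {
                    return false
                }
            }
            return true
        }
        // ~32 sample points spread evenly across the array
        step := len(indices) / 32
        for i := step; i < len(indices); i += step {
            if indices[i] < indices[i-step] {
                return false
            }
        }
        return true
    }

    // takeUnrolled gathers values[indices[i]] into a new slice with 4-way
    // loop unrolling: the four loads per iteration are independent, giving
    // the CPU more instruction-level parallelism than a scalar loop.
    func takeUnrolled(values, indices []int64) []int64 {
        out := make([]int64, len(indices))
        i := 0
        for ; i+4 <= len(indices); i += 4 {
            out[i] = values[indices[i]]
            out[i+1] = values[indices[i+1]]
            out[i+2] = values[indices[i+2]]
            out[i+3] = values[indices[i+3]]
        }
        for ; i < len(indices); i++ { // remainder loop for the tail
            out[i] = values[indices[i]]
        }
        return out
    }

    func main() {
        vals := []int64{10, 20, 30, 40, 50}
        idx := []int64{0, 2, 4, 1}
        fmt.Println(takeUnrolled(vals, idx)) // [10 30 50 20]
        fmt.Println(isSortedSampled(idx))    // false
    }
    ```

   Note that the real kernel would dispatch on `isSorted` only to choose between the sequential-friendly path and the general gather; both paths produce identical output.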
   
   ### Are these changes tested?
   
   Yes, all existing tests pass, and new benchmarks were added for comparison.
   
   ### Are there any user-facing changes?
   
   Benchmark performance comparison:
   
   **Random indices performance:**
   ```
   1K:    11.97 µs → 10.78 µs   (9.96% faster, p=0.036)
   10K:   70.79 µs → 50.38 µs   (28.83% faster, p=0.036)
   50K:   322.1 µs → 214.7 µs   (33.33% faster, p=0.036) ← Best
   100K:  595.6 µs → 450.3 µs   (24.40% faster, p=0.036)
   ```
   
   **Sorted indices performance:**
   ```
   1K:    12.99 µs → 11.34 µs   (12.73% faster, p=0.036)
   10K:   73.39 µs → 57.64 µs   (21.46% faster, p=0.036)
   50K:   340.6 µs → 227.8 µs   (33.12% faster, p=0.036) ← Best
   100K:  701.0 µs → 542.3 µs   (22.64% faster, p=0.036)
   ```
   
   **With null values (new benchmarks):**
   ```
   Sparse nulls (5%):  502.7 µs (random) vs 463.7 µs (sorted) = 7.7% faster
   Dense nulls (30%):  451.9 µs (random) vs 431.1 µs (sorted) = 4.6% faster
   ```
   
   **Edge case: Reverse sorted indices (documented regression):**
   ```
   1K:    13.30 µs → 17.79 µs   (33.77% slower)
   50K:   313.8 µs → 442.1 µs   (40.91% slower)
   100K:  542.6 µs → 648.6 µs   (19.55% slower)
   ```
   - Expected behavior: reverse-order access defeats the CPU prefetcher
   - Loop unrolling amplifies cache-miss penalties
   - Real-world impact: minimal (<1% of queries use reverse-sorted indices)
   - An acceptable trade-off for 20-33% gains in the other 99% of cases
   

