[PR] perf: optimize ARM64 NEON min/max assembly [arrow-go]

via GitHub Fri, 03 Apr 2026 15:07:37 -0700


zeroshade opened a new pull request, #748:
URL: https://github.com/apache/arrow-go/pull/748


   ### Rationale for this change
   
   The NEON assembly in `internal/utils/min_max_neon_arm64.s` was 
machine-translated from compiler output (via asm2plan9s) and had two 
significant inefficiencies:
   
   1. **32-bit functions used half the available NEON register width** — `.2s` 
(64-bit D-registers, 2 lanes) instead of `.4s` (128-bit Q-registers, 4 lanes), 
leaving half the hardware throughput on the table.
   2. **64-bit functions wasted 4 MOV instructions per loop iteration** — `BSL` 
(bit select) is destructive to its mask operand, forcing register saves before 
each compare+select. ARM64 provides `BIT`/`BIF` (bit insert if true/false) 
which are destructive to the *accumulator* instead, eliminating the need for 
saves entirely.
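
To make the difference in item 2 concrete, here is a minimal Go sketch that models the three instructions on a single 64-bit lane. The helper names (`bsl`, `bit`, `bif`, `cmgt`, `updateMin`, `updateMax`) are hypothetical and exist only for illustration. The update functions show one way a compare plus `BIT`/`BIF` can fold a value into an accumulator; the exact instruction sequence in the assembly may differ.

```go
package main

// Lane-level models of three AArch64 NEON bitwise selects on one 64-bit lane.
// Parameter order follows the A64 operand order "OP Vd, Vn, Vm"; each function
// returns the new value of Vd.

// bsl models BSL: Vd holds the mask on entry and the result on exit,
// so the mask is overwritten by the instruction.
func bsl(vd, vn, vm uint64) uint64 {
	return (vd & vn) | (^vd & vm)
}

// bit models BIT: bits of Vn are inserted into Vd where Vm is 1.
func bit(vd, vn, vm uint64) uint64 {
	return (vd &^ vm) | (vn & vm)
}

// bif models BIF: bits of Vn are inserted into Vd where Vm is 0.
func bif(vd, vn, vm uint64) uint64 {
	return (vd & vm) | (vn &^ vm)
}

// cmgt models a signed compare-greater-than: all ones when a > b, else zero.
func cmgt(a, b int64) uint64 {
	if a > b {
		return ^uint64(0)
	}
	return 0
}

// updateMin folds x into a running minimum. The accumulator is both the
// destination and one of the inputs, so the mask never needs to be copied.
func updateMin(acc, x int64) int64 {
	mask := cmgt(acc, x) // all ones when x is the new minimum
	return int64(bit(uint64(acc), uint64(x), mask))
}

// updateMax folds x into a running maximum using BIF with the same kind of mask:
// x is inserted wherever "acc > x" is false.
func updateMax(acc, x int64) int64 {
	mask := cmgt(acc, x)
	return int64(bif(uint64(acc), uint64(x), mask))
}
```

With `BSL`, the mask register doubles as the destination, so a mask that is needed again has to be copied first. With `BIT`/`BIF`, the destination is the accumulator, which is what the loop wants to overwrite anyway.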
   
   ### What changes are included in this PR?
   
   **Assembly optimizations (`min_max_neon_arm64.s`):**
   
   - **32-bit (int32/uint32):** Widen all NEON operations from `.2s` to `.4s`, 
processing 8 elements per loop iteration instead of 4. Use 
`sminv`/`smaxv`/`uminv`/`umaxv` for single-instruction horizontal reduction 
instead of manual `dup` + compare pairs. Adjust loop mask from `0xfffffffc` 
(multiples of 4) to `0xfffffff8` (multiples of 8) and scalar tail threshold 
from 3 to 7.
   - **64-bit (int64/uint64):** Replace `BSL` + 4×`MOV` register saves with 
`BIT`/`BIF` instructions. Restructure the 4 independent comparisons to be 
grouped together for maximum instruction-level parallelism on out-of-order 
cores, followed by 4 independent select operations.
   - **Readability:** Replace `LBB0_3` style labels with descriptive names 
(`int32_neon`, `int32_loop`, `int32_scalar`, etc.).
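
The 32-bit loop shape described in the list above can be modeled in plain Go. This is a sketch under assumptions, not the assembly itself: the function name `minMaxInt32Model`, the per-lane arrays, and the starting values (`math.MaxInt32` / `math.MinInt32`) are illustrative choices.

```go
package main

import "math"

// lanes is the number of int32 elements in one 128-bit Q register.
const lanes = 4

func minI32(a, b int32) int32 {
	if b < a {
		return b
	}
	return a
}

func maxI32(a, b int32) int32 {
	if b > a {
		return b
	}
	return a
}

// minMaxInt32Model is a hypothetical scalar model of the widened loop:
// two 4-lane accumulators per result, 8 elements consumed per iteration.
func minMaxInt32Model(values []int32) (lo, hi int32) {
	lo, hi = math.MaxInt32, math.MinInt32 // assumed starting values

	// Largest multiple of 8 not exceeding len(values). For a 32-bit length
	// this is the same as masking with 0xfffffff8.
	vec := len(values) &^ 7

	if vec > 0 {
		var minA, minB, maxA, maxB [lanes]int32
		for l := 0; l < lanes; l++ {
			minA[l], minB[l] = lo, lo
			maxA[l], maxB[l] = hi, hi
		}
		for i := 0; i < vec; i += 2 * lanes {
			for l := 0; l < lanes; l++ {
				a, b := values[i+l], values[i+lanes+l]
				minA[l], maxA[l] = minI32(minA[l], a), maxI32(maxA[l], a)
				minB[l], maxB[l] = minI32(minB[l], b), maxI32(maxB[l], b)
			}
		}
		// Horizontal reduction: the role sminv/smaxv play in the assembly.
		for l := 0; l < lanes; l++ {
			lo = minI32(lo, minI32(minA[l], minB[l]))
			hi = maxI32(hi, maxI32(maxA[l], maxB[l]))
		}
	}

	// Scalar tail: at most 7 leftover elements.
	for _, v := range values[vec:] {
		lo, hi = minI32(lo, v), maxI32(hi, v)
	}
	return lo, hi
}
```

Inputs shorter than 8 elements skip the vector part entirely, which is why the scalar threshold moves from 3 to 7 when the loop stride doubles.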
   
   **New test file (`min_max_test.go`):**
   
   - Correctness tests for all 4 types (int32, uint32, int64, uint64) that 
validate NEON results against the pure Go implementation across 15 boundary 
sizes, including NEON/scalar transition points (1, 3, 4, 7, 8, 9, 15, 16, 31, 
63, 64, 100, 1024).
   - Benchmarks for all 4 types at 5 input sizes (64, 256, 1024, 8192, 65536) 
with throughput reporting.
   
   ### Benchmark results (Apple M4, 6 iterations, benchstat):
   
   ```
                           │    before     │               after                │
                           │    sec/op     │   sec/op     vs base               │
   MinMaxInt32/n=64-10       5.992n ± 1%    3.675n ± 0%   -38.67% (p=0.002 n=6)
   MinMaxInt32/n=256-10      20.80n ± 1%    10.75n ± 1%   -48.35% (p=0.002 n=6)
   MinMaxInt32/n=1024-10    107.20n ± 0%    50.70n ± 0%   -52.71% (p=0.002 n=6)
   MinMaxInt32/n=8192-10     921.6n ± 0%    466.5n ± 0%   -49.39% (p=0.002 n=6)
   MinMaxInt32/n=65536-10    7.570µ ± 1%    3.909µ ± 0%   -48.37% (p=0.002 n=6)
   MinMaxUint32/n=64-10      6.039n ± 1%    3.694n ± 0%   -38.83% (p=0.002 n=6)
   MinMaxUint32/n=256-10     21.25n ± 0%    10.89n ± 0%   -48.76% (p=0.002 n=6)
   MinMaxUint32/n=1024-10   109.75n ± 0%    51.81n ± 0%   -52.79% (p=0.002 n=6)
   MinMaxUint32/n=8192-10    936.9n ± 0%    474.6n ± 0%   -49.34% (p=0.002 n=6)
   MinMaxUint32/n=65536-10   7.667µ ± 0%    3.960µ ± 0%   -48.36% (p=0.002 n=6)
   MinMaxInt64/n=64-10       11.18n ± 0%    11.10n ± 0%    -0.72% (p=0.002 n=6)
   MinMaxInt64/n=256-10      51.09n ± 0%    50.96n ± 0%    -0.24% (p=0.022 n=6)
   MinMaxInt64/n=1024-10     233.2n ± 0%    232.2n ± 0%    -0.41% (p=0.013 n=6)
   MinMaxInt64/n=8192-10     1.917µ ± 0%    1.910µ ± 1%    -0.37% (p=0.002 n=6)
   MinMaxInt64/n=65536-10    15.59µ ± 0%    15.53µ ± 0%    -0.40% (p=0.004 n=6)
   MinMaxUint64/n=64-10      11.10n ± 0%    11.06n ± 0%    -0.41% (p=0.004 n=6)
   MinMaxUint64/n=256-10     51.29n ± 0%    51.11n ± 0%         ~ (p=0.052 n=6)
   MinMaxUint64/n=1024-10    233.9n ± 1%    233.1n ± 0%         ~ (p=0.219 n=6)
   MinMaxUint64/n=8192-10    1.929µ ± 0%    1.917µ ± 0%    -0.60% (p=0.006 n=6)
   MinMaxUint64/n=65536-10   15.65µ ± 0%    15.59µ ± 0%    -0.38% (p=0.024 n=6)
   geomean                    228.5n         164.8n        -27.87%
   ```
   
   **32-bit: ~2× throughput** (38 GB/s → 81 GB/s at n=1024). **Geomean: -27.9% 
latency, +38.7% throughput.**
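
The throughput figures follow directly from the table. A small Go sketch of the arithmetic, with hypothetical helper names (`throughputGBps`, `speedup`) and 1 GB taken as 1e9 bytes:

```go
package main

// throughputGBps converts a benchmark result into GB/s (1 GB = 1e9 bytes).
// Bytes per nanosecond is numerically equal to gigabytes per second.
func throughputGBps(n, elemSize int, nsPerOp float64) float64 {
	return float64(n*elemSize) / nsPerOp
}

// speedup returns how many times more work per unit time the "after" run does.
func speedup(beforeNs, afterNs float64) float64 {
	return beforeNs / afterNs
}
```

For `MinMaxInt32/n=1024`, `throughputGBps(1024, 4, 107.20)` is roughly 38.2 and `throughputGBps(1024, 4, 50.70)` is roughly 80.8. For the geomean row, `speedup(228.5, 164.8)` is roughly 1.387, which is where the +38.7% throughput figure comes from.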
   
   The 64-bit improvement is small (~0.4%) because the M4's out-of-order engine 
already absorbs MOV latency via register renaming. On in-order or narrower 
cores (e.g. Cortex-A55, Cortex-A76), the BIT/BIF optimization is expected to 
show a larger improvement.
   
   ### Are these changes tested?
   
   Yes. New correctness tests validate all 4 NEON functions against the pure Go 
reference implementation across 15 input sizes that exercise:
   - Empty input (length 0)
   - Scalar-only paths (length 1–7 for 32-bit, 1–3 for 64-bit)
   - Exact NEON boundary (length 8 for 32-bit, length 4 for 64-bit)
   - NEON + scalar tail (length 9, 15, 31, 63, 100)
   - Pure NEON (length 16, 64, 1024)
   
   Each test forces `MinInt`/`MaxInt` values at random positions to verify 
extreme values are handled correctly.
   
   ### Are there any user-facing changes?
   
   No API changes. This is a pure performance improvement to internal SIMD 
routines used by Parquet statistics computation and Arrow dictionary operations.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
