zeroshade opened a new pull request, #748:
URL: https://github.com/apache/arrow-go/pull/748
### Rationale for this change
The NEON assembly in `internal/utils/min_max_neon_arm64.s` was
machine-translated from compiler output (via asm2plan9s) and had two
significant inefficiencies:
1. **32-bit functions used half the available NEON register width** — `.2s`
(64-bit D-registers, 2 lanes) instead of `.4s` (128-bit Q-registers, 4 lanes),
leaving half the hardware throughput on the table.
2. **64-bit functions wasted 4 MOV instructions per loop iteration** — `BSL`
(bit select) is destructive to its mask operand, forcing register saves before
each compare+select. ARM64 provides `BIT`/`BIF` (bit insert if true/false)
which are destructive to the *accumulator* instead, eliminating the need for
saves entirely.
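To make the difference between the three instructions concrete, here is a minimal Go sketch that models them on a single 64-bit lane. The function names (`bsl`, `bit`, `bif`, `cmgt`, `updateMinMax`) are hypothetical, and the choice of a greater-than compare for both selects is an assumption for illustration; the actual assembly may pick its compares differently.

```go
package main

import "fmt"

// bsl models BSL on one lane. In hardware the mask register is also the
// destination, so the mask is overwritten by the result.
func bsl(mask, a, b uint64) uint64 { return (a & mask) | (b &^ mask) }

// bit models BIT: dst takes src bits where mask is 1 and keeps its own
// bits elsewhere. The accumulator (dst) is the register that is overwritten.
func bit(dst, src, mask uint64) uint64 { return (dst &^ mask) | (src & mask) }

// bif models BIF: dst takes src bits where mask is 0.
func bif(dst, src, mask uint64) uint64 { return (dst & mask) | (src &^ mask) }

// cmgt models a signed greater-than compare: all ones if a > b, else zero.
func cmgt(a, b int64) uint64 {
	if a > b {
		return ^uint64(0)
	}
	return 0
}

// updateMinMax folds one value into running min/max accumulators using
// one compare plus one BIT or BIF per accumulator (illustrative only).
func updateMinMax(lo, hi, v int64) (int64, int64) {
	gtHi := cmgt(v, hi)
	gtLo := cmgt(v, lo)
	hi = int64(bit(uint64(hi), uint64(v), gtHi)) // take v where v > hi
	lo = int64(bif(uint64(lo), uint64(v), gtLo)) // take v where !(v > lo)
	return lo, hi
}

func main() {
	fmt.Println(updateMinMax(5, 5, 9))
	fmt.Println(updateMinMax(5, 5, -3))
}
```

Because `bit` and `bif` write into the accumulator, the compare mask can be discarded after a single use, which is why no extra register copy is needed in this model.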
### What changes are included in this PR?
**Assembly optimizations (`min_max_neon_arm64.s`):**
- **32-bit (int32/uint32):** Widen all NEON operations from `.2s` to `.4s`,
processing 8 elements per loop iteration instead of 4. Use
`sminv`/`smaxv`/`uminv`/`umaxv` for single-instruction horizontal reduction
instead of manual `dup` + compare pairs. Adjust loop mask from `0xfffffffc`
(multiples of 4) to `0xfffffff8` (multiples of 8) and scalar tail threshold
from 3 to 7.
- **64-bit (int64/uint64):** Replace `BSL` + 4×`MOV` register saves with
`BIT`/`BIF` instructions. Restructure the 4 independent comparisons to be
grouped together for maximum instruction-level parallelism on out-of-order
cores, followed by 4 independent select operations.
- **Readability:** Replace `LBB0_3` style labels with descriptive names
(`int32_neon`, `int32_loop`, `int32_scalar`, etc.).
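The 32-bit loop structure described above can be modeled in plain Go. This is a rough sketch, not the assembly itself: `minMaxInt32Model`, `lower`, and `higher` are hypothetical names, the use of two 4-lane accumulators per result is an assumption about how 8 elements per iteration are split, and the empty-input result (`MaxInt32`, `MinInt32`) is assumed rather than taken from the PR.

```go
package main

import (
	"fmt"
	"math"
)

const lanes = 4 // one .4s register holds four 32-bit lanes

func lower(a, b int32) int32 {
	if a < b {
		return a
	}
	return b
}

func higher(a, b int32) int32 {
	if a > b {
		return a
	}
	return b
}

// minMaxInt32Model mirrors the described structure: a main loop over
// multiples of 8, a horizontal reduction, then a scalar tail of up to 7.
func minMaxInt32Model(values []int32) (lo, hi int32) {
	var minA, minB, maxA, maxB [lanes]int32
	for l := 0; l < lanes; l++ {
		minA[l], minB[l] = math.MaxInt32, math.MaxInt32
		maxA[l], maxB[l] = math.MinInt32, math.MinInt32
	}

	// Largest multiple of 8 not exceeding n; same effect as masking a
	// 32-bit length with 0xfffffff8.
	vecLen := len(values) &^ 7

	for i := 0; i < vecLen; i += 2 * lanes {
		for l := 0; l < lanes; l++ {
			a, b := values[i+l], values[i+lanes+l]
			minA[l], maxA[l] = lower(minA[l], a), higher(maxA[l], a)
			minB[l], maxB[l] = lower(minB[l], b), higher(maxB[l], b)
		}
	}

	// Horizontal reduction across lanes (the role of sminv/smaxv).
	lo, hi = math.MaxInt32, math.MinInt32
	for l := 0; l < lanes; l++ {
		lo = lower(lo, lower(minA[l], minB[l]))
		hi = higher(hi, higher(maxA[l], maxB[l]))
	}

	// Scalar tail: the remaining 0 to 7 elements.
	for _, v := range values[vecLen:] {
		lo, hi = lower(lo, v), higher(hi, v)
	}
	return lo, hi
}

func main() {
	fmt.Println(minMaxInt32Model([]int32{1, 2, 3, 4, 5, 6, 7, 8, -9}))
}
```

Inputs shorter than 8 skip the main loop entirely in this model, which corresponds to the scalar tail threshold of 7.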
**New test file (`min_max_test.go`):**
- Correctness tests for all 4 types (int32, uint32, int64, uint64)
validating NEON results against the pure Go implementation across 15 boundary
sizes, including NEON/scalar transition points (1, 3, 4, 7, 8, 9, 15, 16, 31,
63, 64, 100, 1024).
- Benchmarks for all 4 types at 5 input sizes (64, 256, 1024, 8192, 65536)
with throughput reporting.
### Benchmark results (Apple M4, 6 iterations, benchstat):
```
                         │   before    │               after                │
                         │   sec/op    │   sec/op     vs base               │
MinMaxInt32/n=64-10 5.992n ± 1% 3.675n ± 0% -38.67% (p=0.002 n=6)
MinMaxInt32/n=256-10 20.80n ± 1% 10.75n ± 1% -48.35% (p=0.002 n=6)
MinMaxInt32/n=1024-10 107.20n ± 0% 50.70n ± 0% -52.71% (p=0.002 n=6)
MinMaxInt32/n=8192-10 921.6n ± 0% 466.5n ± 0% -49.39% (p=0.002 n=6)
MinMaxInt32/n=65536-10 7.570µ ± 1% 3.909µ ± 0% -48.37% (p=0.002 n=6)
MinMaxUint32/n=64-10 6.039n ± 1% 3.694n ± 0% -38.83% (p=0.002 n=6)
MinMaxUint32/n=256-10 21.25n ± 0% 10.89n ± 0% -48.76% (p=0.002 n=6)
MinMaxUint32/n=1024-10 109.75n ± 0% 51.81n ± 0% -52.79% (p=0.002 n=6)
MinMaxUint32/n=8192-10 936.9n ± 0% 474.6n ± 0% -49.34% (p=0.002 n=6)
MinMaxUint32/n=65536-10 7.667µ ± 0% 3.960µ ± 0% -48.36% (p=0.002 n=6)
MinMaxInt64/n=64-10 11.18n ± 0% 11.10n ± 0% -0.72% (p=0.002 n=6)
MinMaxInt64/n=256-10 51.09n ± 0% 50.96n ± 0% -0.24% (p=0.022 n=6)
MinMaxInt64/n=1024-10 233.2n ± 0% 232.2n ± 0% -0.41% (p=0.013 n=6)
MinMaxInt64/n=8192-10 1.917µ ± 0% 1.910µ ± 1% -0.37% (p=0.002 n=6)
MinMaxInt64/n=65536-10 15.59µ ± 0% 15.53µ ± 0% -0.40% (p=0.004 n=6)
MinMaxUint64/n=64-10 11.10n ± 0% 11.06n ± 0% -0.41% (p=0.004 n=6)
MinMaxUint64/n=256-10 51.29n ± 0% 51.11n ± 0% ~ (p=0.052 n=6)
MinMaxUint64/n=1024-10 233.9n ± 1% 233.1n ± 0% ~ (p=0.219 n=6)
MinMaxUint64/n=8192-10 1.929µ ± 0% 1.917µ ± 0% -0.60% (p=0.006 n=6)
MinMaxUint64/n=65536-10 15.65µ ± 0% 15.59µ ± 0% -0.38% (p=0.024 n=6)
geomean 228.5n 164.8n -27.87%
```
**32-bit: ~2× throughput** (38 GB/s → 81 GB/s at n=1024). **Geomean: -27.9%
latency, +38.7% throughput.**
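As a sanity check on these figures, the arithmetic can be sketched in Go. The helper names `gbPerSec` and `throughputGain` are hypothetical; the inputs are the `MinMaxInt32/n=1024` and geomean rows from the table above, assuming 4 bytes per element and 1 GB = 10^9 bytes.

```go
package main

import "fmt"

// gbPerSec converts a benchmark time into throughput. Bytes per
// nanosecond is numerically equal to GB/s when 1 GB = 1e9 bytes.
func gbPerSec(elems, elemSize int, nsPerOp float64) float64 {
	return float64(elems*elemSize) / nsPerOp
}

// throughputGain returns the relative throughput increase implied by a
// drop in time per operation from before to after.
func throughputGain(before, after float64) float64 {
	return before/after - 1
}

func main() {
	fmt.Printf("before: %.1f GB/s\n", gbPerSec(1024, 4, 107.20))
	fmt.Printf("after:  %.1f GB/s\n", gbPerSec(1024, 4, 50.70))
	fmt.Printf("geomean throughput gain: %.1f%%\n", 100*throughputGain(228.5, 164.8))
}
```

The gain is larger than the time reduction because throughput is the reciprocal of time: a 27.9% drop in time per operation means dividing by roughly 0.72.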
The 64-bit improvement is small (~0.4%) because the M4's out-of-order engine
already absorbs most of the MOV cost via register renaming. On in-order or
narrower cores (Cortex-A55/A76) the BIT/BIF optimization is expected to show a
larger improvement.
### Are these changes tested?
Yes. New correctness tests validate all 4 NEON functions against the pure Go
reference implementation across 15 input sizes that exercise:
- Empty input (length 0)
- Scalar-only paths (length 1–7 for 32-bit, 1–3 for 64-bit)
- Exact NEON boundary (length 8 for 32-bit, length 4 for 64-bit)
- NEON + scalar tail (length 9, 15, 31, 63, 100)
- Pure NEON (length 16, 64, 1024)
Each test forces `MinInt`/`MaxInt` values at random positions to verify
extreme values are handled correctly.
### Are there any user-facing changes?
No API changes. This is a pure performance improvement to internal SIMD
routines used by Parquet statistics computation and Arrow dictionary operations.