Jiaqi-YP7 opened a new pull request, #18855: URL: https://github.com/apache/nuttx/pull/18855
Add dedicated NEON implementations for mutually aligned medium and long memcpy copies when building with __ARM_NEON__. These paths use NEON multi-register loads and stores while preserving the existing VFP implementation for non-NEON VFP configurations. NEON builds also define USE_VFP, so select the NEON implementation explicitly before falling back to VFP. Apply the same aligned-copy optimization to the armv7-a, armv7-r, and armv8-r implementations.

*Note: Please adhere to [Contributing Guidelines](https://github.com/apache/nuttx/blob/master/CONTRIBUTING.md).*

## Summary

This change adds dedicated NEON implementations for mutually aligned medium and long `memcpy` copies when building with `__ARM_NEON__`.

The new NEON paths use 64-byte multi-register loads and stores with the existing destination alignment hint. The existing VFP implementation is preserved and remains the fallback for VFP-enabled builds without NEON.

NEON builds also define `USE_VFP`, so the implementation now selects the NEON path explicitly before falling back to VFP. The change is an optimization for NEON-capable targets that keeps the existing VFP path valid for VFP-only builds.

The same update is applied to the `armv7-a`, `armv7-r`, and `armv8-r` implementations, which share the same `USE_NEON`/`USE_VFP` structure.

## Impact

This affects only ARM builds that enable `__ARM_NEON__` and use the architecture-specific `memcpy` implementation.

The functional behavior of `memcpy` is unchanged. The intended impact is improved throughput for mutually aligned medium and long copies on NEON-capable ARM targets while keeping non-NEON VFP builds on the existing VFP path.

M-profile implementations are not changed. `armv7-m` does not use NEON, and `armv8-m` has a separate MVE path.
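To illustrate why the selection order matters when NEON builds also define `USE_VFP`, here is a minimal C preprocessor sketch. It is not taken from `arch_memcpy.S`: `DEMO_HAVE_NEON` is a hypothetical stand-in for the compiler-provided `__ARM_NEON__`, and `path_vfp_first`, `path_neon_first`, and the `"other"` branch are illustrative assumptions only.

```c
/* DEMO_HAVE_NEON is a hypothetical stand-in for __ARM_NEON__ so that the
 * sketch behaves the same on any host compiler.
 */

#define DEMO_HAVE_NEON 1

#if DEMO_HAVE_NEON
#  define USE_NEON 1
#  define USE_VFP  1   /* NEON builds also define USE_VFP */
#endif

/* Hypothetical ordering that tests USE_VFP first: on a NEON build the
 * VFP branch is taken and the NEON branch is never reached.
 */

static const char *path_vfp_first(void)
{
#if defined(USE_VFP)
  return "vfp";
#elif defined(USE_NEON)
  return "neon";
#else
  return "other";   /* assumed placeholder for builds with neither macro */
#endif
}

/* Ordering described in the PR text: select NEON explicitly, then fall
 * back to VFP for VFP-only builds.
 */

static const char *path_neon_first(void)
{
#if defined(USE_NEON)
  return "neon";
#elif defined(USE_VFP)
  return "vfp";
#else
  return "other";   /* assumed placeholder for builds with neither macro */
#endif
}
```

With `DEMO_HAVE_NEON` set to 1, `path_neon_first()` reports the NEON branch while `path_vfp_first()` reports the VFP branch, which is the situation the explicit ordering is meant to avoid.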
## Testing

**Platform:** NuttX / ARMv8-R, Cortex-R52, 1250 MHz

**Compiler:** arm-none-eabi-gcc 12.2, `-O2`, NEON enabled

Test design:

We built and ran the same `memcpy_bench` application before and after this change, using the same board, configuration, toolchain, and runtime environment. The only intended difference between the two test images is the `arch_memcpy.S` change in this PR.

The benchmark uses two DDR buffers, each 3 MB plus offset headroom and aligned to a 64-byte boundary. Each case runs 8 warm-up iterations followed by 32 measured iterations, using the platform performance counter to report best time, average time, throughput in MB/s, and cycles per byte.

The test matrix is split by the execution paths in `arch_memcpy.S`:

- Small copies below 64 bytes, where the new medium/long NEON loops are not expected to run.
- Medium mutually aligned copies from 64 to 511 bytes, covering `.Lcpy_body_medium`.
- Long mutually aligned copies from 512 bytes to 3 MB, covering `.Lcpy_body_long`.
- Misaligned copies where `(src & 7) != (dst & 7)`, covering `.Lcpy_notaligned` as a control path that is not changed by this PR.

For the mutually aligned groups, we tested both fully aligned buffers (`src_off = 0`, `dst_off = 0`) and shifted-but-mutually-aligned buffers (`src_off = 5`, `dst_off = 5`). For the misaligned control group, we tested offset pairs such as `src_off = 0`, `dst_off = 1` and `src_off = 3`, `dst_off = 7`.

In other words, we designed four test groups.

```
Group A  Small (<64 B)
Group B  Medium aligned  64–511 B     .Lcpy_body_medium  ← add NEON path
Group C  Long aligned    ≥ 512 B      .Lcpy_body_long    ← add NEON path
Group D  Misaligned (src&7≠dst&7)     .Lcpy_notaligned
```

Expected result:

- Medium and long mutually aligned cases should show improved throughput or lower cycles per byte after this change.
- Small-copy cases should not show meaningful change from this PR.
- Misaligned control cases should remain broadly unchanged, since they continue to use the existing `.Lcpy_notaligned` path.

Result:

### Group A - Small (<64 B): **no change (expected)**

### Group B - Medium aligned (64–511 B): **+70–73% throughput**

| Size | Before (MB/s) | After (MB/s) | Gain | Before (cyc/B) | After (cyc/B) |
|---|---|---|---|---|---|
| 64 B | 42.7 | 73.5 | **+72 %** | 27.9 | 16.2 |
| 128 B | 41.9 | 71.4 | **+70 %** | 28.5 | 16.7 |
| 192 B | 43.2 | 74.6 | **+73 %** | 27.6 | 16.0 |
| 256 B | 42.7 | 73.4 | **+72 %** | 27.9 | 16.2 |
| 320 B | 43.2 | 74.8 | **+73 %** | 27.6 | 15.9 |
| 384 B | 42.9 | 74.0 | **+73 %** | 27.8 | 16.1 |
| 448 B | 43.2 | 74.9 | **+73 %** | 27.6 | 15.9 |

### Group C - Long aligned (≥ 512 B): **+73–110% throughput**

| Size | Before (MB/s) | After (MB/s) | Gain | Before (cyc/B) | After (cyc/B) |
|---|---|---|---|---|---|
| 512 B | 43.0 | 74.3 | **+73 %** | 27.7 | 16.0 |
| 1 KB | 39.5 | 74.6 | **+89 %** | 30.2 | 16.0 |
| 4 KB | 38.0 | 73.6 | **+94 %** | 31.4 | 16.2 |
| 8 KB | 35.1 | 72.3 | **+106 %** | 33.9 | 16.5 |
| 16 KB | 31.0 | 64.8 | **+109 %** | 38.4 | 18.4 |
| 64 KB | 31.0 | 64.7 | **+109 %** | 38.4 | 18.4 |
| 256 KB | 30.9 | 64.7 | **+110 %** | 38.6 | 18.4 |
| 1 MB | 30.8 | 64.6 | **+110 %** | 38.7 | 18.5 |
| 3 MB | 30.9 | 64.6 | **+109 %** | 38.6 | 18.5 |

### Group D - Misaligned (control group): **no change (expected)**
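The four test groups partition copies by length and by whether source and destination share the same offset modulo 8. A minimal C sketch of that partition might look like the following. The enum and `classify_copy` are hypothetical names that do not appear in `arch_memcpy.S`, and the sketch assumes the size check for small copies comes before the mutual-alignment check, which the description suggests but does not state outright.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the four benchmark groups. */

enum copy_group
{
  GROUP_A_SMALL,           /* < 64 bytes */
  GROUP_B_MEDIUM_ALIGNED,  /* 64..511 bytes, mutually aligned */
  GROUP_C_LONG_ALIGNED,    /* >= 512 bytes, mutually aligned */
  GROUP_D_MISALIGNED       /* (src & 7) != (dst & 7) */
};

/* Classify a copy by length and mutual alignment. The pointers are only
 * inspected as addresses and are never dereferenced.
 */

static enum copy_group classify_copy(const void *dst, const void *src,
                                     size_t n)
{
  uintptr_t d = (uintptr_t)dst;
  uintptr_t s = (uintptr_t)src;

  if (n < 64)
    {
      return GROUP_A_SMALL;   /* assumed to be checked first */
    }

  if ((s & 7) != (d & 7))
    {
      return GROUP_D_MISALIGNED;
    }

  if (n < 512)
    {
      return GROUP_B_MEDIUM_ALIGNED;
    }

  return GROUP_C_LONG_ALIGNED;
}
```

Under these assumptions, a 64-byte copy with `src_off = 5` and `dst_off = 5` falls into Group B, while any copy of 64 bytes or more with `src_off = 0` and `dst_off = 1` falls into Group D.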
