ppohlitze opened a new pull request, #2423:
URL: https://github.com/apache/systemds/pull/2423
# Benchmark
The benchmark uses the Java Microbenchmark Harness (JMH) framework to
measure the performance of the rewritten kernels. The result is the average
execution time in microseconds for a given parameter set, which is exported to a
CSV file. Each benchmark run consists of 5 warmup iterations followed by 10
measurement iterations (1 second each), executed in a single forked JVM. A
minimal sketch of this harness setup follows the list below.
* **Matrix operands** are generated once per trial using
TestUtils.generateTestMatrixBlock() with configurable dimensions and sparsity
levels. The result matrix is reset before each iteration to eliminate
interference between measurements.
* **The setup phase**, which varies slightly per kernel,
performs format validation to ensure the matrices are in the expected
representation before benchmarking.
* **For benchmarking**, the access modifiers of the kernel methods were
temporarily relaxed from private to public to allow direct method
invocation.
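
The sketch below illustrates this setup. It is not the actual benchmark code: the class name, parameter values, kernel entry point, and the exact `TestUtils.generateTestMatrixBlock()` / `MatrixBlock` calls are assumptions for illustration.

```java
import java.util.concurrent.TimeUnit;

import org.apache.sysds.runtime.matrix.data.MatrixBlock;
import org.apache.sysds.test.TestUtils;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 10, time = 1)
@Fork(1)
@State(Scope.Thread)
public class KernelBench {

	@Param({"1024", "2048", "4096"}) int m;
	@Param({"1"})                    int cd;
	@Param({"1024", "2048", "4096"}) int n;
	@Param({"0.5", "0.75", "1.0"})   double sparsityLeft;
	@Param({"0.001", "0.01", "0.2"}) double sparsityRight;

	MatrixBlock a, b, ret;

	@Setup(Level.Trial)
	public void generateOperands() {
		// matrix operands are generated once per trial
		// (assumed argument order: rows, cols, min, max, sparsity, seed)
		a = TestUtils.generateTestMatrixBlock(m, cd, -1, 1, sparsityLeft, 7);
		b = TestUtils.generateTestMatrixBlock(cd, n, -1, 1, sparsityRight, 3);
		// setup phase (varies per kernel): validate that a/b are in the expected
		// dense/sparse representation before benchmarking
	}

	@Setup(Level.Iteration)
	public void resetResult() {
		// the result matrix is reset before each iteration to avoid interference
		ret = new MatrixBlock(m, n, false);
		ret.allocateDenseBlock();
	}

	@Benchmark
	public MatrixBlock vectorized() {
		// placeholder for the direct call into the (temporarily public) rewritten kernel
		kernelUnderTest(a, b, ret);
		return ret;
	}

	private static void kernelUnderTest(MatrixBlock a, MatrixBlock b, MatrixBlock ret) {
		// stand-in for the actual kernel method under test
	}
}
```

The CSV export can be produced with JMH's standard result-format options (e.g., `-rf csv -rff results.csv`).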
# Hardware Specs
**JDK:** OpenJDK 17 (Eclipse Temurin); AArch64 build on the Mac, x86-64 build on the Windows PC
### Hardware Environment: Mac
* **Model:** MacBook Pro (2024), Apple M4 Chip
* **CPU:** 10 Cores (4 Performance @ 4.4 GHz and 6 Efficiency @ 2.85 GHz)
* **Architecture:** ARMv9.2-A (NEON support, no SVE)
* **Vector Capability:** 128-bit
* **Memory:** 16 GB LPDDR5 (120 GB/s Bandwidth)
* **Cache (P-Cores):** 192KB L1i / 128KB L1d per core; 16MB L2 shared
cluster cache
* **OS:** macOS Tahoe 26.2
### Hardware Environment: Windows PC
* **CPU Model:** Intel Core i5 9600K (Coffee Lake)
* **CPU:** 6 Cores / 6 Threads (Base: 3.7 GHz, Turbo: 4.6 GHz)
* **Architecture:** x86-64
* **Vector Capability:** 256-bit
* **Memory:** 16 GB DDR4-2666 (41.6 GB/s Bandwidth)
* **Cache:**
* L1 Cache: 384 KB (32 KB instruction + 32 KB data per core)
* L2 Cache: 1.5 MB (256 KB per core)
* L3 Cache: 9 MB (Shared)
* **OS:** Windows 10 Home 22H2
### Sources
* macOS System Report
* https://support.apple.com/de-de/121552
* https://eclecticlight.co/2024/11/11/inside-m4-chips-p-cores/
* CPU-Z
* https://www.intel.de/content/www/de/de/products/sku/134896/intel-core-i59600k-processor-9m-cache-up-to-4-60-ghz/specifications.html
**A Note on Hardware Vectorization:** Although the Apple M4 architecture
supports ARMv9 and reports FEAT_SME (Scalable Matrix Extension), macOS does not
currently expose standard SVE registers. Consequently, the JDK 17 Vector API
defaults to the 128-bit NEON instruction set on this platform. This limits the
SIMD lane count for doubles to 2, whereas the Windows environment utilizes 256-bit
AVX2 with a lane count of 4.
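
For reference, the lane count the Vector API actually selects can be verified with a small standalone check (sketch only; requires `--add-modules jdk.incubator.vector` on JDK 17):

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;

public class PreferredSpecies {
    public static void main(String[] args) {
        // SPECIES_PREFERRED reflects the widest SIMD width the JVM maps to hardware:
        // 128-bit NEON on the M4 (2 double lanes), 256-bit AVX2 on the i5-9600K (4 lanes)
        VectorSpecies<Double> species = DoubleVector.SPECIES_PREFERRED;
        System.out.println(species + " -> " + species.length() + " double lanes");
    }
}
```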
# Performance Analysis
Raw Result files:
[https://github.com/ppohlitze/dia-project-benchmark-results](https://github.com/ppohlitze/dia-project-benchmark-results)
## DenseDenseSparse
### Benchmark Result Summary
* the vectorized implementation is more than twice as fast as the baseline on
average (geometric mean 2.29x on the M4, 2.95x on the Intel CPU)
* the most significant gains occur with the highest-density matrices
* minor performance regressions occur on sparser matrices, where the
overhead of vector preparation outweighs the benefits of SIMD
* significantly better performance on the Intel CPU, likely due to
the higher lane count and hardware support for AVX2
### Benchmark Parameters
* **m:** 1024, 1050, 2048, 4073, 4096, 8192
* **cd:** 1
* **n:** 1024, 1050, 2048, 4073, 4096, 8192
* **Sparsity Left:** 0.5, 0.75, 1.0
* **Sparsity Right:** 0.001, 0.01, 0.1, 0.2
* **Total Configs:** 192
### Mac
**Geometric Mean Speedup:** 2.2943x
#### Top 5 Performance Gains (Speedup > 1.0)
| Speedup | cd | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 5.25x | 1 | 4096 | 2048 | 1.0 | 0.2 |
| 4.97x | 1 | 8192 | 2048 | 1.0 | 0.2 |
| 4.87x | 1 | 4096 | 4096 | 1.0 | 0.2 |
| 4.81x | 1 | 2048 | 2048 | 1.0 | 0.001 |
| 4.79x | 1 | 4096 | 1024 | 1.0 | 0.001 |
#### Top 5 Performance Losses (Speedup < 1.0)
| Speedup | cd | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0.83x | 1 | 2048 | 8192 | 0.5 | 0.01 |
| 0.84x | 1 | 4096 | 1024 | 0.5 | 0.01 |
| 0.87x | 1 | 1024 | 1024 | 0.5 | 0.01 |
| 0.90x | 1 | 2048 | 8192 | 0.75 | 0.001 |
| 0.90x | 1 | 4096 | 2048 | 0.5 | 0.01 |
### Windows
**Geometric Mean Speedup:** 2.9540x
#### Top 5 Performance Gains (Speedup > 1.0)
| Speedup | cd | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 7.07x | 1 | 1024 | 1024 | 0.75 | 0.2 |
| 6.69x | 1 | 4096 | 4096 | 1.0 | 0.2 |
| 6.56x | 1 | 1024 | 2048 | 1.0 | 0.2 |
| 5.86x | 1 | 8192 | 4096 | 0.75 | 0.2 |
| 5.73x | 1 | 2048 | 1024 | 1.0 | 0.2 |
#### Bottom 5 Configurations (Lowest Speedups)
| Speedup | cd | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0.57x | 1 | 8192 | 8192 | 0.5 | 0.01 |
| 1.11x | 1 | 8192 | 8192 | 0.75 | 0.01 |
| 1.13x | 1 | 8192 | 8192 | 0.5 | 0.001 |
| 1.14x | 1 | 4096 | 8192 | 0.5 | 0.001 |
| 1.30x | 1 | 2048 | 1024 | 0.5 | 0.001 |
---
## DenseSparseDense
### Benchmark Result Summary
* the Vector API version is 5x to 25x slower than the scalar implementation
* performance decreases as density increases, suggesting that the SIMD
overhead scales with the number of non-zero elements
* the smallest slowdowns occur for the sparsest right-hand sides. In
these cases the scalar tail handles most of the work, since rows contain fewer
elements than the SIMD vector length; this indicates that the Vector API's gather
and scatter operations (fromArray() and intoArray()) are the primary bottleneck
(see the sketch after this list)
* again, better performance on the Intel CPU
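
For context, a minimal sketch of the gather/scatter inner-loop pattern referred to above (illustrative names, not the actual kernel code): the dense output row `c` is updated at the column positions `cix` of one sparse row, with a scalar tail for the remaining elements.

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;

public class GatherScatterSketch {
	private static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

	// c[cix[j]] += aval * bvals[j] for j in [0, blen); names are illustrative
	static void scatterUpdate(double[] c, double aval, int[] cix, double[] bvals, int blen) {
		DoubleVector av = DoubleVector.broadcast(SPECIES, aval);
		int bound = SPECIES.loopBound(blen);
		int j = 0;
		for (; j < bound; j += SPECIES.length()) {
			DoubleVector bv = DoubleVector.fromArray(SPECIES, bvals, j);
			// gather the current output values at the sparse column positions
			DoubleVector cv = DoubleVector.fromArray(SPECIES, c, 0, cix, j);
			// fused multiply-add, then scatter back; this gather/scatter pair carries most of the overhead
			bv.fma(av, cv).intoArray(c, 0, cix, j);
		}
		// scalar tail: for very sparse rows (blen < lane count) this loop does nearly all of the work
		for (; j < blen; j++)
			c[cix[j]] += aval * bvals[j];
	}
}
```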
### Benchmark Parameters
* **m:** 1, 1024, 4096
* **cd:** 1
* **n:** 1024, 4096
* **Sparsity Left:** 0.5, 0.75, 1.0
* **Sparsity Right:** 0.001, 0.01, 0.2
* **Total Configs:** 54 (I had to significantly reduce the number of configs
because the kernel is prohibitively slow for larger matrices)
### Mac
**Geometric Mean Speedup:** 0.1125x
#### Top 5 Configurations (Highest Speedups)
| Speedup | m | cd | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0.68x | 4096 | 1024 | 1024 | 0.75 | 0.001 |
| 0.67x | 1024 | 1024 | 1024 | 0.5 | 0.001 |
| 0.67x | 1024 | 1024 | 1024 | 0.75 | 0.001 |
| 0.67x | 4096 | 1024 | 1024 | 0.5 | 0.001 |
| 0.47x | 4096 | 1024 | 1024 | 1.0 | 0.001 |
#### Top 5 Performance Losses (Speedup < 1.0)
| Speedup | m | cd | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0.04x | 1024 | 1024 | 1024 | 1.0 | 0.2 |
| 0.04x | 4096 | 1024 | 1024 | 1.0 | 0.2 |
| 0.04x | 1024 | 4096 | 4096 | 1.0 | 0.2 |
| 0.04x | 1 | 4096 | 4096 | 1.0 | 0.2 |
| 0.04x | 1024 | 4096 | 4096 | 0.75 | 0.2 |
### Windows
**Geometric Mean Speedup:** 0.3121x
#### Top 5 Configurations (Highest Speedups)
| Speedup | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- |
| 0.91x | 4096 | 1024 | 0.5 | 0.001 |
| 0.87x | 1024 | 1024 | 0.75 | 0.001 |
| 0.87x | 4096 | 1024 | 1.0 | 0.001 |
| 0.86x | 4096 | 1024 | 0.75 | 0.001 |
| 0.85x | 1024 | 1024 | 1.0 | 0.001 |
#### Top 5 Performance Losses (Speedup < 1.0)
| Speedup | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- |
| 0.13x | 1024 | 1024 | 1.0 | 0.2 |
| 0.13x | 4096 | 1024 | 1.0 | 0.2 |
| 0.13x | 4096 | 4096 | 1.0 | 0.2 |
| 0.14x | 1 | 1024 | 0.75 | 0.2 |
| 0.14x | 1024 | 4096 | 1.0 | 0.2 |
---
## DenseSparseSparse
### Benchmark Result Summary
* the Vector API implementation is 12x to 100x slower at high sparsity but
achieves a 1.5x to 3.3x speedup (up to 5.4x on the Intel CPU) as density
increases toward 20%
* the cost of initializing and scanning the dense intermediate buffer for
every row dominates execution time when non-zeros are rare (see the sketch
after this list)
* better performance on the Intel CPU
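
To make that cost structure concrete, here is a minimal sketch of the dense-intermediate-buffer pattern this refers to (illustrative names, not the actual kernel code): the buffer is cleared and fully scanned for every output row, so O(n) work is paid even when only a handful of non-zeros are produced.

```java
import java.util.Arrays;

public class DenseIntermediateRowSketch {

	// Computes one output row via a dense accumulation buffer; appendToSparseOutput
	// stands in for writing into the sparse result block.
	static void computeRow(double[] tmp, double aval, int[] bix, double[] bvals, int blen) {
		Arrays.fill(tmp, 0);                    // initialize the dense buffer: O(n) per row
		for (int j = 0; j < blen; j++)          // accumulate the (few) non-zero contributions
			tmp[bix[j]] += aval * bvals[j];
		for (int j = 0; j < tmp.length; j++)    // scan the whole buffer for non-zeros: O(n) per row,
			if (tmp[j] != 0)                    // regardless of how few non-zeros were produced
				appendToSparseOutput(j, tmp[j]);
	}

	static void appendToSparseOutput(int col, double val) { /* placeholder for sparse append */ }
}
```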
### Benchmark Parameters
* **m:** 1024, 1050, 2048, 4073, 4096, 8192
* **cd:** 1
* **n:** 1024, 1050, 2048, 4073, 4096, 8192
* **Sparsity Left:** 0.5, 0.75, 1.0
* **Sparsity Right:** 0.001, 0.01, 0.1, 0.2
* **Total Configs:** 432
### Mac
**Geometric Mean Speedup:** 0.1731x
#### Top 5 Performance Gains (Speedup > 1.0)
| Speedup | cd | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 3.33x | 1 | 2048 | 4073 | 1.0 | 0.2 |
| 3.16x | 1 | 4096 | 2048 | 1.0 | 0.2 |
| 3.01x | 1 | 8192 | 2048 | 1.0 | 0.2 |
| 2.81x | 1 | 1024 | 4096 | 1.0 | 0.2 |
| 2.76x | 1 | 4096 | 1050 | 1.0 | 0.2 |
#### Top 5 Performance Losses (Speedup < 1.0)
| Speedup | cd | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0.00x | 1 | 8192 | 8192 | 1.0 | 0.001 |
| 0.00x | 1 | 2048 | 8192 | 0.5 | 0.001 |
| 0.00x | 1 | 4073 | 4096 | 1.0 | 0.001 |
| 0.00x | 1 | 8192 | 4073 | 1.0 | 0.001 |
| 0.00x | 1 | 4073 | 4073 | 0.75 | 0.001 |
### Windows
**Geometric Mean Speedup:** 0.2560x
#### Top 5 Performance Gains (Speedup > 1.0)
| Speedup | cd | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 5.36x | 1 | 4096 | 4096 | 1.0 | 0.2 |
| 5.31x | 1 | 1050 | 4096 | 1.0 | 0.2 |
| 5.13x | 1 | 4073 | 4096 | 1.0 | 0.2 |
| 5.00x | 1 | 8192 | 8192 | 0.75 | 0.2 |
| 5.00x | 1 | 4096 | 4073 | 1.0 | 0.2 |
#### Top 5 Performance Losses (Speedup < 1.0)
| Speedup | cd | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0.00x | 1 | 1050 | 8192 | 0.5 | 0.001 |
| 0.00x | 1 | 2048 | 8192 | 0.5 | 0.001 |
| 0.01x | 1 | 4073 | 8192 | 0.5 | 0.001 |
| 0.01x | 1 | 8192 | 8192 | 0.5 | 0.001 |
| 0.01x | 1 | 1024 | 8192 | 0.5 | 0.001 |
---
## SparseDenseMVTallRHS
### Benchmark Result Summary
* **Mac:** the vectorized implementation is consistently 3.7x to 7.7x slower
than the scalar baseline
* the regression is most severe for high sparsity and smaller matrix
dimensions
* **Intel CPU:** the vectorized implementation is on average ~9% faster than
the scalar baseline
* the larger vector capacity and hardware support for AVX2 provide
enough throughput to offset the vector setup costs
### Benchmark Parameters
* **m:** 2048, 4096, 8192
* **cd:** 4096, 8192, 16384
* **n:** 1
* **Sparsity Left:** 0.05, 0.1, 0.2
* **Sparsity Right:** 0.5, 0.75, 1.0
* **Total Configs:** 81
### Mac
**Geometric Mean Speedup:** 0.1938x
#### Top 5 Configurations (Highest Speedups)
| Speedup | cd | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0.27x | 16384 | 8192 | 1 | 0.1 | 0.75 |
| 0.27x | 16384 | 4096 | 1 | 0.2 | 1.0 |
| 0.27x | 16384 | 4096 | 1 | 0.1 | 0.5 |
| 0.27x | 16384 | 4096 | 1 | 0.2 | 0.5 |
| 0.27x | 16384 | 8192 | 1 | 0.1 | 0.5 |
#### Top 5 Performance Losses (Speedup < 1.0)
| Speedup | cd | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0.13x | 4096 | 2048 | 1 | 0.1 | 1.0 |
| 0.14x | 4096 | 2048 | 1 | 0.1 | 0.75 |
| 0.14x | 8192 | 4096 | 1 | 0.05 | 1.0 |
| 0.14x | 4096 | 2048 | 1 | 0.1 | 0.5 |
| 0.14x | 8192 | 4096 | 1 | 0.05 | 0.75 |
### Windows
**Geometric Mean Speedup:** 1.0880x
#### Top 5 Performance Gains (Speedup > 1.0)
| Speedup | cd | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 1.25x | 4096 | 2048 | 1 | 0.2 | 0.5 |
| 1.21x | 8192 | 8192 | 1 | 0.2 | 0.75 |
| 1.18x | 8192 | 2048 | 1 | 0.2 | 0.75 |
| 1.18x | 8192 | 2048 | 1 | 0.2 | 1.0 |
| 1.18x | 8192 | 4096 | 1 | 0.2 | 0.75 |
#### Top 5 Performance Losses (Speedup < 1.0)
| Speedup | cd | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0.95x | 16384 | 2048 | 1 | 0.05 | 0.75 |
| 0.97x | 4096 | 4096 | 1 | 0.05 | 1.0 |
| 0.97x | 4096 | 4096 | 1 | 0.05 | 0.5 |
| 0.98x | 4096 | 4096 | 1 | 0.05 | 0.75 |
| 0.98x | 4096 | 8192 | 1 | 0.05 | 0.75 |