ppohlitze opened a new pull request, #2423:
URL: https://github.com/apache/systemds/pull/2423
# Benchmark
The benchmark uses the Java Microbenchmark Harness (JMH) framework to
measure the performance of the rewritten kernels. The result is the average
execution time in microseconds for a given parameter set, which is exported to a
CSV file. Each benchmark run consists of 5 warmup iterations followed by 10
measurement iterations (1 second each), executed in a single forked JVM. A
minimal sketch of this harness setup follows the list below.
* **Matrix operands** are generated once per trial using
TestUtils.generateTestMatrixBlock() with configurable dimensions and sparsity
levels. The result matrix is reset before each iteration to eliminate
interference between measurements.
* **The setup phase**, which varies slightly per kernel,
performs format validation to ensure the matrices are in the expected
representation before benchmarking.
* **For benchmarking**, the access modifiers of the kernel methods were
temporarily relaxed from private to public to allow direct method
invocation.
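
The sketch below illustrates this setup. It is not the actual benchmark code: the class name, parameter values, kernel entry point, and the exact `TestUtils.generateTestMatrixBlock()` / `MatrixBlock` calls are assumptions for illustration.

```java
import java.util.concurrent.TimeUnit;

import org.apache.sysds.runtime.matrix.data.MatrixBlock;
import org.apache.sysds.test.TestUtils;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 10, time = 1)
@Fork(1)
@State(Scope.Thread)
public class KernelBench {

	@Param({"1024", "2048", "4096"}) int m;
	@Param({"1"})                    int cd;
	@Param({"1024", "2048", "4096"}) int n;
	@Param({"0.5", "0.75", "1.0"})   double sparsityLeft;
	@Param({"0.001", "0.01", "0.2"}) double sparsityRight;

	MatrixBlock a, b, ret;

	@Setup(Level.Trial)
	public void generateOperands() {
		// matrix operands are generated once per trial
		// (assumed argument order: rows, cols, min, max, sparsity, seed)
		a = TestUtils.generateTestMatrixBlock(m, cd, -1, 1, sparsityLeft, 7);
		b = TestUtils.generateTestMatrixBlock(cd, n, -1, 1, sparsityRight, 3);
		// setup phase (varies per kernel): validate that a/b are in the expected
		// dense/sparse representation before benchmarking
	}

	@Setup(Level.Iteration)
	public void resetResult() {
		// the result matrix is reset before each iteration to avoid interference
		ret = new MatrixBlock(m, n, false);
		ret.allocateDenseBlock();
	}

	@Benchmark
	public MatrixBlock vectorized() {
		// placeholder for the direct call into the (temporarily public) rewritten kernel
		kernelUnderTest(a, b, ret);
		return ret;
	}

	private static void kernelUnderTest(MatrixBlock a, MatrixBlock b, MatrixBlock ret) {
		// stand-in for the actual kernel method under test
	}
}
```

The CSV export can be produced with JMH's standard result-format options (e.g., `-rf csv -rff results.csv`).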
# Hardware Specs
**JDK:** OpenJDK 17 (Eclipse Temurin); AArch64 build on the Mac, x86-64 build on the Windows PC
### Hardware Environment: Mac
* **Model:** MacBook Pro (2024), Apple M4 Chip
* **CPU:** 10 Cores (4 Performance @ 4.4 GHz and 6 Efficiency @ 2.85 GHz)
* **Architecture:** ARMv9.2-A (NEON support, no SVE)
* **Vector Capability:** 128-bit
* **Memory:** 16 GB LPDDR5 (120 GB/s Bandwidth)
* **Cache (P-Cores):** 192KB L1i / 128KB L1d per core; 16MB L2 shared
cluster cache
* **OS:** macOS Tahoe 26.2
### Hardware Environment: Windows PC
* **CPU Model:** Intel Core i5 9600K (Coffee Lake)
* **CPU:** 6 Cores / 6 Threads (Base: 3.7 GHz, Turbo: 4.6 GHz)
* **Architecture:** x86-64
* **Vector Capability:** 256-bit
* **Memory:** 16 GB DDR4-2666 (41.6 GB/s Bandwidth)
* **Cache:**
* L1 Cache: 384 KB (32 KB instruction + 32 KB data per core)
* L2 Cache: 1.5 MB (256 KB per core)
* L3 Cache: 9 MB (Shared)
* **OS:** Windows 10 Home 22H2
### Sources
* macOS System Report
* https://support.apple.com/de-de/121552
* https://eclecticlight.co/2024/11/11/inside-m4-chips-p-cores/
* CPU-Z
* https://www.intel.de/content/www/de/de/products/sku/134896/intel-core-i59600k-processor-9m-cache-up-to-4-60-ghz/specifications.html
**A Note on Hardware Vectorization:** Although the Apple M4 architecture
supports ARMv9 and reports FEAT_SME (Scalable Matrix Extension), macOS does not
currently expose standard SVE registers. Consequently, the JDK 17 Vector API
defaults to the 128-bit NEON instruction set on this platform. This limits the
SIMD lane count for doubles to 2, whereas the Windows environment utilizes 256-bit
AVX2 with a lane count of 4.
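
For reference, the lane count the Vector API actually selects can be verified with a small standalone check (sketch only; requires `--add-modules jdk.incubator.vector` on JDK 17):

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;

public class PreferredSpecies {
    public static void main(String[] args) {
        // SPECIES_PREFERRED reflects the widest SIMD width the JVM maps to hardware:
        // 128-bit NEON on the M4 (2 double lanes), 256-bit AVX2 on the i5-9600K (4 lanes)
        VectorSpecies<Double> species = DoubleVector.SPECIES_PREFERRED;
        System.out.println(species + " -> " + species.length() + " double lanes");
    }
}
```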
# Performance Analysis
Raw Result files:
[https://github.com/ppohlitze/dia-project-benchmark-results](https://github.com/ppohlitze/dia-project-benchmark-results)
## DenseDenseSparse
### Benchmark Result Summary
* the vectorized implementation is more than twice as fast as the baseline on
average (geometric mean 2.29x on the M4, 2.95x on the Intel CPU)
* the most significant gains occur with the highest-density matrices
* minor performance regressions occur on sparser matrices, where the
overhead of vector preparation outweighs the benefits of SIMD
* significantly better performance on the Intel CPU, likely due to
the higher lane count and hardware support for AVX2
### Benchmark Parameters
* **m:** 1024, 1050, 2048, 4073, 4096, 8192
* **cd:** 1
* **n:** 1024, 1050, 2048, 4073, 4096, 8192
* **Sparsity Left:** 0.5, 0.75, 1.0
* **Sparsity Right:** 0.001, 0.01, 0.1, 0.2
* **Total Configs:** 192
### Mac
**Geometric Mean Speedup:** 2.2943x
#### Top 5 Performance Gains (Speedup > 1.0)
| Speedup | cd | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 5.25x | 1 | 4096 | 2048 | 1.0 | 0.2 |
| 4.97x | 1 | 8192 | 2048 | 1.0 | 0.2 |
| 4.87x | 1 | 4096 | 4096 | 1.0 | 0.2 |
| 4.81x | 1 | 2048 | 2048 | 1.0 | 0.001 |
| 4.79x | 1 | 4096 | 1024 | 1.0 | 0.001 |
#### Top 5 Performance Losses (Speedup < 1.0)
| Speedup | cd | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0.83x | 1 | 2048 | 8192 | 0.5 | 0.01 |
| 0.84x | 1 | 4096 | 1024 | 0.5 | 0.01 |
| 0.87x | 1 | 1024 | 1024 | 0.5 | 0.01 |
| 0.90x | 1 | 2048 | 8192 | 0.75 | 0.001 |
| 0.90x | 1 | 4096 | 2048 | 0.5 | 0.01 |
### Windows
**Geometric Mean Speedup:** 2.9540x
#### Top 5 Performance Gains (Speedup > 1.0)
| Speedup | cd | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 7.07x | 1 | 1024 | 1024 | 0.75 | 0.2 |
| 6.69x | 1 | 4096 | 4096 | 1.0 | 0.2 |
| 6.56x | 1 | 1024 | 2048 | 1.0 | 0.2 |
| 5.86x | 1 | 8192 | 4096 | 0.75 | 0.2 |
| 5.73x | 1 | 2048 | 1024 | 1.0 | 0.2 |
#### Bottom 5 Configurations (Lowest Speedups)
| Speedup | cd | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0.57x | 1 | 8192 | 8192 | 0.5 | 0.01 |
| 1.11x | 1 | 8192 | 8192 | 0.75 | 0.01 |
| 1.13x | 1 | 8192 | 8192 | 0.5 | 0.001 |
| 1.14x | 1 | 4096 | 8192 | 0.5 | 0.001 |
| 1.30x | 1 | 2048 | 1024 | 0.5 | 0.001 |
---
## DenseSparseDense
### Benchmark Result Summary
* the Vector API version is 5x to 25x slower than the scalar implementation
* performance decreases as density increases, suggesting that the SIMD
overhead scales with the number of non-zero elements
* the smallest slowdowns occur for the sparsest right-hand sides. In
these cases the scalar tail handles most of the work, since rows contain fewer
elements than the SIMD vector length; this indicates that the Vector API's gather
and scatter operations (fromArray() and intoArray()) are the primary bottleneck
(see the sketch after this list)
* again, better performance on the Intel CPU
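
For context, a minimal sketch of the gather/scatter inner-loop pattern referred to above (illustrative names, not the actual kernel code): the dense output row `c` is updated at the column positions `cix` of one sparse row, with a scalar tail for the remaining elements.

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;

public class GatherScatterSketch {
	private static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

	// c[cix[j]] += aval * bvals[j] for j in [0, blen); names are illustrative
	static void scatterUpdate(double[] c, double aval, int[] cix, double[] bvals, int blen) {
		DoubleVector av = DoubleVector.broadcast(SPECIES, aval);
		int bound = SPECIES.loopBound(blen);
		int j = 0;
		for (; j < bound; j += SPECIES.length()) {
			DoubleVector bv = DoubleVector.fromArray(SPECIES, bvals, j);
			// gather the current output values at the sparse column positions
			DoubleVector cv = DoubleVector.fromArray(SPECIES, c, 0, cix, j);
			// fused multiply-add, then scatter back; this gather/scatter pair carries most of the overhead
			bv.fma(av, cv).intoArray(c, 0, cix, j);
		}
		// scalar tail: for very sparse rows (blen < lane count) this loop does nearly all of the work
		for (; j < blen; j++)
			c[cix[j]] += aval * bvals[j];
	}
}
```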
### Benchmark Parameters
* **m:** 1, 1024, 4096
* **cd:** 1
* **n:** 1024, 4096
* **Sparsity Left:** 0.5, 0.75, 1.0
* **Sparsity Right:** 0.001, 0.01, 0.2
* **Total Configs:** 54 (I had to significantly reduce the number of configs
because the kernel is prohibitively slow for larger matrices)
### Mac
**Geometric Mean Speedup:** 0.1125x
#### Top 5 Configurations (Highest Speedups)
| Speedup | m | cd | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0.68x | 4096 | 1024 | 1024 | 0.75 | 0.001 |
| 0.67x | 1024 | 1024 | 1024 | 0.5 | 0.001 |
| 0.67x | 1024 | 1024 | 1024 | 0.75 | 0.001 |
| 0.67x | 4096 | 1024 | 1024 | 0.5 | 0.001 |
| 0.47x | 4096 | 1024 | 1024 | 1.0 | 0.001 |
#### Top 5 Performance Losses (Speedup < 1.0)
| Speedup | m | cd | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0.04x | 1024 | 1024 | 1024 | 1.0 | 0.2 |
| 0.04x | 4096 | 1024 | 1024 | 1.0 | 0.2 |
| 0.04x | 1024 | 4096 | 4096 | 1.0 | 0.2 |
| 0.04x | 1 | 4096 | 4096 | 1.0 | 0.2 |
| 0.04x | 1024 | 4096 | 4096 | 0.75 | 0.2 |
### Windows
**Geometric Mean Speedup:** 0.3121x
#### Top 5 Configurations (Highest Speedups)
| Speedup | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- |
| 0.91x | 4096 | 1024 | 0.5 | 0.001 |
| 0.87x | 1024 | 1024 | 0.75 | 0.001 |
| 0.87x | 4096 | 1024 | 1.0 | 0.001 |
| 0.86x | 4096 | 1024 | 0.75 | 0.001 |
| 0.85x | 1024 | 1024 | 1.0 | 0.001 |
#### Top 5 Performance Losses (Speedup < 1.0)
| Speedup | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- |
| 0.13x | 1024 | 1024 | 1.0 | 0.2 |
| 0.13x | 4096 | 1024 | 1.0 | 0.2 |
| 0.13x | 4096 | 4096 | 1.0 | 0.2 |
| 0.14x | 1 | 1024 | 0.75 | 0.2 |
| 0.14x | 1024 | 4096 | 1.0 | 0.2 |
---
## DenseSparseSparse
### Benchmark Result Summary
* the Vector API implementation is 12x to 100x slower at high sparsity but
achieves a 1.5x to 3.3x speedup (up to 5.4x on the Intel CPU) as density
increases toward 20%
* the cost of initializing and scanning the dense intermediate buffer for
every row dominates execution time when non-zeros are rare (see the sketch
after this list)
* better performance on the Intel CPU
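
To make that cost structure concrete, here is a minimal sketch of the dense-intermediate-buffer pattern this refers to (illustrative names, not the actual kernel code): the buffer is cleared and fully scanned for every output row, so O(n) work is paid even when only a handful of non-zeros are produced.

```java
import java.util.Arrays;

public class DenseIntermediateRowSketch {

	// Computes one output row via a dense accumulation buffer; appendToSparseOutput
	// stands in for writing into the sparse result block.
	static void computeRow(double[] tmp, double aval, int[] bix, double[] bvals, int blen) {
		Arrays.fill(tmp, 0);                    // initialize the dense buffer: O(n) per row
		for (int j = 0; j < blen; j++)          // accumulate the (few) non-zero contributions
			tmp[bix[j]] += aval * bvals[j];
		for (int j = 0; j < tmp.length; j++)    // scan the whole buffer for non-zeros: O(n) per row,
			if (tmp[j] != 0)                    // regardless of how few non-zeros were produced
				appendToSparseOutput(j, tmp[j]);
	}

	static void appendToSparseOutput(int col, double val) { /* placeholder for sparse append */ }
}
```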
### Benchmark Parameters
* **m:** 1024, 1050, 2048, 4073, 4096, 8192
* **cd:** 1
* **n:** 1024, 1050, 2048, 4073, 4096, 8192
* **Sparsity Left:** 0.5, 0.75, 1.0
* **Sparsity Right:** 0.001, 0.01, 0.1, 0.2
* **Total Configs:** 432
### Mac
**Geometric Mean Speedup:** 0.1731x
#### Top 5 Performance Gains (Speedup > 1.0)
| Speedup | cd | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 3.33x | 1 | 2048 | 4073 | 1.0 | 0.2 |
| 3.16x | 1 | 4096 | 2048 | 1.0 | 0.2 |
| 3.01x | 1 | 8192 | 2048 | 1.0 | 0.2 |
| 2.81x | 1 | 1024 | 4096 | 1.0 | 0.2 |
| 2.76x | 1 | 4096 | 1050 | 1.0 | 0.2 |
#### Top 5 Performance Losses (Speedup < 1.0)
| Speedup | cd | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0.00x | 1 | 8192 | 8192 | 1.0 | 0.001 |
| 0.00x | 1 | 2048 | 8192 | 0.5 | 0.001 |
| 0.00x | 1 | 4073 | 4096 | 1.0 | 0.001 |
| 0.00x | 1 | 8192 | 4073 | 1.0 | 0.001 |
| 0.00x | 1 | 4073 | 4073 | 0.75 | 0.001 |
### Windows
**Geometric Mean Speedup:** 0.2560x
#### Top 5 Performance Gains (Speedup > 1.0)
| Speedup | cd | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 5.36x | 1 | 4096 | 4096 | 1.0 | 0.2 |
| 5.31x | 1 | 1050 | 4096 | 1.0 | 0.2 |
| 5.13x | 1 | 4073 | 4096 | 1.0 | 0.2 |
| 5.00x | 1 | 8192 | 8192 | 0.75 | 0.2 |
| 5.00x | 1 | 4096 | 4073 | 1.0 | 0.2 |
#### Top 5 Performance Losses (Speedup < 1.0)
| Speedup | cd | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0.00x | 1 | 1050 | 8192 | 0.5 | 0.001 |
| 0.00x | 1 | 2048 | 8192 | 0.5 | 0.001 |
| 0.01x | 1 | 4073 | 8192 | 0.5 | 0.001 |
| 0.01x | 1 | 8192 | 8192 | 0.5 | 0.001 |
| 0.01x | 1 | 1024 | 8192 | 0.5 | 0.001 |
---
## SparseDenseMVTallRHS
### Benchmark Result Summary
* **Mac:** the vectorized implementation is consistently 3.7x to 7.7x slower
than the scalar baseline
* the regression is most severe for high sparsity and smaller matrix
dimensions
* **Intel CPU:** the vectorized implementation is on average ~9% faster than
the scalar baseline
* the larger vector capacity and hardware support for AVX2 provide
enough throughput to offset the vector setup costs
### Benchmark Parameters
* **m:** 2048, 4096, 8192
* **cd:** 4096, 8192, 16384
* **n:** 1
* **Sparsity Left:** 0.05, 0.1, 0.2
* **Sparsity Right:** 0.5, 0.75, 1.0
* **Total Configs:** 81
### Mac
**Geometric Mean Speedup:** 0.1938x
#### Top 5 Configurations (Highest Speedups)
| Speedup | cd | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0.27x | 16384 | 8192 | 1 | 0.1 | 0.75 |
| 0.27x | 16384 | 4096 | 1 | 0.2 | 1.0 |
| 0.27x | 16384 | 4096 | 1 | 0.1 | 0.5 |
| 0.27x | 16384 | 4096 | 1 | 0.2 | 0.5 |
| 0.27x | 16384 | 8192 | 1 | 0.1 | 0.5 |
#### Top 5 Performance Losses (Speedup < 1.0)
| Speedup | cd | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0.13x | 4096 | 2048 | 1 | 0.1 | 1.0 |
| 0.14x | 4096 | 2048 | 1 | 0.1 | 0.75 |
| 0.14x | 8192 | 4096 | 1 | 0.05 | 1.0 |
| 0.14x | 4096 | 2048 | 1 | 0.1 | 0.5 |
| 0.14x | 8192 | 4096 | 1 | 0.05 | 0.75 |
### Windows
**Geometric Mean Speedup:** 1.0880x
#### Top 5 Performance Gains (Speedup > 1.0)
| Speedup | cd | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 1.25x | 4096 | 2048 | 1 | 0.2 | 0.5 |
| 1.21x | 8192 | 8192 | 1 | 0.2 | 0.75 |
| 1.18x | 8192 | 2048 | 1 | 0.2 | 0.75 |
| 1.18x | 8192 | 2048 | 1 | 0.2 | 1.0 |
| 1.18x | 8192 | 4096 | 1 | 0.2 | 0.75 |
#### Top 5 Performance Losses (Speedup < 1.0)
| Speedup | cd | m | n | sparsityLeft | sparsityRight |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0.95x | 16384 | 2048 | 1 | 0.05 | 0.75 |
| 0.97x | 4096 | 4096 | 1 | 0.05 | 1.0 |
| 0.97x | 4096 | 4096 | 1 | 0.05 | 0.5 |
| 0.98x | 4096 | 4096 | 1 | 0.05 | 0.75 |
| 0.98x | 4096 | 8192 | 1 | 0.05 | 0.75 |