[PR] Vector API Implementation for dense codegen primitives (Divisions, Aggregations, Comparisons, MultiplyAdd) + benchmarks [systemds]

via GitHub Fri, 30 Jan 2026 14:57:53 -0800


JulianJuelg opened a new pull request, #2428:
URL: https://github.com/apache/systemds/pull/2428


   This PR adds a Java Vector API implementation for dense codegen primitives 
in the following groups:
   
   - Aggregation
   - Division
   - Comparison
   - Multiply-add (remaining)
   
   The new vectorized implementations were benchmarked against the previous 
scalar-loop versions (see results below) with JMH microbenchmarks and a 
standalone Java benchmark suite included in this PR. In most cases, both 
harnesses show the same trend. In cases where they differ slightly, JMH is used 
as the primary signal due to its lower volatility.
   
   For each primitive, I compared the Vector API version to the existing scalar 
loop:
   - If performance was equal or better, I replaced the scalar loop with the 
vectorized implementation.
   - If the Vector API version was slower, I kept the scalar implementation as 
the default and left the vectorized version in the codebase for reference.
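
The replacement rule above can be sketched as a small helper. This is only an illustration: the class and method names are hypothetical and do not appear in the PR, and the speedup is assumed to be defined as scalar time divided by Vector API time, so that values above 1 mean the vectorized version is faster.

```java
// Hypothetical helper illustrating the replacement rule; not part of the PR.
final class ReplacementRule {

    /** Assumed definition: scalar ns/op divided by Vector API ns/op (> 1 means faster). */
    static double speedup(double scalarNsPerOp, double vectorNsPerOp) {
        return scalarNsPerOp / vectorNsPerOp;
    }

    /** Replace the scalar loop only if the vectorized version is at least as fast. */
    static boolean shouldReplace(double speedup) {
        return speedup >= 1.0;
    }

    public static void main(String[] args) {
        // JMH speedups for two primitives, taken from the results table in this PR.
        System.out.println("vectMax replaced: " + shouldReplace(2.002));
        System.out.println("vectSum replaced: " + shouldReplace(0.322));
    }
}
```

Applying this threshold to the JMH speedup column gives the same decision as the Replaced column for every row of the results table.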
   
   Benchmark setup
   JDK version: 21
   JMH version: 1.37
   OS: macOS
   Machine: Apple M2/M, 16 GB RAM, 128-bit SIMD vector width
   Input size (double arrays): 1,000,000 elements
   Warmup time: 1 s per primitive
   Measurement: 1 iteration
   JMH params: 2 forks
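
For readers unfamiliar with the standalone style of measurement, the following is a minimal sketch of a warmup-then-measure timing loop for one scalar primitive. It is not the benchmark suite included in this PR: the class name, the method names, and the loop shape `c[i] += a[i] / b` are assumptions made for illustration, and the warmup here is counted in iterations rather than the 1 s of wall-clock time listed above.

```java
import java.util.Random;

// Hypothetical timing sketch; not the benchmark suite from the PR.
final class ScalarLoopTiming {

    /** Assumed scalar reference loop: divide each element by b and accumulate into c. */
    static void scalarDivAdd(double[] a, double b, double[] c) {
        for (int i = 0; i < a.length; i++) {
            c[i] += a[i] / b;
        }
    }

    /** Runs untimed warmup iterations, then returns the average ns per call. */
    static double averageNsPerOp(double[] a, double b, double[] c,
                                 int warmupIters, int measureIters) {
        for (int i = 0; i < warmupIters; i++) {
            scalarDivAdd(a, b, c); // lets the JIT compile the loop before timing
        }
        long start = System.nanoTime();
        for (int i = 0; i < measureIters; i++) {
            scalarDivAdd(a, b, c);
        }
        return (System.nanoTime() - start) / (double) measureIters;
    }

    public static void main(String[] args) {
        int n = 1_000_000; // input size used in this PR
        Random rnd = new Random(42);
        double[] a = rnd.doubles(n).toArray();
        double[] c = new double[n];
        System.out.printf("scalarDivAdd: %.0f ns/op%n",
                averageNsPerOp(a, 3.0, c, 50, 50));
    }
}
```

A hand-rolled loop like this has no fork isolation and no protection against dead-code elimination, which is one plausible reason its numbers are more volatile than JMH's.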
   
   Note: These benchmarks were run with a 128-bit SIMD vector width, which 
holds only 2 lanes for doubles. On production deployments with wider SIMD 
(e.g., 256-bit or 512-bit where available), the vectorized implementations are 
expected to provide equal or better speedups due to increased lane-level 
parallelism.
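
The lane arithmetic behind this note is simply the vector width divided by the element width. The sketch below works it out for the widths mentioned; the class and method names are hypothetical. In Vector API code the lane count would normally be queried from the species (for example `DoubleVector.SPECIES_PREFERRED.length()`) rather than computed by hand.

```java
// Hypothetical helper: lanes = vector width in bits / element width in bits.
final class LaneCount {

    static int lanes(int vectorBits, int elementBits) {
        return vectorBits / elementBits;
    }

    public static void main(String[] args) {
        // Double.SIZE is 64, the width of a double in bits.
        for (int bits : new int[] {128, 256, 512}) {
            System.out.println(bits + "-bit vector: "
                    + lanes(bits, Double.SIZE) + " double lanes");
        }
    }
}
```

With 64-bit doubles this gives 2, 4, and 8 lanes for 128-bit, 256-bit, and 512-bit vectors respectively.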
   
   
   | Primitive function | ns/op (JMH) | JMH speedup with Vector API | Java test speedup with Vector API | Replaced |
   |---|---:|---:|---:|---|
   | vectDivAdd | 231671 | 1.066 | 1.887 | Yes |
   | vectDivAdd2 | 218818 | 1.066 | 1.686 | Yes |
   | vectDivWrite | 359339 | 0.687 | 1.489 | No |
   | vectDivWrite2 | 343183 | 0.7215 | 0.717 | No |
   | vectDivWrite3 | 535898 | 0.7821 | 0.603 | No |
   | rowMaxsVectMult | 298328 | 1.006 | 1.346 | Yes |
   | rowMaxsVectMult_aix | 738767 | 0.115 | 0.077 | No |
   | vectSum | 142065 | 0.322 | 0.565 | No |
   | vectMax | 596046 | 2.002 | 1.933 | Yes |
   | vectCountnnz | 297805 | 1.594 | 1.538 | Yes |
   | vectEqualAdd | 427437 | 1.959 | 2.077 | Yes |
   | vectEqualWrite2 | 414717 | 1.183 | 0.801 | Yes |
   | vectEqualWrite | 415329 | 1.189 | 1.402 | Yes |
   | vectGreaterAdd | 427981 | 1.936 | 2.114 | Yes |
   | vectGreaterWrite2 | 552023 | 0.588 | 0.919 | No |
   | vectGreaterWrite | 458332 | 1.309 | 0.927 | Yes |
   | vectLessAdd | 531844 | 2.433 | 2.052 | Yes |
   | vectLessWrite2 | 545457 | 1.011 | 0.951 | Yes |
   | vectLessWrite | 414025 | 1.203 | 1.039 | Yes |
   | vectLessequalAdd | 426307 | 1.960 | 2.052 | Yes |
   | vectLessequalWrite2 | 540476 | 1.014 | 0.962 | Yes |
   | vectLessequalWrite | 414514 | 1.181 | 0.953 | Yes |
   | vectMin | 589668 | 2.000 | 1.996 | Yes |
   | vectMult2Add | 228636 | 1.052 | 1.284 | Yes |
   | vectMult2Write | 377074 | 2.136 | 1.375 | Yes |
   | vectNotequalAdd | 424749 | 1.945 | 1.643 | Yes |
   | vectNotequalWrite2 | 566433 | 0.714 | 0.821 | No |
   | vectNotequalWrite | 417206 | 1.203 | 0.941 | Yes |
   


