Biranavan-Parameswaran commented on PR #2376:
URL: https://github.com/apache/systemds/pull/2376#issuecomment-3662113175

   This patch introduces multithreading support for the roll operation to improve performance across a wide range of matrix shapes and sizes. The new `MatrixRollPerf` benchmark was added to evaluate performance characteristics under different workloads, and the existing `RollOperationThreadSafetyTest` was intentionally kept to ensure correctness and equivalence between the single-threaded and multithreaded execution paths.
   
   ---
   
   ## RollOperation: Single- vs. Multithreaded Performance Summary
   
   ### Test Coverage and Motivation
   
   Performance was evaluated across multiple matrix configurations to 
understand how matrix shape, density, and size affect multithreaded scaling. 
The goal was to validate expected speedups and identify edge cases where 
multithreading may not be beneficial.
   
   ---
   
   ## General Performance Observations
   
   ### Dense Matrices with Moderate Aspect Ratios
   
   For dense matrices with moderate to large column counts, multithreaded 
execution consistently improves performance.
   
   - Typical speedups range from **~1.3× to ~4×**, depending on matrix size and shape.
   - Larger matrices benefit more from multithreading because the fixed parallelization overhead is amortized over more work.
   
   These results confirm that the multithreaded roll implementation behaves as 
expected for typical dense workloads.
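   For intuition, the parallel pattern here can be sketched as a row-partitioned copy: each thread owns a contiguous range of source rows and writes them to their shifted destinations, so no synchronization is needed beyond joining the tasks. The following is a minimal standalone sketch of that idea, not the actual SystemDS implementation (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelRollSketch {
    // Roll the rows of a dense row-major matrix by 'shift' positions using k
    // threads: source row i lands at destination row (i + shift) % rlen.
    // Each task handles a disjoint, contiguous chunk of source rows.
    public static double[] roll(double[] in, int rlen, int clen, int shift, int k)
            throws Exception {
        final int s = ((shift % rlen) + rlen) % rlen; // normalize shift
        double[] out = new double[in.length];
        ExecutorService pool = Executors.newFixedThreadPool(k);
        try {
            int blk = (int) Math.ceil((double) rlen / k);
            List<Future<?>> tasks = new ArrayList<>();
            for (int t = 0; t < k; t++) {
                final int lo = t * blk, hi = Math.min(rlen, lo + blk);
                tasks.add(pool.submit(() -> {
                    for (int i = lo; i < hi; i++)
                        System.arraycopy(in, i * clen,
                            out, ((i + s) % rlen) * clen, clen);
                }));
            }
            for (Future<?> f : tasks)
                f.get(); // wait and propagate any task failure
        } finally {
            pool.shutdown();
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        double[] in = {1, 2,  3, 4,  5, 6,  7, 8}; // 4x2 matrix
        double[] out = roll(in, 4, 2, 1, 2);
        System.out.println(java.util.Arrays.toString(out));
        // → [7.0, 8.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    }
}
```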
   
   ---
   
   ### Sparse Matrices
   
   Sparse matrices show more nuanced behavior.
   
   - For small or moderately sized sparse matrices, multithreading can be 
slower due to scheduling and synchronization overhead.
   - As matrix sizes grow, multithreading becomes beneficial, with observed 
speedups up to **~4×**.
   
   This matches my expectations: sparse workloads need enough actual work per thread to justify the cost of parallel execution.
   
   ---
   
   ## Matrices with Few Columns and Large Row Counts
   
   Additional experiments were conducted for very tall matrices with few 
columns, including the extreme case of `clen == 1` (column vectors). These 
shapes were specifically tested following @janniklinde's feedback.
   
   ### Initial Behavior
   
   For column vectors, the original multithreaded implementation performed significantly worse than the single-threaded version, due to many small per-row copy operations and excessive threading overhead.
   
   ### Attempted Optimization
   
   An optimized multithreaded implementation was prototyped for the case where the underlying block is a column vector (`clen == 1`). This version replaced row-by-row copies with one or two large contiguous `System.arraycopy` calls, similar to the existing optimized path in `copyDenseMtx`.
   
   This optimization significantly improved the multithreaded behavior compared 
to the previous implementation.
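   For reference, rolling a column vector reduces to at most two contiguous bulk copies, which is the essence of the prototyped path. A minimal standalone sketch (not the committed SystemDS code):

```java
public class ColumnVectorRoll {
    // Roll a dense column vector (clen == 1) by 'shift' positions using at
    // most two contiguous bulk copies instead of n per-row copies.
    public static double[] roll(double[] in, int shift) {
        int n = in.length;
        shift = ((shift % n) + n) % n; // normalize to [0, n)
        double[] out = new double[n];
        // first n-shift source elements move forward by 'shift'
        System.arraycopy(in, 0, out, shift, n - shift);
        // last 'shift' source elements wrap around to the front
        System.arraycopy(in, n - shift, out, 0, shift);
        return out;
    }

    public static void main(String[] args) {
        double[] out = roll(new double[]{1, 2, 3, 4, 5}, 2);
        System.out.println(java.util.Arrays.toString(out));
        // → [4.0, 5.0, 1.0, 2.0, 3.0]
    }
}
```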
   
   ### Final Results
   
   Despite the improvement, benchmarking shows that even with this optimization:
   
   - For very large column vectors (100M–150M rows), **single-threaded execution remains faster**.
   - Example results:
     - 100M × 1 dense: ~50 ms (ST) vs. ~54 ms (MT)
     - 150M × 1 dense: ~68 ms (ST) vs. ~89 ms (MT)
   
   This indicates that the operation is memory-bandwidth bound and already near optimal when using one or two large `arraycopy` calls; the overhead of multithreading outweighs its benefits, even at large sizes.
   
   Because the optimized multithreaded path for `clen == 1` does not provide a 
net performance benefit, this change was **not committed**, as it would 
increase complexity without improving performance.
   
   ---
   
   ## Conclusion and Proposed Direction
   
   - Multithreading provides strong and consistent speedups for dense and sufficiently large sparse matrices.
   - Small or moderately sized sparse matrices and column vectors can suffer from threading overhead.
   - Column vectors (`clen == 1`) are best handled by a **single-threaded bulk copy**, which already saturates memory bandwidth.
   
   Based on these findings, a reasonable follow-up improvement would be to explicitly force a **single-threaded roll path for `clen == 1`**, while retaining multithreaded execution for wider matrices where it demonstrably improves performance.
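   Such a specialization could amount to a small dispatch guard. A hedged sketch under assumed names (not the actual SystemDS API):

```java
public class RollDispatchSketch {
    // Single-threaded per-row roll: the baseline path for narrow matrices.
    static double[] rollSingleThreaded(double[] in, int rlen, int clen, int shift) {
        int s = ((shift % rlen) + rlen) % rlen; // normalize shift
        double[] out = new double[in.length];
        for (int i = 0; i < rlen; i++)
            System.arraycopy(in, i * clen, out, ((i + s) % rlen) * clen, clen);
        return out;
    }

    // Hypothetical dispatch guard (method names are illustrative): column
    // vectors and k == 1 always take the single-threaded path; wider matrices
    // would keep the multithreaded row-partitioned path.
    static double[] roll(double[] in, int rlen, int clen, int shift, int k) {
        if (clen == 1 || k <= 1)
            return rollSingleThreaded(in, rlen, clen, shift);
        // placeholder: the real multithreaded path would partition the rows
        // across k threads here
        return rollSingleThreaded(in, rlen, clen, shift);
    }

    public static void main(String[] args) {
        // 3x1 column vector: takes the single-threaded path even with k = 8
        double[] out = roll(new double[]{1, 2, 3}, 3, 1, 1, 8);
        System.out.println(java.util.Arrays.toString(out));
        // → [3.0, 1.0, 2.0]
    }
}
```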
   
   If there is agreement on this approach, I can follow up with a targeted 
change implementing this specialization.
   

