Biranavan-Parameswaran commented on PR #2376:
URL: https://github.com/apache/systemds/pull/2376#issuecomment-3662113175
This patch introduces multithreading support for the roll operation to
improve performance across a wide range of matrix shapes and sizes. The new
`MatrixRollPerf` benchmark was added to evaluate performance characteristics
under different workloads, while the existing `RollOperationThreadSafetyTest`
was intentionally kept to ensure correctness and equivalence between the
single-threaded and multithreaded execution paths.
---
## RollOperation: Single-Threaded vs. Multithreaded Performance Summary
### Test Coverage and Motivation
Performance was evaluated across multiple matrix configurations to
understand how matrix shape, density, and size affect multithreaded scaling.
The goal was to validate expected speedups and identify edge cases where
multithreading may not be beneficial.
---
## General Performance Observations
### Dense Matrices with Moderate Aspect Ratios
For dense matrices with moderate to large column counts, multithreaded
execution consistently improves performance.
- Typical speedups range from **~1.3× to ~4×**, depending on matrix size and
shape.
- Larger matrices benefit more from multithreading since overhead becomes
less significant.
These results confirm that the multithreaded roll implementation behaves as
expected for typical dense workloads.
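For context, a typical parallel roll over a dense row-major block partitions the output rows across threads, each thread copying its rows from the shifted source position. The sketch below illustrates this shape of implementation; the class, method names, and row-blocking scheme are illustrative, not the actual SystemDS code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParRollSketch {
    // Roll the rows of a dense, row-major rlen x clen matrix down by 'shift'
    // (with wrap-around), partitioning the output rows across k threads.
    static double[] rollMT(double[] in, int rlen, int clen, int shift, int k) {
        double[] out = new double[in.length];
        int s = ((shift % rlen) + rlen) % rlen; // normalize shift into [0, rlen)
        ExecutorService pool = Executors.newFixedThreadPool(k);
        try {
            List<Future<?>> tasks = new ArrayList<>();
            int blk = (rlen + k - 1) / k; // rows per thread, rounded up
            for (int t = 0; t < k; t++) {
                final int lo = t * blk, hi = Math.min(rlen, lo + blk);
                tasks.add(pool.submit(() -> {
                    for (int i = lo; i < hi; i++) {
                        int src = (i - s + rlen) % rlen; // input row feeding output row i
                        System.arraycopy(in, src * clen, out, i * clen, clen);
                    }
                }));
            }
            for (Future<?> f : tasks)
                f.get(); // wait and propagate any task failure
        }
        catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
        finally {
            pool.shutdown();
        }
        return out;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5, 6}; // 3 x 2 matrix, rows (1,2),(3,4),(5,6)
        // Rolling down by 1 yields rows (5,6),(1,2),(3,4).
        System.out.println(java.util.Arrays.toString(rollMT(x, 3, 2, 1, 2)));
    }
}
```

With wide rows, each `System.arraycopy` moves `clen` doubles at once, so per-row overhead amortizes well; this is exactly what degrades for `clen == 1`, as discussed below.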
---
### Sparse Matrices
Sparse matrices show more nuanced behavior.
- For small or moderately sized sparse matrices, multithreading can be
slower due to scheduling and synchronization overhead.
- As matrix sizes grow, multithreading becomes beneficial, with observed
speedups up to **~4×**.
This aligns with expectations: sparse workloads require sufficient actual
work per thread to justify parallel execution.
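One common way to express this trade-off is to cap the degree of parallelism by the amount of work available. The sketch below uses a hypothetical nnz-per-thread threshold constant, purely for illustration; the actual cut-off would need tuning against the benchmark results:

```java
public class ParThresholdSketch {
    // Assumed tuning constant: minimum non-zeros each thread should process
    // before parallel execution pays for its scheduling overhead.
    static final long MIN_NNZ_PER_THREAD = 1_000_000;

    // Cap the thread count by available work; degrades to 1 (single-threaded)
    // when the matrix is too small to split profitably.
    static int chooseDegreeOfParallelism(long nnz, int maxThreads) {
        return (int) Math.max(1, Math.min(maxThreads, nnz / MIN_NNZ_PER_THREAD));
    }

    public static void main(String[] args) {
        System.out.println(chooseDegreeOfParallelism(500_000, 8));    // small sparse input -> 1
        System.out.println(chooseDegreeOfParallelism(32_000_000, 8)); // large input -> 8
    }
}
```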
---
## Matrices with Few Columns and Large Row Counts
Additional experiments were conducted for very tall matrices with few
columns, including the extreme case of `clen == 1` (column vectors). These
shapes were specifically tested following @janniklinde's feedback.
### Initial Behavior
For column vectors, the original multithreaded implementation performed
significantly worse than the single-threaded version, due to the large number
of small per-row copy operations and the resulting threading overhead.
### Attempted Optimization
An optimized multithreaded implementation was prototyped for the `clen == 1`
(column vector) case. This version replaced the row-by-row copies with one or
two large contiguous `System.arraycopy` calls, similar to the existing
optimized path in `copyDenseMtx`.
This optimization significantly improved the multithreaded behavior compared
to the previous implementation.
### Final Results
Despite the improvement, benchmarking shows that even with this optimization:
- For very large column vectors (100M–150M rows), **single-threaded
execution remains faster**.
- Example results:
  - 100M × 1 dense: ~50 ms (ST) vs. ~54 ms (MT)
  - 150M × 1 dense: ~68 ms (ST) vs. ~89 ms (MT)
This indicates that the operation is memory-bandwidth-bound and already near
optimal when using one or two large `arraycopy` calls; the overhead of
multithreading outweighs its benefits even at these sizes.
Because the optimized multithreaded path for `clen == 1` does not provide a
net performance benefit, this change was **not committed**, as it would
increase complexity without improving performance.
---
## Conclusion and Proposed Direction
- Multithreading provides strong and consistent speedups for dense and
sufficiently large sparse matrices.
- Small or moderately sized sparse matrices and column vectors can suffer
from threading overhead.
- Column vectors (`clen == 1`) are best handled by a **single-threaded bulk
copy**, as this already saturates memory bandwidth.

Based on these findings, a reasonable follow-up improvement would be to
explicitly force a **single-threaded roll path for `clen == 1`**, while
retaining multithreaded execution for wider matrices where it demonstrably
improves performance.
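A minimal sketch of what this specialization could look like; the method name and its placement are hypothetical, and the real change would sit wherever the degree of parallelism for the roll operation is chosen:

```java
public class RollDispatchSketch {
    // Hypothetical dispatch guard: fall back to the single-threaded bulk-copy
    // path for column vectors, keep the parallel path for wider matrices.
    static boolean useSingleThreadedRoll(int rlen, int clen, int k) {
        // clen == 1 is memory-bandwidth-bound: one or two arraycopy calls
        // already saturate bandwidth, so extra threads only add overhead.
        return clen == 1 || k <= 1;
    }

    public static void main(String[] args) {
        System.out.println(useSingleThreadedRoll(150_000_000, 1, 8)); // column vector -> true
        System.out.println(useSingleThreadedRoll(10_000, 10_000, 8)); // wide matrix   -> false
    }
}
```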
If there is agreement on this approach, I can follow up with a targeted
change implementing this specialization.