mboehm7 commented on pull request #1127:
URL: https://github.com/apache/systemds/pull/1127#issuecomment-747736880


   ok, I just pushed some minor performance improvements for sparse-sparse 
transpose operations, which reduced the execution time of ten transpose 
operations on a 2.5M x 50 matrix (sparsity=0.1, seed=12) from 2.7s to 1.9s. 
Furthermore, I found the following:
   
   * For dense transpose operations, we have two significant parts: allocating 
the dense output, and the multi-threaded transpose operation itself. On a box 
with 112 vcores, the allocation is 10x more expensive than the actual transpose 
operation (see the first sketch after this list). The conclusion would be an 
in-place transpose wherever possible. For example, compression is injected 
directly after the persistent read, which makes it safe to apply the transpose 
in-place by default for both local and distributed compression. This approach 
would not just improve compression times but also eliminate the unnecessary 
temporary memory requirements. I leave this up to you, though.
   * In all my tests, the sparse-sparse transpose was slower for tall&skinny 
inputs than for short&wide inputs of the same size (i.e., swapped dimensions). 
Looking at the implementation, this also makes sense because the sparse-sparse 
transpose parallelizes over input columns, which allows unsynchronized output 
row updates (see the second sketch after this list). Did the above results 
maybe originate from earlier experiments where the scripts swapped the input 
arguments?
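
   To make the first point a bit more concrete, here is a rough, self-contained 
Java sketch of the kind of measurement meant above. It is plain-array code, not 
SystemDS code (class name and setup are made up for illustration); it simply 
contrasts allocating the zero-initialized dense output with a multi-threaded 
transpose into a pre-allocated buffer for the 2.5M x 50 case. It needs roughly 
a 4g heap, and the numbers will obviously vary with JVM and hardware.

```java
import java.util.stream.IntStream;

public class DenseTransposeCostSketch {
  public static void main(String[] args) {
    // dimensions from the example above: 2.5M x 50, dense row-major
    final int rows = 2_500_000, cols = 50;
    final double[] in = new double[rows * cols];

    // 1) allocation of the zero-initialized dense output (done by a single thread)
    long t0 = System.nanoTime();
    final double[] out = new double[rows * cols];
    long allocNanos = System.nanoTime() - t0;

    // 2) multi-threaded transpose into the pre-allocated output,
    //    parallelized over input rows (disjoint writes per row, no locking)
    long t1 = System.nanoTime();
    IntStream.range(0, rows).parallel().forEach(i -> {
      for (int j = 0; j < cols; j++)
        out[j * rows + i] = in[i * cols + j];
    });
    long transposeNanos = System.nanoTime() - t1;

    System.out.printf("alloc: %.1f ms, transpose: %.1f ms%n",
      allocNanos / 1e6, transposeNanos / 1e6);
  }
}
```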
   
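   For the second point, here is a much simplified sketch of that 
parallelization scheme, again not the actual SystemDS implementation but an 
illustration assuming a minimal list-based sparse row representation: input 
column j becomes output row j, so partitioning the input columns over tasks 
gives every task exclusive ownership of its output rows and no synchronization 
is needed. It also shows why tall&skinny inputs are at a disadvantage: a 
2.5M x 50 input has only 50 columns to split across 112 vcores, while the 
short&wide counterpart offers millions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

public class ColumnPartitionedTransposeSketch {
  // minimal sparse row: parallel lists of column indexes and values
  static final class SparseRow {
    final List<Integer> ix = new ArrayList<>();
    final List<Double> vals = new ArrayList<>();
    void append(int col, double val) { ix.add(col); vals.add(val); }
  }

  // transpose the rows of an nRows x nCols sparse matrix with k parallel tasks,
  // where each task owns a contiguous range of input columns (== output rows)
  static SparseRow[] transpose(SparseRow[] in, int nCols, int k) {
    SparseRow[] out = new SparseRow[nCols];
    for (int j = 0; j < nCols; j++)
      out[j] = new SparseRow();

    final int blk = (int) Math.ceil((double) nCols / k);
    IntStream.range(0, k).parallel().forEach(t -> {
      int lo = t * blk, hi = Math.min(nCols, lo + blk);
      if (lo >= hi)
        return; // more tasks than input columns -> idle task (the tall&skinny case)
      for (int i = 0; i < in.length; i++) { // each task scans all input rows
        SparseRow row = in[i];
        for (int p = 0; p < row.ix.size(); p++) {
          int j = row.ix.get(p);
          if (j >= lo && j < hi)               // entry falls into this task's column range
            out[j].append(i, row.vals.get(p)); // unsynchronized: only task t writes out[j]
        }
      }
    });
    return out;
  }
}
```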

