mboehm7 commented on pull request #1127: URL: https://github.com/apache/systemds/pull/1127#issuecomment-747736880
OK, I just pushed some minor performance improvements for sparse-sparse transpose operations, which reduced the execution time of ten such operations on a 2.5M x 50 input (sparsity=0.1, seed=12) from 2.7s to 1.9s. Furthermore, I found the following:

* For dense transpose operations, we have two significant parts: allocating the dense output, and the multi-threaded transpose operation itself. On a box with 112 vcores, the allocation is 10x more expensive than the actual transpose. The conclusion would be an in-place transpose wherever possible. For example, compression is injected directly after the persistent read, which makes it safe to use in-place by default for both local and distributed compression. This approach would not just improve compression times but also eliminate the unnecessary temporary memory requirements. I leave this up to you, though. (A rough sketch separating the two costs follows this list.)
* In all my tests, the sparse-sparse transpose was slower for tall&skinny inputs than for short&wide inputs of the same dimensions. Looking at the implementation, this also makes sense because the sparse-sparse transpose parallelizes over input columns, which allows unsynchronized output row updates (see the second sketch below). Did the above results maybe originate from earlier experiments where the scripts inverted the input arguments?
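
To make the dense-transpose point concrete, here is a rough, single-threaded micro-benchmark sketch (plain Java, not SystemDS code; only the matrix dimensions are taken from the experiment above, everything else is an assumption). It separates the cost of allocating/zeroing the dense output from the transpose loop itself; the JVM/OS may defer zero-initialization to first touch, so the split is only indicative.

```java
import java.util.concurrent.TimeUnit;

public class DenseTransposeAllocBench {
    public static void main(String[] args) {
        // hypothetical sizes matching the experiment; needs roughly 2 GB of heap (e.g., -Xmx4g)
        final int rows = 2_500_000, cols = 50;
        double[] in = new double[rows * cols];   // stand-in for the existing input block

        long t0 = System.nanoTime();
        double[] out = new double[rows * cols];  // allocation (and zeroing) of the dense output
        long t1 = System.nanoTime();

        // single-threaded row-major transpose: out is cols x rows
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                out[j * rows + i] = in[i * cols + j];
        long t2 = System.nanoTime();

        System.out.printf("alloc: %d ms, transpose: %d ms (out[0]=%f)%n",
            TimeUnit.NANOSECONDS.toMillis(t1 - t0),
            TimeUnit.NANOSECONDS.toMillis(t2 - t1), out[0]);
    }
}
```

Reusing or transposing into an already-allocated block would remove the first measured phase entirely, which is the in-place argument above.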

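For the sparse-sparse case, here is a minimal sketch of why parallelizing over input columns needs no locks (ad-hoc row-major sparse representation assumed, not the SystemDS SparseBlock API): output row j of the transpose collects exactly the nonzeros of input column j, so partitioning the input columns across tasks means each task appends to its own disjoint set of output rows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

public class SparseTransposeSketch {
    // one (columnIndex, value) pair per nonzero, stored per row (row-major sparse)
    record Cell(int idx, double val) {}

    static List<List<Cell>> transpose(List<List<Cell>> in, int nCols, int k) {
        // pre-create all output rows before the parallel section
        List<List<Cell>> out = new ArrayList<>(nCols);
        for (int j = 0; j < nCols; j++)
            out.add(new ArrayList<>());

        // each task owns a contiguous range of input columns = output rows,
        // so the appends below never touch the same output row from two threads
        IntStream.range(0, k).parallel().forEach(t -> {
            int lo = nCols * t / k, hi = nCols * (t + 1) / k;
            for (int i = 0; i < in.size(); i++)
                for (Cell c : in.get(i))
                    if (c.idx() >= lo && c.idx() < hi)          // only "my" output rows
                        out.get(c.idx()).add(new Cell(i, c.val()));
        });
        return out;
    }
}
```

Under this scheme a tall&skinny input (e.g., 50 columns) exposes at most 50 independent column ranges, well below 112 vcores, and every task still scans all 2.5M input rows, which would be consistent with tall&skinny being the slower case.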