Baunsgaard opened a new pull request, #1801: URL: https://github.com/apache/systemds/pull/1801
Transpose dense -> sparse:

Before, the single-threaded transpose dominated:

```
census_enc_16k-kmeans+-claWorkloadb16 -- dams-so001
Total elapsed time: 40.237 sec.
2 r' 15.085 112
3 compress 6.896 1
407,390.46 msec task-clock # 8.943 CPUs utilized
789,013,767,869 cycles # 1.937 GHz (33.31%)
938,096,650,617 instructions # 1.19 insn per cycle
census_enc_16k-kmeans+-claWorkloadb16 -- dams-so001
```

Then I removed an indirection in allocation via append on MCSR and managed the sparse vectors directly:

```
dams-so001 sysds: 81e554108686a1db2ff48ecd59e81d533d216b07 20:58:10
.------------------------------------
census_enc_16k-kmeans+-claWorkloadb16 -- dams-so001
Total elapsed time: 32.991 sec.
2 r' 9.334 112
3 compress 6.669 1
399,243.07 msec task-clock # 10.375 CPUs utilized
762,584,081,959 cycles # 1.910 GHz (33.28%)
928,305,632,331 instructions # 1.22 insn per cycle
census_enc_16k-kmeans+-claWorkloadb16 -- dams-so001
------------------------------------
```

And finally I parallelized the transpose:

```
dams-so001 sysds: 3a43b30b2b8dec983aa5cd7ea3c67c79a28b7f30 21:49:14
.------------------------------------
census_enc_16k-kmeans+-claWorkloadb16 -- dams-so001
Total elapsed time: 27.812 sec.
2 compress 6.801 1
4 r' 4.454 112
405,710.80 msec task-clock # 12.203 CPUs utilized
777,778,027,253 cycles # 1.917 GHz (33.34%)
967,872,124,119 instructions # 1.24 insn per cycle
census_enc_16k-kmeans+-claWorkloadb16 -- dams-so001
------------------------------------
baunsgaard@dams-so001:~/github/reprodu
```

In LMCG 16x, I found some optimizations to make as well. Here I added a skip list to offsetList that is hidden behind a SoftReference.

Before:

```
SystemDS Statistics:
Total elapsed time: 108.736 sec.
CLA Compression Phases : 2.065/8.329/18.484/15.647/0.001/0.000
Decompression with allocation (Single, Multi, Spark, Cache) : 0/101/0/0
Decompression with allocation Time (Single , Multi) : 0.000/27.285 sec.
Decompression to block (Single, Multi) : 0/0
Decompression to block Time (Single, Multi) : 0.000/0.000 sec.
```

With the skip list:

```
SystemDS Statistics:
Total elapsed time: 91.367 sec.
Total compilation time: 1.250 sec.
CLA Compression Phases : 2.154/7.847/20.251/15.372/0.002/0.000
Decompression with allocation (Single, Multi, Spark, Cache) : 0/101/0/0
Decompression with allocation Time (Single , Multi) : 0.000/8.685 sec.
Decompression to block (Single, Multi) : 0/0
Decompression to block Time (Single, Multi) : 0.000/0.000 sec.
```

The biggest improvement is in the transfer from Spark. I had misplaced a recompute of zeros, which dominated the transfer by recomputing all zeros in the entire matrix when pulling back a distributed compressed matrix. Fixing this speeds up, for instance, PCA 32x.

Before:

```
Total elapsed time: 148.328 sec.
Spark trans times (par,bc,col): 0.000/0.021/85.999 secs.
```

After:

```
Total elapsed time: 65.700 sec.
Spark trans times (par,bc,col): 0.000/0.021/6.980 secs.
```
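
To illustrate the idea of a skip list on an offset list that sits behind a SoftReference, here is a minimal sketch. It is not the SystemDS implementation: the class `OffsetListSketch`, the delta encoding in an `int[]`, the stride of 4, and both method names are assumptions made for this example. The skip list stores the absolute offset at every fourth entry, so a lookup can jump close to the target row instead of summing deltas from the start, and the garbage collector may drop the skip list under memory pressure because it can be rebuilt on demand.

```java
import java.lang.ref.SoftReference;

/** Hypothetical delta-encoded offset list; names and layout are illustrative only. */
final class OffsetListSketch {
    // Assumed checkpoint distance; kept small so the example is easy to trace.
    private static final int STRIDE = 4;

    // deltas[0] is the first absolute offset; deltas[i] is offset[i] - offset[i-1].
    private final int[] deltas;

    // Skip list held softly: the GC may clear it, and it is then rebuilt lazily.
    // Not thread-safe; this sketch assumes single-threaded access.
    private SoftReference<int[]> skipRef = new SoftReference<>(null);

    OffsetListSketch(int[] sortedOffsets) {
        deltas = new int[sortedOffsets.length];
        int prev = 0;
        for (int i = 0; i < sortedOffsets.length; i++) {
            deltas[i] = sortedOffsets[i] - prev;
            prev = sortedOffsets[i];
        }
    }

    /** Returns the skip list, rebuilding it if it was never built or was cleared. */
    private int[] skipList() {
        int[] skip = skipRef.get();
        if (skip == null) {
            skip = new int[(deltas.length + STRIDE - 1) / STRIDE];
            int abs = 0;
            for (int i = 0; i < deltas.length; i++) {
                abs += deltas[i];
                if (i % STRIDE == 0)
                    skip[i / STRIDE] = abs; // absolute offset at index i
            }
            skipRef = new SoftReference<>(skip);
        }
        return skip;
    }

    /** Index of the first offset >= row, or the list size if there is none. */
    int firstIndexAtOrAfter(int row) {
        if (deltas.length == 0)
            return 0;
        int[] skip = skipList();
        // Advance to the last checkpoint whose absolute offset is still below row.
        int k = 0;
        while (k + 1 < skip.length && skip[k + 1] < row)
            k++;
        // Finish with a short delta scan of at most STRIDE steps past the checkpoint.
        int i = k * STRIDE;
        int abs = skip[k];
        while (abs < row) {
            i++;
            if (i == deltas.length)
                return i;
            abs += deltas[i];
        }
        return i;
    }

    /** Baseline without the skip list: sums deltas from the beginning. */
    int firstIndexAtOrAfterLinear(int row) {
        int abs = 0;
        for (int i = 0; i < deltas.length; i++) {
            abs += deltas[i];
            if (abs >= row)
                return i;
        }
        return deltas.length;
    }
}
```

For the offsets `{1, 3, 7, 8, 20, 25, 31, 40, 41}`, `firstIndexAtOrAfter(26)` starts at the checkpoint for index 4 (offset 20) and returns index 6 (offset 31) after two delta steps, whereas the linear baseline sums seven deltas. A decompression task that starts at an arbitrary row block would benefit in the same way, which is one plausible reading of the drop in multi-threaded decompression time reported above.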