phaniarnab commented on PR #2271:
URL: https://github.com/apache/systemds/pull/2271#issuecomment-2957890205
Sure, I will provide some tests @ReneEnjilian. The GPU backend is well tested. Mark added an infrastructure to run all tests with GPU, and I have kept using it for my GPU tests, which cover the cuDNN and cuSPARSE libraries as well as fused operators like conv2d-bias-add. Here are two examples from yesterday's run on my laptop.

Grid search hyperparameter tuning for LM (no cuDNN):
```
SystemDS Statistics:
Total elapsed time:             9.070 sec.
Total compilation time:         2.966 sec.
Total execution time:           6.103 sec.
Number of compiled Spark inst:  124.
Number of executed Spark inst:  0.
CUDA/CuLibraries init time:     1.123/1.621 sec.
Number of executed GPU inst:    11452.
GPU mem alloc time  (alloc(success/fail) / dealloc / set0):       0.061(0.061/0.000) / 0.058 / 0.154 sec.
GPU mem alloc count (alloc(success/fail/reuse) / dealloc / set0): 3156(3156/0/17376) / 2527 / 20532.
GPU mem tx time  (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)): 0.136(0.000/0.000) / 0.244(0.000/0.000) / 0.000(0.000/0.000) sec.
GPU mem tx count (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)): 2605(0/0) / 2527(0/0) / 0(0/0).
GPU conversion time  (sparseConv / sp2dense / dense2sp): 0.000 / 0.018 / 0.802 sec.
GPU conversion count (sparseConv / sp2dense / dense2sp): 0 / 210 / 210.
Cache hits (Mem/Li/WB/FS/HDFS): 6482/0/0/0/0.
Cache writes (Li/WB/FS/HDFS):   1/384/0/1.
Cache times (ACQr/m, RLS, EXP): 0.290/0.010/0.051/0.162 sec.
HOP DAGs recompiled (PRED, SB): 525/6405.
HOP DAGs recompile time:        1.417 sec.
Functions recompiled:           2.
Functions recompile time:       0.008 sec.
Spark ctx create time (lazy):   0.000 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
Async. OP count (pf,bc,op):     260/0/0.
Total JIT compile time:         36.875 sec.
Total JVM GC count:             2.
Total JVM GC time:              0.014 sec.
Heavy hitter instructions:
  #  Instruction  Time(s)  Count
  1  m_lm           4.865    525
  2  m_lmDS         4.744    525
  3  gpu_solve      2.118    525
  4  gpu_+          0.955   1225
  5  l2norm         0.401    525
  6  leftIndex      0.247   1925
  7  write          0.163      1
  8  gpu_ba+*       0.126   1575
  9  gpu_*          0.125   1470
 10  gpu_append     0.093    700
```

ResNet18 (w/ cuDNN)
---
```
SystemDS Statistics:
Total elapsed time:          681.490 sec.
Total compilation time:      1.958 sec.
Total execution time:        679.532 sec.
CUDA/CuLibraries init time:  0.686/669.978 sec.
Number of executed GPU inst: 258.
GPU mem alloc time  (alloc(success/fail) / dealloc / set0):       0.016(0.016/0.000) / 0.000 / 0.021 sec.
GPU mem alloc count (alloc(success/fail/reuse) / dealloc / set0): 99(99/0/635) / 4 / 734.
GPU mem tx time  (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)): 9.119(0.000/0.000) / 0.002(0.000/0.000) / 0.000(0.000/0.000) sec.
GPU mem tx count (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)): 88(0/0) / 4(0/0) / 0(0/0).
GPU conversion time  (sparseConv / sp2dense / dense2sp): 0.000 / 1.102 / 0.000 sec.
GPU conversion count (sparseConv / sp2dense / dense2sp): 0 / 74 / 0.
Cache hits (Mem/Li/WB/FS/HDFS): 170/0/0/0/0.
Cache writes (Li/WB/FS/HDFS):   18/0/0/0.
Cache times (ACQr/m, RLS, EXP): 0.004/0.002/0.005/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/241.
HOP DAGs recompile time:        0.265 sec.
Functions recompiled:           1.
Functions recompile time:       0.024 sec.
Total JIT compile time:         10.543 sec.
Total JVM GC count:             1.
Total JVM GC time:              0.015 sec.
Heavy hitter instructions:
  #  Instruction          Time(s)  Count
  1  gpu_conv2d_bias_add  669.544     61
  2  resnet18_forward       9.731      1
  3  bn2d_forward           9.188     60
  4  gpu_batch_norm2d       9.110     60
  5  basic_block            8.823     24
  6  rand                   0.283    162
  7  gpu_*                  0.061     26
  8  gpu_uak+               0.051      1
  9  gpu_rightIndex         0.026      1
 10  getWeights             0.023     54
```

You can simply add `AutomatedTestBase.TEST_GPU = true;` to any Java test to enable GPU (e.g., GPUFullReuseTest.java); a minimal sketch follows at the end of this comment. For arbitrary dml scripts, you need to use the `-gpu` option, as you already found out. Notice the GPU-specific statistics, which are well implemented and helpful.

I will find a way to send you my scripts. I recommend starting with simple scripts, such as the individual layers under nn/layers. Once they run fine, you can run the scripts under nn/examples and nn/networks. Later we can try the scripts I implemented for MEMPHIS, which are more complex and test GPU memory management and copy performance. Except for some of the recent scripts, I regularly executed the nn scripts on GPU. Outside of nn, LM and other simpler builtins should run without error.

For the ResNet example above, notice the large CUDA-library init time, which I mentioned before. I am curious whether this issue is gone with the new CUDA.
`CUDA/CuLibraries init time: 0.686/669.978 sec.`

I agree with you that we need to make the testing framework better for GPU in the near future. Let me know if these help. I can run some more examples later today.
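P.S. Since the `AutomatedTestBase.TEST_GPU = true;` route may not be obvious, here is a minimal sketch of what such a test could look like. This is my illustration, not code from this PR: the class name, test directory, and script name are placeholders, and the surrounding boilerplate just mirrors the usual AutomatedTestBase pattern (compare GPUFullReuseTest.java):

```java
import org.apache.sysds.test.AutomatedTestBase;
import org.apache.sysds.test.TestConfiguration;
import org.junit.Test;

// Hypothetical smoke test; name, directory, and script are placeholders.
public class GpuSmokeTest extends AutomatedTestBase {
	private static final String TEST_NAME = "GpuSmoke";
	private static final String TEST_DIR = "functions/gpu/";

	@Override
	public void setUp() {
		// The switch mentioned above: run this test with the GPU backend enabled.
		AutomatedTestBase.TEST_GPU = true;
		addTestConfiguration(TEST_NAME, new TestConfiguration(TEST_DIR, TEST_NAME));
	}

	@Test
	public void testOnGpu() {
		loadTestConfiguration(getTestConfiguration(TEST_NAME));
		fullDMLScriptName = SCRIPT_DIR + TEST_DIR + TEST_NAME + ".dml";
		programArgs = new String[] {"-stats"}; // prints statistics like those shown above
		runTest(true, false, null, -1);
	}
}
```

For standalone scripts, passing `-gpu` on the command line has the same effect, as you already found.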