phaniarnab commented on PR #2271:
URL: https://github.com/apache/systemds/pull/2271#issuecomment-2957890205
Sure, I will provide some tests @ReneEnjilian. The GPU backend is well
tested. Mark added infrastructure to run all tests on GPU, and I have kept
using it for my GPU tests, which cover the cuDNN and cuSPARSE libraries and
fused operators like conv2d-bias-add. Here are two examples from yesterday's
run on my laptop:
Grid search hyperparameter tuning for LM (no cuDNN):
```
SystemDS Statistics:
Total elapsed time: 9.070 sec.
Total compilation time: 2.966 sec.
Total execution time: 6.103 sec.
Number of compiled Spark inst: 124.
Number of executed Spark inst: 0.
CUDA/CuLibraries init time: 1.123/1.621 sec.
Number of executed GPU inst: 11452.
GPU mem alloc time (alloc(success/fail) / dealloc / set0): 0.061(0.061/0.000) / 0.058 / 0.154 sec.
GPU mem alloc count (alloc(success/fail/reuse) / dealloc / set0): 3156(3156/0/17376) / 2527 / 20532.
GPU mem tx time (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)): 0.136(0.000/0.000) / 0.244(0.000/0.000) / 0.000(0.000/0.000) sec.
GPU mem tx count (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)): 2605(0/0) / 2527(0/0) / 0(0/0).
GPU conversion time (sparseConv / sp2dense / dense2sp): 0.000 / 0.018 / 0.802 sec.
GPU conversion count (sparseConv / sp2dense / dense2sp): 0 / 210 / 210.
Cache hits (Mem/Li/WB/FS/HDFS): 6482/0/0/0/0.
Cache writes (Li/WB/FS/HDFS): 1/384/0/1.
Cache times (ACQr/m, RLS, EXP): 0.290/0.010/0.051/0.162 sec.
HOP DAGs recompiled (PRED, SB): 525/6405.
HOP DAGs recompile time: 1.417 sec.
Functions recompiled: 2.
Functions recompile time: 0.008 sec.
Spark ctx create time (lazy): 0.000 sec.
Spark trans counts (par,bc,col):0/0/0.
Spark trans times (par,bc,col): 0.000/0.000/0.000 secs.
Async. OP count (pf,bc,op): 260/0/0.
Total JIT compile time: 36.875 sec.
Total JVM GC count: 2.
Total JVM GC time: 0.014 sec.
Heavy hitter instructions:
 #  Instruction   Time(s)  Count
 1  m_lm            4.865    525
 2  m_lmDS          4.744    525
 3  gpu_solve       2.118    525
 4  gpu_+           0.955   1225
 5  l2norm          0.401    525
 6  leftIndex       0.247   1925
 7  write           0.163      1
 8  gpu_ba+*        0.126   1575
 9  gpu_*           0.125   1470
10  gpu_append      0.093    700
```
ResNet18 (w/ cuDNN):
```
SystemDS Statistics:
Total elapsed time: 681.490 sec.
Total compilation time: 1.958 sec.
Total execution time: 679.532 sec.
CUDA/CuLibraries init time: 0.686/669.978 sec.
Number of executed GPU inst: 258.
GPU mem alloc time (alloc(success/fail) / dealloc / set0): 0.016(0.016/0.000) / 0.000 / 0.021 sec.
GPU mem alloc count (alloc(success/fail/reuse) / dealloc / set0): 99(99/0/635) / 4 / 734.
GPU mem tx time (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)): 9.119(0.000/0.000) / 0.002(0.000/0.000) / 0.000(0.000/0.000) sec.
GPU mem tx count (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)): 88(0/0) / 4(0/0) / 0(0/0).
GPU conversion time (sparseConv / sp2dense / dense2sp): 0.000 / 1.102 / 0.000 sec.
GPU conversion count (sparseConv / sp2dense / dense2sp): 0 / 74 / 0.
Cache hits (Mem/Li/WB/FS/HDFS): 170/0/0/0/0.
Cache writes (Li/WB/FS/HDFS): 18/0/0/0.
Cache times (ACQr/m, RLS, EXP): 0.004/0.002/0.005/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/241.
HOP DAGs recompile time: 0.265 sec.
Functions recompiled: 1.
Functions recompile time: 0.024 sec.
Total JIT compile time: 10.543 sec.
Total JVM GC count: 1.
Total JVM GC time: 0.015 sec.
Heavy hitter instructions:
 #  Instruction          Time(s)  Count
 1  gpu_conv2d_bias_add  669.544     61
 2  resnet18_forward       9.731      1
 3  bn2d_forward           9.188     60
 4  gpu_batch_norm2d       9.110     60
 5  basic_block            8.823     24
 6  rand                   0.283    162
 7  gpu_*                  0.061     26
 8  gpu_uak+               0.051      1
 9  gpu_rightIndex         0.026      1
10  getWeights             0.023     54
```
You can simply add `AutomatedTestBase.TEST_GPU = true;` to any Java test to
enable GPU (e.g., GPUFullReuseTest.java); see the sketch below. For arbitrary
DML scripts, you need to use the `-gpu` option, as you already found out.
Notice the GPU-specific statistics, which are well implemented and helpful.
I will find a way to send you my scripts.
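For reference, here is a minimal sketch of such a test; it follows the usual AutomatedTestBase pattern from the SystemDS test suite, and the test name, directory, and output variable are hypothetical placeholders:
```java
// Minimal sketch only: TEST_NAME, TEST_DIR, and the output name "R" are
// hypothetical; the surrounding calls follow the common AutomatedTestBase pattern.
package org.apache.sysds.test.functions.gpu;

import org.junit.Test;
import org.apache.sysds.test.AutomatedTestBase;
import org.apache.sysds.test.TestConfiguration;

public class GpuSmokeTest extends AutomatedTestBase {
	private static final String TEST_NAME = "GpuSmoke";      // hypothetical
	private static final String TEST_DIR = "functions/gpu/"; // hypothetical

	@Override
	public void setUp() {
		addTestConfiguration(TEST_NAME,
			new TestConfiguration(TEST_DIR, TEST_NAME, new String[] {"R"}));
	}

	@Test
	public void testRunOnGpu() {
		TEST_GPU = true; // route execution to the GPU backend, as in GPUFullReuseTest
		loadTestConfiguration(getTestConfiguration(TEST_NAME));
		fullDMLScriptName = getScript();       // resolves the .dml script for this test
		programArgs = new String[] {"-stats"}; // print statistics like the ones above
		runTest(true, false, null, -1);        // execute the DML script end-to-end
	}
}
```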
I recommend starting with simple scripts, such as the individual layers under
nn/layers. Once they run fine, you can run the scripts under nn/examples and
nn/networks. Later we can try the scripts I implemented for MEMPHIS, which are
more complex and stress GPU memory management and copy performance. Except for
some of the recent scripts, I have regularly executed the nn scripts on GPU.
Outside of nn, LM and other simpler builtins should run without error.
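To run one of those scripts with the GPU backend outside the JUnit harness, you can also invoke the same entry point the CLI uses programmatically; a small sketch, with a hypothetical script path:
```java
// Sketch: run a DML script with the GPU backend enabled via the public
// DMLScript entry point (equivalent to passing -gpu on the command line).
// The script path below is a hypothetical placeholder.
import org.apache.sysds.api.DMLScript;

public class RunScriptOnGpu {
	public static void main(String[] args) throws Exception {
		DMLScript.main(new String[] {
			"-f", "scripts/nn/examples/Example.dml", // hypothetical path
			"-gpu",  // enable GPU instructions
			"-stats" // print the GPU-specific statistics shown in this thread
		});
	}
}
```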
For the ResNet example above, notice the large CuLibraries initialization
time, which I mentioned before. I am curious whether this issue is gone with
the new CUDA version.
`CUDA/CuLibraries init time: 0.686/669.978 sec.`
I agree with you that we need to improve the GPU testing framework in the
near future.
Let me know if these help. I can run some more examples later today.