phaniarnab commented on PR #2271:
URL: https://github.com/apache/systemds/pull/2271#issuecomment-2957890205

   Sure, I will provide some tests @ReneEnjilian. The GPU backend is well tested: Mark added infrastructure to run all tests on the GPU, and I have kept using it for my GPU tests, which cover the cuDNN and cuSPARSE libraries as well as fused operators like conv2d-bias-add. Here are two examples from yesterday's run on my laptop:
   
   Grid search hyperparameter tuning for LM (no cuDNN):
   ```
   SystemDS Statistics:
   Total elapsed time:          9.070 sec.
   Total compilation time:              2.966 sec.
   Total execution time:                6.103 sec.
   Number of compiled Spark inst:       124.
   Number of executed Spark inst:       0.
   CUDA/CuLibraries init time:  1.123/1.621 sec.
   Number of executed GPU inst: 11452.
   GPU mem alloc time  (alloc(success/fail) / dealloc / set0):  0.061(0.061/0.000) / 0.058 / 0.154 sec.
   GPU mem alloc count (alloc(success/fail/reuse) / dealloc / set0):    3156(3156/0/17376) / 2527 / 20532.
   GPU mem tx time  (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)):      0.136(0.000/0.000) / 0.244(0.000/0.000) / 0.000(0.000/0.000) sec.
   GPU mem tx count (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)):      2605(0/0) / 2527(0/0) / 0(0/0).
   GPU conversion time  (sparseConv / sp2dense / dense2sp):     0.000 / 0.018 / 0.802 sec.
   GPU conversion count (sparseConv / sp2dense / dense2sp):     0 / 210 / 210.
   Cache hits (Mem/Li/WB/FS/HDFS):      6482/0/0/0/0.
   Cache writes (Li/WB/FS/HDFS):        1/384/0/1.
   Cache times (ACQr/m, RLS, EXP):      0.290/0.010/0.051/0.162 sec.
   HOP DAGs recompiled (PRED, SB):      525/6405.
   HOP DAGs recompile time:     1.417 sec.
   Functions recompiled:                2.
   Functions recompile time:    0.008 sec.
   Spark ctx create time (lazy):        0.000 sec.
   Spark trans counts (par,bc,col):0/0/0.
   Spark trans times (par,bc,col):      0.000/0.000/0.000 secs.
   Async. OP count (pf,bc,op):  260/0/0.
   Total JIT compile time:              36.875 sec.
   Total JVM GC count:          2.
   Total JVM GC time:           0.014 sec.
   Heavy hitter instructions:
     #  Instruction  Time(s)  Count
     1  m_lm           4.865    525
     2  m_lmDS         4.744    525
     3  gpu_solve      2.118    525
     4  gpu_+          0.955   1225
     5  l2norm         0.401    525
     6  leftIndex      0.247   1925
     7  write          0.163      1
     8  gpu_ba+*       0.126   1575
     9  gpu_*          0.125   1470
    10  gpu_append     0.093    700
   ```
   
   
   ResNet18 (w/ cuDNN):
   ```
   SystemDS Statistics:
   Total elapsed time:          681.490 sec.
   Total compilation time:              1.958 sec.
   Total execution time:                679.532 sec.
   CUDA/CuLibraries init time:  0.686/669.978 sec.
   Number of executed GPU inst: 258.
   GPU mem alloc time  (alloc(success/fail) / dealloc / set0):  0.016(0.016/0.000) / 0.000 / 0.021 sec.
   GPU mem alloc count (alloc(success/fail/reuse) / dealloc / set0):    99(99/0/635) / 4 / 734.
   GPU mem tx time  (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)):      9.119(0.000/0.000) / 0.002(0.000/0.000) / 0.000(0.000/0.000) sec.
   GPU mem tx count (toDev(d2f/s2d) / fromDev(f2d/s2h) / evict(d2s/size)):      88(0/0) / 4(0/0) / 0(0/0).
   GPU conversion time  (sparseConv / sp2dense / dense2sp):     0.000 / 1.102 / 0.000 sec.
   GPU conversion count (sparseConv / sp2dense / dense2sp):     0 / 74 / 0.
   Cache hits (Mem/Li/WB/FS/HDFS):      170/0/0/0/0.
   Cache writes (Li/WB/FS/HDFS):        18/0/0/0.
   Cache times (ACQr/m, RLS, EXP):      0.004/0.002/0.005/0.000 sec.
   HOP DAGs recompiled (PRED, SB):      0/241.
   HOP DAGs recompile time:     0.265 sec.
   Functions recompiled:                1.
   Functions recompile time:    0.024 sec.
   Total JIT compile time:              10.543 sec.
   Total JVM GC count:          1.
   Total JVM GC time:           0.015 sec.
   Heavy hitter instructions:
     #  Instruction          Time(s)  Count
     1  gpu_conv2d_bias_add  669.544     61
     2  resnet18_forward       9.731      1
     3  bn2d_forward           9.188     60
     4  gpu_batch_norm2d       9.110     60
     5  basic_block            8.823     24
     6  rand                   0.283    162
     7  gpu_*                  0.061     26
     8  gpu_uak+               0.051      1
     9  gpu_rightIndex         0.026      1
    10  getWeights             0.023     54
   ```
   
   You can simply set `AutomatedTestBase.TEST_GPU = true;` in any Java test to enable GPU execution (e.g., GPUFullReuseTest.java). For arbitrary DML scripts, you need to use the `-gpu` option, as you already found out. Notice the GPU-specific statistics above, which are well implemented and helpful. I will find a way to send you my scripts.
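
   As a minimal sketch, a GPU-enabled JUnit test following the usual `AutomatedTestBase` conventions could look like this (the package, class, test name, directory, and script names are hypothetical; only the `TEST_GPU` flag itself is taken from above):
   ```java
   package org.apache.sysds.test.gpu;

   import org.apache.sysds.test.AutomatedTestBase;
   import org.apache.sysds.test.TestConfiguration;
   import org.junit.Test;

   public class GPUSmokeTest extends AutomatedTestBase {
       // Hypothetical test name and directory; point these at the DML script under test.
       private static final String TEST_NAME = "GPUSmoke";
       private static final String TEST_DIR = "functions/gpu/";

       @Override
       public void setUp() {
           TEST_GPU = true; // enable the GPU backend for this test
           addTestConfiguration(TEST_NAME, new TestConfiguration(TEST_DIR, TEST_NAME, new String[] {"R"}));
       }

       @Test
       public void testRunOnGPU() {
           loadTestConfiguration(getTestConfiguration(TEST_NAME));
           fullDMLScriptName = SCRIPT_DIR + TEST_DIR + TEST_NAME + ".dml";
           programArgs = new String[] {"-stats"}; // keep -stats for the GPU counters shown above
           runTest(true, false, null, -1);
       }
   }
   ```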
   
   I recommend starting with simple scripts, such as the individual layers under nn/layers. Once those run fine, you can move on to the scripts under nn/examples and nn/networks. Later we can try the scripts I implemented for MEMPHIS, which are more complex and stress GPU memory management and copy performance. Except for some of the recent scripts, I have regularly executed the nn scripts on GPU. Outside of nn, LM and other simpler builtins should run without error. A quick way to launch such a script with the GPU backend is sketched below.
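
   For running an individual script with `-gpu` outside of JUnit, a minimal launcher sketch, assuming the standard `DMLScript` command-line entry point (the class name and script path are hypothetical; substitute any script under nn/layers or nn/examples):
   ```java
   import org.apache.sysds.api.DMLScript;

   // Illustrative launcher: runs one DML script with the GPU backend and statistics
   // enabled, equivalent to passing -gpu -stats on the command line.
   public class RunNnScriptOnGPU {
       public static void main(String[] args) throws Exception {
           DMLScript.main(new String[] {
               "-f", "scripts/nn/examples/MyExample.dml", // hypothetical path; substitute a real script
               "-gpu",  // route supported operators to the GPU backend
               "-stats" // print the GPU-specific statistics shown above
           });
       }
   }
   ```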
   
   For the ResNet example above, notice the large CuLibraries initialization time, which I mentioned before. I am curious whether this issue is gone with the new CUDA version:
   `CUDA/CuLibraries init time: 0.686/669.978 sec.`
   
   I agree with you that we need to improve the GPU testing framework in the near future.
   
   Let me know if these help. I can run some more examples later today.
   