WDRshadow commented on PR #2050:
URL: https://github.com/apache/systemds/pull/2050#issuecomment-2241790721

   @phaniarnab  We have changed our code and tested again. The reported time now 
includes only the `parfor` execution time. Each `parfor` test was run 3 times and 
we report the mean. Here is the record:
   
   | test_id             | num_iteration | 1_gpu_time_sec | 2_gpu_time_sec | boost_rate |
   |:--------------------|--------------:|---------------:|---------------:|-----------:|
   | test01_gpuTest_10k  |         10000 |            2.0 |            1.7 |      15.0% |
   | test01_gpuTest_20k  |         20000 |            4.0 |            3.0 |      25.0% |
   | test01_gpuTest_50k  |         50000 |           11.0 |            7.3 |      33.7% |
   | test01_gpuTest_100k |        100000 |           22.3 |           15.0 |      32.7% |
   | test01_gpuTest_200k |        200000 |           46.0 |           31.3 |      31.9% |
   | test01_gpuTest_500k |        500000 |          109.3 |           79.3 |      27.5% |
   | Total               |        880000 |          194.6 |          137.6 |      29.3% |
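   
   (For reference, `boost_rate` here is presumably the relative runtime reduction, i.e. `(1_gpu_time_sec - 2_gpu_time_sec) / 1_gpu_time_sec`; e.g. for the 10k case, `(2.0 - 1.7) / 2.0 = 15.0%`.)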
   
   ### Test environment: 
   - CPU: `24 vCPU Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz`
   - GPU: `RTX2080Ti` * 2
   - RAM: `80 GB`
   - OS: `Ubuntu 18.04`
   - CUDA: `10.2`
   
   ### Comments:
   
   From the table it can be seen that the `boost_rate` does not reach the 
desired `50%`. This is likely due to under-optimisation of `LocalParWorker` or 
of GPU memory management. We have observed the following factors that may limit 
the multi-GPU optimisation:
   1. Multiple GPUs share one storage structure that is guarded by 
synchronisation locks. For example, `_gpuObjects` holds, for each `Task`, the 
cached data that the GPUs read and update; every access from a GPU therefore 
has to acquire the lock and may block (see the first sketch after this list).
   2. The `TaskPartitioner` design may not be optimal. With a large amount of 
data but a small number of `threads`, only a few `Tasks` are created, so each 
individual `Task` becomes large. If a GPU computation fails, the affected `Task` 
must be recomputed, which costs more time when that `Task` is large. This can 
be mitigated by a finer-grained `Task` allocation (see the second sketch after 
this list).
   3. The speedup improves greatly when the `Tasks` are divided equally among 
the GPUs and no computation errors occur. However, I have observed that with 
two graphics cards, one card may execute more `Tasks` than the other.
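   
   To illustrate point 1, here is a minimal Java sketch (hypothetical, not the actual SystemDS implementation) of the contention pattern: a single cache shared by all GPU workers serialises every lookup on one lock, whereas per-device caches let workers on different GPUs proceed independently. The class and method names below are made up for the example; only `_gpuObjects` refers to the real field discussed above.
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;
   
   // Sketch of the locking behaviour described in point 1 (not SystemDS code).
   public class GpuObjectCacheSketch {
   
     // Shared cache (analogous to a single _gpuObjects-style structure):
     // every lookup from any GPU worker synchronises on the same lock,
     // so concurrent readers block each other.
     static final Map<String, Object> sharedCache = new HashMap<>();
   
     static Object getShared(String varName) {
       synchronized (sharedCache) {   // all devices contend on this one lock
         return sharedCache.get(varName);
       }
     }
   
     // Alternative: one cache per device, so lookups on different GPUs
     // never touch the same lock (cross-device transfers handled elsewhere).
     static final Map<Integer, Map<String, Object>> perDeviceCache = new ConcurrentHashMap<>();
   
     static Object getPerDevice(int deviceId, String varName) {
       return perDeviceCache
           .computeIfAbsent(deviceId, d -> new ConcurrentHashMap<>())
           .get(varName);
     }
   }
   ```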
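   
   For points 2 and 3, the following sketch (again hypothetical, not the actual `TaskPartitioner`) shows the idea: splitting the iteration range into more, smaller `Tasks` bounds the cost of re-executing a failed `Task`, and a round-robin assignment keeps the per-GPU task counts nearly equal.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   
   // Sketch of finer-grained task partitioning and balanced assignment (not SystemDS code).
   public class TaskPartitionSketch {
   
     // A task is a contiguous range of parfor iterations [from, to).
     static class Task {
       final long from, to;
       Task(long from, long to) { this.from = from; this.to = to; }
     }
   
     // Split [0, numIter) into chunks of at most maxTaskSize iterations:
     // smaller chunks make a failed task cheaper to recompute.
     static List<Task> partition(long numIter, long maxTaskSize) {
       List<Task> tasks = new ArrayList<>();
       for (long i = 0; i < numIter; i += maxTaskSize)
         tasks.add(new Task(i, Math.min(i + maxTaskSize, numIter)));
       return tasks;
     }
   
     // Round-robin assignment: with many small tasks, each GPU receives
     // (almost) the same number of tasks, avoiding the imbalance from point 3.
     static List<List<Task>> assign(List<Task> tasks, int numGpus) {
       List<List<Task>> perGpu = new ArrayList<>();
       for (int g = 0; g < numGpus; g++)
         perGpu.add(new ArrayList<>());
       for (int t = 0; t < tasks.size(); t++)
         perGpu.get(t % numGpus).add(tasks.get(t));
       return perGpu;
     }
   
     public static void main(String[] args) {
       // Example: 100000 iterations, 5000-iteration tasks, 2 GPUs -> 10 tasks per GPU.
       List<List<Task>> plan = assign(partition(100_000, 5_000), 2);
       System.out.println(plan.get(0).size() + " tasks vs " + plan.get(1).size() + " tasks");
     }
   }
   ```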

