phaniarnab commented on PR #2050:
URL: https://github.com/apache/systemds/pull/2050#issuecomment-2231958815

   > > > > > > Thanks, @WDRshadow, for initiating the project. As discussed 
before, please experiment with realistic use cases such as parallel scoring and 
training. You can use our DNN built-ins.
   > > > > > 
   > > > > > 
   > > > > > Thank you for your comment. My partner @KexingLi22 is writing the test classes; we will have them soon. For DNN testing, we are in the awkward situation of not having suitable GPUs for testing. As I mentioned above, newer graphics cards cannot run with Jcuda `10.2.0`; to be precise, `CUDA 10.2` is not supported on the RTX 30 series, the A100, or newer graphics cards, and my test environment lacks older cards. Could you please help us test in a multi-GPU environment with suitable GPUs after we have written the test classes? Or could you provide a testing environment for us?
   > > > > 
   > > > > 
   > > > > Thanks for clarifying. Unfortunately, at this point we cannot provide a setup. Once you are done with the project, I can run some performance tests along with our performance test suites, but during the development period it is not feasible to try every change on our shared node. Without a proper setup of two GPUs, it will be very hard to complete this project. I can offer two possible directions from here:
   > > > > 
   > > > > 1. Try running SystemDS on this setup with one GPU on CUDA 10.2 and the other on CUDA 11. CUDA 11 has some API differences and may not be able to execute all CUDA methods, but you may still have a functioning system. However, I have never tried this myself and am unsure about the behavior.
   > > > > 2. Instead of multi-GPU, first implement a multi-stream single-GPU 
parfor. You need a single GPU with CUDA 10.2. You can use the Jcuda API to 
create multiple GPU streams, and assign a stream to each parfor thread. This is 
probably a better alternative.
   > > > 
   > > > 
   > > > We got a server with two RTX 2080 Ti GPUs and tested the scripts in `scripts/nn/example`. Except that `AttentionExample` cannot recognize the operator `_map` and `Example-MNIST_2NN_Leaky_ReLu_Softmax` cannot find the source file `mnist_2NN.dml`, the others run fine. But I know none of them are optimized for multiple GPUs; the only construct currently optimized for multiple GPUs is `parfor`. We will keep testing the scripts in `src/test/java/org/apache/sysds/test/functions/parfor` and write new test scripts for multi-GPU cases.
   > > 
   > > 
   > > Thanks. You do not have to optimize all NN workloads for multi-GPU; just implementing robust parfor support is sufficient for this project. Please write a scoring scenario using parfor: create a random matrix of test images and take one of the models. For each row, call the forward pass from within a parfor, allowing parallel scoring, and store the inferred class in a separate vector. I hope to see some performance improvement from utilizing multiple GPUs. The parfor tests are not ideal for this project, as the operations in those scripts were not targeted at GPUs, so you may not see any speedups; however, you can use those tests for unit testing. Did you verify that you are actually using both GPUs?
   > 
   > Thanks for your suggestion, @phaniarnab.
   > 
   > We have written a test class MultiGPUTest.java with a single-GPU test case and a multi-GPU test case that run a script in which an EfficientNet model is trained and then used for prediction with parfor.
   > 
   > Everything works well; the execution time with a single GPU is 35 sec 121 ms and with multiple GPUs is 27 sec 378 ms.
   > 
   > Following the advice from @WDRshadow, I also tried to add a logger instance to both ParForBody and GPUContext to trace the threads and GPU contexts, and I have already added the following to log4j.properties:
   > 
   > # Enable detailed logging for specific classes
   > log4j.logger.org.apache.sysds.runtime.controlprogram.parfor.ParForBody=DEBUG
   > log4j.logger.org.apache.sysds.runtime.instructions.gpu.context.GPUContext=DEBUG
   > 
   > But when I run the test dml script with the parfor function, nothing like the output I expected shows up, e.g.: 24/07/16 10:00:00 DEBUG ParForBody - Thread Thread-1 assigned to GPU context 0
   > 
   > How can I solve this problem?
   
   Thanks. The numbers do not look very good. Train just once and write the model to disk. In a separate script, read the model and infer the test instances within a parfor loop. Here is an example script [1]. You can even use a randomly initialized model, as we are not measuring accuracy here. I expect at least a 2x improvement. Vary the test size (i.e., the number of iterations of the parfor loop) from 10k to 100k.
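   
   For reference, a minimal DML sketch of such a scoring script; the model file name, matrix shapes, and the linear scorer are placeholders, not the actual EfficientNet built-ins:
   
   ```
   # infer.dml: read a model written once by the training script and score N
   # random test instances in parallel; N is passed via -nvargs (e.g., size=10000).
   W = read("model_weights", format="binary")   # placeholder file written by the training script
   # W = rand(rows=784, cols=10)                # a randomly initialized model also works here
   N = ifdef($size, 10000)                      # vary from 10k to 100k
   X = rand(rows=N, cols=784)                   # random test images, accuracy is not measured
   preds = matrix(0, rows=N, cols=1)            # inferred class per test instance
   parfor(i in 1:N) {
     scores = X[i,] %*% W                       # placeholder for the real forward pass
     preds[i,1] = as.scalar(rowIndexMax(scores))
   }
   write(preds, "predictions.csv", format="csv")
   ```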
   
   First focus on the development, unit testing, and experiments. The logger can be delayed. Instead, extend the ParForStatistics class to report the number of GPUs used by the parfor and other relevant details. These will be printed when -stats is passed.
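   
   (For instance, assuming the standard launcher, the sketch above would be run with something like `systemds infer.dml -stats -gpu -nvargs size=100000`, and the extended parfor counters would then show up in the -stats summary.)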


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@systemds.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
