Re: [PR] [SYSTEMDS-2951] Multi-GPU Support for End-to-End ML Pipelines [systemds]

via GitHub Mon, 15 Jul 2024 02:12:37 -0700


WDRshadow commented on PR #2050:
URL: https://github.com/apache/systemds/pull/2050#issuecomment-2228035321


   > > > > > Thanks, @WDRshadow, for initiating the project. As discussed 
before, please experiment with realistic use cases such as parallel scoring and 
training. You can use our DNN built-ins.
   > > > > 
   > > > > 
   > > > > Thank you for your comment. My partner @KexingLi22 is writing the 
test classes. We will see it soon. For DNN testing, we were faced with the 
awkward situation of not having enough suitable GPUs for testing. As I 
mentioned above, newer graphics cards can not run on Jcuda `10.2.0`. To be 
precise, `CUDA 10.2` is not supported by RTX30 series, A100 and newer graphics 
cards. My test environment lacks older graphics cards. Could you please help us 
to test in a multi-GPU environment with suitable GPUs after we have written the 
test classes? Or again, could you provide any testing environment for us?
   > > > 
   > > > 
   > > > Thanks for clarifying. Unfortunately, at this point, we cannot provide 
a setup. Once you are done with the project, I can run some performance tests 
along with our performance test suites. But during the development period, it is 
not feasible to try every change in our shared node. Without a proper setup of 
two GPUs, it will be very hard to complete this project. I can offer two 
possible directions from here:
   > > > 
   > > > 1. Try running SystemDS on this setup with one GPU at CUDA 10.2 and 
the other at 11. CUDA 11 has some API differences and may not be able to 
execute all CUDA methods, but you may still have a functioning system. However, 
I have never tried this myself and am unsure about the behavior.
   > > > 2. Instead of multi-GPU, first implement a multi-stream single-GPU 
parfor. You need a single GPU with CUDA 10.2. You can use the Jcuda API to 
create multiple GPU streams, and assign a stream to each parfor thread. This is 
probably a better alternative.
   > > 
   > > 
   > > We got a dual RTX2080Ti server and tested the scripts in 
`scripts/nn/example`. Except for `AttentionExample`, which cannot recognize the 
operator `_map`, and `Example-MNIST_2NN_Leaky_ReLu_Softmax`, which cannot find 
the source file `mnist_2NN.dml`, the others run fine. But I know none of them 
are optimized for multiple GPUs. The only function that is currently optimized 
for multiple GPUs is `parfor`. We will keep testing the scripts in 
`src/test/java/org/apache/sysds/test/functions/parfor` and write new test 
scripts for multi-GPU cases.
   > 
   > Thanks. You do not have to optimize all NN workloads for multi-GPU. Just 
implementing robust parfor support is sufficient for this project. Please 
write a scoring scenario using parfor. Create a random matrix of test images and 
take one of the models. For each row, call the forward path from within a 
parfor, allowing parallel scoring. Store the inferred class in a separate 
vector. I hope to see some performance improvement from utilizing multiple GPUs. 
The parfor tests are not ideal for this project, as the operations in those 
scripts were not targeted at GPUs. You may not see any speedups. However, you 
can use those tests for unit testing. Did you verify that you are actually 
using both GPUs?
   
   Thanks for the suggestion. It will help @KexingLi22 write the test 
cases. 
   
   SystemDS does use multiple GPUs for the parfor computation. We verified 
this in two ways:
   
   1. In our simple test case, we used parfor to multiply two 10000 x 10000 
matrices, and the runtime with two GPUs was significantly lower than with a 
single GPU.
   
   2. In the Java debugger, we can clearly see that during the parallel 
computation in the `executeLocalParFor` function of the `ParForProgramBlock` 
class, the `LocalParWorker` instances and their `Thread`s, one per GPU, each 
take on several computation `Task`s. For example, in one matrix computation 
using `parfor`, the RTX4070 executed 8 Tasks while the GTX1080 executed 4. 
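   
   The per-worker task counts in item 2 can be illustrated with a minimal, 
self-contained Java sketch. This is not SystemDS code: the class 
`WorkerTaskCount` and its `run` method are hypothetical names, and the workers 
here only pull integers from a shared queue instead of executing GPU `Task`s. 
It only shows how worker threads draining a common task queue can end up with 
different task counts, which is the kind of split we observed in the debugger.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not part of SystemDS.
class WorkerTaskCount {

    // Starts numWorkers threads that drain a shared queue of numTasks tasks
    // and returns how many tasks each worker executed.
    static int[] run(int numWorkers, int numTasks) {
        ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < numTasks; i++) {
            queue.add(i);
        }

        AtomicInteger[] counts = new AtomicInteger[numWorkers];
        Thread[] threads = new Thread[numWorkers];
        for (int w = 0; w < numWorkers; w++) {
            counts[w] = new AtomicInteger();
            final int id = w;
            threads[w] = new Thread(() -> {
                // Each worker takes the next available task until none remain.
                // A real worker would run the task on its assigned device here.
                while (queue.poll() != null) {
                    counts[id].incrementAndGet();
                }
            });
        }

        for (Thread t : threads) {
            t.start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }

        int[] result = new int[numWorkers];
        for (int w = 0; w < numWorkers; w++) {
            result[w] = counts[w].get();
        }
        return result;
    }

    public static void main(String[] args) {
        int[] perWorker = run(2, 12);
        for (int w = 0; w < perWorker.length; w++) {
            System.out.println("worker " + w + " executed " + perWorker[w] + " tasks");
        }
    }
}
```

   The split between workers depends on thread scheduling and varies from run 
to run; only the total is fixed. In the sketch a faster worker simply polls the 
queue more often, so it ends up with more tasks.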
   
   We will demonstrate both in our test code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
