[I] GPU and CPU mixed cluster schedule [incubator-gluten]

via GitHub Fri, 30 Jan 2026 00:17:29 -0800


jinchengchenghh opened a new issue, #11524:
URL: https://github.com/apache/incubator-gluten/issues/11524


   ### Backend
   
   VL (Velox)
   
   ### Bug description
   
   We want to schedule IO-bound tasks, such as stages that contain a table 
scan, on CPU nodes, and computation-intensive tasks on GPU nodes.
   Spark already supports stage-level resource scheduling through resource 
profiles, as this document describes: 
https://spark.apache.org/docs/latest/configuration.html#custom-resource-scheduling-and-configuration-overview
   Gluten already adjusts offheap/onheap memory allocation through 
ResourceProfile.
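
As a rough illustration of the custom resource configuration that the Spark document above covers, the following Python sketch builds the `--conf` flags that request GPUs for the default resource profile. The property names (`spark.executor.resource.gpu.amount`, `spark.task.resource.gpu.amount`, `spark.executor.resource.gpu.discoveryScript`) come from the Spark configuration page; the helper names `gpu_resource_confs` and `to_submit_args` and the discovery script path are hypothetical and only serve the example. This is not how Gluten sets its ResourceProfile; it only shows the cluster-level settings a per-stage profile would build on.

```python
from typing import Dict, List, Optional


def gpu_resource_confs(
    executor_gpus: int = 1,
    task_gpus: float = 1.0,
    discovery_script: Optional[str] = None,
) -> Dict[str, str]:
    """Return Spark properties that request GPUs for executors and tasks."""
    if executor_gpus < 1:
        raise ValueError("executor_gpus must be at least 1")
    if task_gpus <= 0 or task_gpus > executor_gpus:
        raise ValueError("task_gpus must be in (0, executor_gpus]")
    confs = {
        # Number of GPUs each executor asks the cluster manager for.
        "spark.executor.resource.gpu.amount": str(executor_gpus),
        # GPUs needed per task; a fraction lets several tasks share one GPU.
        "spark.task.resource.gpu.amount": format(task_gpus, "g"),
    }
    if discovery_script is not None:
        # Script the executor runs to report which GPU addresses it owns.
        confs["spark.executor.resource.gpu.discoveryScript"] = discovery_script
    return confs


def to_submit_args(confs: Dict[str, str]) -> List[str]:
    """Flatten the properties into spark-submit style --conf arguments."""
    args: List[str] = []
    for key in sorted(confs):
        args += ["--conf", f"{key}={confs[key]}"]
    return args


if __name__ == "__main__":
    # Hypothetical script path; replace with the one deployed on your nodes.
    example = gpu_resource_confs(1, 0.25, "/opt/spark/getGpusResources.sh")
    print(" ".join(to_submit_args(example)))
```

Setting the task amount to a fraction such as 0.25 lets four tasks share one GPU, which may or may not suit the GPU stages discussed here.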
   
   This script describes how to set up the GPU host environment. It has 
already been executed on the IBM internal AMI Linux image, so if you use the 
IBM pipeline `pipeline-create-dev-vm` and select a GPU node such as 
g4dn.xlarge, the environment is ready and there is no need to run the script.
   
https://raw.githubusercontent.com/jinchengchenghh/gluten/cudf_script/dev/start_cudf_amazon.sh
   Note: The environment has been upgraded to CUDA 13.1 because of a cuDF 
build issue, but the script still installs CUDA 12.8, so it is outdated.
   
   These documents describe how to set up YARN on a GPU node.
   
https://docs.nvidia.com/spark-rapids/user-guide/23.10/getting-started/yarn-gpu.html
   
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/UsingGpus.html
   https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-rapids.html
   
   The GPU document describes how to build Gluten with GPU support.
   
https://github.com/apache/incubator-gluten/blob/main/docs/get-started/VeloxGPU.md
   
   The existing offheap/onheap memory allocation is done through 
ResourceProfile; we should set the profile in a similar way to require 1 GPU. 
Spark currently cannot set the core number through a resource profile; this 
feature is under development.
   https://github.com/apache/incubator-gluten/pull/8209
   
   We could use TPC-DS q95 to test.
   
   The query runs successfully on YARN, but if we set up the environment 
according to 
https://docs.nvidia.com/spark-rapids/user-guide/23.10/getting-started/yarn-gpu.html
 the query hangs. I also tried standalone mode earlier, and it hangs there as 
well.
   
   
   ### Gluten version
   
   _No response_
   
   ### Spark version
   
   None
   
   ### Spark configurations
   
   _No response_
   
   ### System information
   
   _No response_
   
   ### Relevant logs
   
   ```bash
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


