jinchengchenghh commented on issue #8851:
URL:
https://github.com/apache/incubator-gluten/issues/8851#issuecomment-2921527884
Q95 passes in a single thread but produces mismatched results with multiple
threads, apparently because of a memory issue. spark-rapids uses a semaphore to
control whether a task may execute on the GPU; the config
`spark.rapids.sql.concurrentGpuTasks` defaults to 2. I initially assumed GPU
memory could only be accessed by one thread at a time, but RMM supports
concurrent access to GPU memory, so the reason the memory gets corrupted is
still not clear.
```python
# Example: parallel data processing on separate CUDA streams.
# NOTE: illustrative pseudocode only -- the public cudf Python API does not
# expose a `stream` argument to read_csv or a `stream_pool` object;
# per-stream execution like this lives in libcudf (C++).
stream1 = stream_pool.get_stream()
stream2 = stream_pool.get_stream()

# Issue both reads asynchronously, one per stream
df1 = cudf.read_csv("data1.csv", stream=stream1)
df2 = cudf.read_csv("data2.csv", stream=stream2)
```
```
The RAPIDS Accelerator can further limit the number of tasks that are
actively sharing the GPU. It does this using a semaphore. When metrics or
documentation refers to the GPU semaphore it is referring to this. This
restriction is useful for avoiding GPU out of memory errors while still
allowing full concurrency for the portions of the job that are not executing on
the GPU. Care is taken to try and avoid doing I/O or other CPU operations while
the GPU semaphore is held. But in the case of a join two batches are required
for processing, and it is not always possible to avoid this case.
```
https://docs.nvidia.com/spark-rapids/user-guide/23.10/tuning-guide.html#number-of-concurrent-tasks-per-gpu
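The semaphore behavior described in the tuning guide can be sketched roughly as
follows. This is a minimal standalone model, not spark-rapids code: task names,
the thread pool size, and the bookkeeping are all made up for illustration; the
only real piece is that a counting semaphore sized by
`spark.rapids.sql.concurrentGpuTasks` (default 2) caps how many tasks touch GPU
memory at once, while CPU-side work runs unrestricted.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Mirrors spark.rapids.sql.concurrentGpuTasks = 2 (the default).
CONCURRENT_GPU_TASKS = 2
gpu_semaphore = threading.Semaphore(CONCURRENT_GPU_TASKS)

active = 0   # tasks currently "on the GPU"
peak = 0     # highest concurrency observed
lock = threading.Lock()

def run_task(task_id):
    global active, peak
    # CPU-side work (I/O, shuffle reads) runs without holding the semaphore,
    # so overall job concurrency stays high.
    time.sleep(0.001)
    with gpu_semaphore:  # acquire before touching GPU memory
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.01)  # stand-in for GPU work
        with lock:
            active -= 1

# Many more worker threads than GPU slots, to force contention.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(run_task, range(16)))

print(peak)  # never exceeds CONCURRENT_GPU_TASKS
```

Under this model no more than two tasks ever hold GPU memory simultaneously,
which is why I assumed the semaphore alone should prevent concurrent-access
corruption.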
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
For additional commands, e-mail: [email protected]