On Mon, 26 May 2025 at 18:19, Christian König <christian.koe...@amd.com> wrote:
>
> Hi Tejun,
>
> On 5/23/25 19:06, Tejun Heo wrote:
> > Hello, Christian.
> >
> > On Fri, May 23, 2025 at 09:58:58AM +0200, Christian König wrote:
> > ...
> >>> - There's a GPU workload which uses a sizable amount of system memory
> >>>   for the pool being discussed in this thread. This GPU workload is
> >>>   very important, so we want to make sure that other activities in the
> >>>   system don't bother it. We give it plenty of isolated CPUs and
> >>>   protect its memory with high enough memory.low.
> >>
> >> That situation simply doesn't happen. See, isolation is *not* a
> >> requirement for the pool.
> > ...
> >> See, the submission model of GPUs is best effort. E.g. you don't
> >> guarantee any performance isolation between processes whatsoever. If we
> >> would start to do this we would need to start re-designing the HW.
> >
> > This is a radical claim. Let's table the rest of the discussion for now.
> > I don't know enough to tell whether this claim is true or not, but for
> > this to be true, the following should be true:
> >
> >   Whether the GPU memory pool is reclaimed or not doesn't have noticeable
> >   performance implications on the GPU performance.
> >
> > Is this true?
>
> Yes, that is true. Today the GPUs need the memory for correctness, not for
> performance anymore.
>
> The performance improvements we saw with this approach 15 or 20 years ago
> are negligible by today's standards.
>
> It's just that Windows still offers the functionality today, and when you
> bring up hardware on Linux you sometimes run into problems and find that
> the engineers who designed the hardware/firmware relied on having this.
>
> > As for the scenario that I described above, I didn't just come up with
> > it. I'm only supporting from the system side, but that's based on what
> > our ML folks are doing right now. We have a bunch of large machines with
> > multiple GPUs running ML workloads. The workloads can run for a long
> > time spread across many machines and they synchronize frequently, so any
> > performance drop on one GPU lowers utilization on all involved GPUs,
> > which can number up to three digits. For example, any scheduling
> > disturbance on the submitting thread propagates through the whole
> > cluster and slows down all involved GPUs.
>
> For the HPC/ML use case this feature is completely irrelevant. ROCm, CUDA,
> OpenCL, OpenMP etc... don't even expose something like this in their
> higher level APIs as far as I know.
What do we consider higher level here btw? HIP and CUDA both expose
something like hipHostMallocWriteCombined, and there is also
hipHostMallocCoherent, which may or may not have an effect. (Rough sketch of
what I mean at the end of this mail.)

> Where this matters is things like scanout on certain laptops, digital
> rights management in cloud gaming, and hacks for getting high end GPUs to
> work on ARM boards (e.g. Raspberry Pi etc...).
>
> > Also, because these machines are large on the CPU and memory sides too
> > and aren't doing a whole lot other than managing the GPUs, people want
> > to put a significant amount of CPU work on them, which can easily create
> > at least moderate memory pressure. Is the claim that the write combined
> > memory pool doesn't have any meaningful impact on the GPU workload
> > performance?
>
> When the memory pool is active on such systems I would strongly advise
> questioning why it is used in the first place.
>
> The main reason why we still need it for business today is cloud gaming.
> And for this particular use case you absolutely do want to share the pool
> between cgroups, or otherwise the whole use case breaks.

I'm still not convinced this is totally true, which means either I'm
misunderstanding how cloud gaming works or you are underestimating how
cgroups work.

My model for cloud gaming is that you have some sort of orchestrator
service running that spawns a bunch of games in their own cgroups, and
those games would want to operate as independently as possible.

Now if there is a top-level cgroup (or, failing that, the root cgroup) and
the game cgroups all sit underneath it, then I think this would operate
more optimally for each game, since:

a) if a game uses uncached memory continuously it will have its own pool of
uncached memory that doesn't get used by anyone else, making that game more
consistent.

b) if and when the game exits, the pool will be returned to the parent
cgroup, and this memory should then be reused by other games that are
started subsequently.

The only thing I'm not sure about is how the parent pool, once it has been
built up, gets used by new children; I need to spend more time reading the
list_lru code.

The list_lru change might actually be useful for us even without cgroups,
as it might be able to hide some of our per-numa stuff.

Dave.
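
P.S. To be concrete about the HIP side, this is roughly the allocation path
I had in mind. Untested sketch written from memory of the hipHostMalloc()
flags, so treat it as illustrative rather than authoritative:

    #include <hip/hip_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        void *wc = NULL;
        void *coh = NULL;

        /* Pinned host memory mapped write-combined: fine for streaming
         * CPU writes that the GPU then reads, painful for CPU reads. */
        hipError_t err = hipHostMalloc(&wc, 1 << 20,
                                       hipHostMallocWriteCombined);
        if (err != hipSuccess) {
            fprintf(stderr, "WC alloc failed: %s\n", hipGetErrorString(err));
            return 1;
        }

        /* Coherent pinned memory; whether this actually changes the
         * caching attributes depends on the platform, which is the
         * "may or may not have an effect" part above. */
        err = hipHostMalloc(&coh, 1 << 20, hipHostMallocCoherent);
        if (err != hipSuccess) {
            fprintf(stderr, "coherent alloc failed: %s\n",
                    hipGetErrorString(err));
            hipHostFree(wc);
            return 1;
        }

        hipHostFree(coh);
        hipHostFree(wc);
        return 0;
    }

I'm assuming the WC case ends up in the same uncached/write-combined page
handling on the kernel side, but I haven't traced that path end to end.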
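
P.P.S. For reference, roughly what I imagine the non-cgroup, per-node usage
of list_lru would look like for a pool like ours, paraphrased from
include/linux/list_lru.h. The wc_pool_* names are made up for illustration,
it's untested, and the add helper is spelled list_lru_add() rather than
list_lru_add_obj() on older kernels:

    #include <linux/list_lru.h>
    #include <linux/mm.h>

    /* hypothetical wrapper around one cached page in the pool */
    struct wc_pool_page {
        struct list_head lru;   /* linked into the list_lru */
        struct page *page;
    };

    static struct list_lru wc_pool_lru;

    /* plain list_lru_init() gives per-NUMA-node lists without memcg
     * awareness, so the per-numa bookkeeping is hidden behind the API */
    static int wc_pool_init(void)
    {
        return list_lru_init(&wc_pool_lru);
    }

    /* returning a page to the pool: the node is derived from the item
     * itself, we never deal with nid directly here */
    static void wc_pool_put(struct wc_pool_page *p)
    {
        list_lru_add_obj(&wc_pool_lru, &p->lru);
    }

    /* e.g. for a shrinker's count_objects() on one node */
    static unsigned long wc_pool_count(int nid)
    {
        return list_lru_count_node(&wc_pool_lru, nid);
    }

    static void wc_pool_fini(void)
    {
        list_lru_destroy(&wc_pool_lru);
    }

Actually taking pages back out would go through a walk with an isolate
callback, which is the part I still need to read up on.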