Hi Ritesh,

On Sun, 15 Mar 2026 09:55:11 +0530
Ritesh Harjani (IBM) <[email protected]> wrote:
> Dan Horák <[email protected]> writes:
>
> +cc Gaurav,
>
> > Hi,
> >
> > starting with 7.0-rc1 (meaning 6.19 is OK) the amdgpu driver fails to
> > initialize on my Linux/ppc64le Power9 based system (with Radeon Pro WX4100)
> > with the following in the log
> >
> > ...
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: GART: 256M
> > 0x000000FF00000000 - 0x000000FF0FFFFFFF
> >          ^^^^
> So looks like this is a PowerNV (Power9) machine.

correct :-)

> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] Detected VRAM RAM=4096M, BAR=4096M
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] RAM width 128bits GDDR5
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: iommu: 64-bit OK but direct DMA is limited by 0
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: dma_iommu_get_required_mask: returning bypass mask 0xfffffffffffffff
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: 4096M of VRAM memory ready
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: 32570M of GTT memory ready.
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: (-12) failed to allocate kernel bo
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] Debug VRAM access will use slowpath MM access
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] GART: num cpu pages 4096, num gpu pages 65536
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: [drm] PCIE GART of 256M enabled (table at 0x000000F4FFF80000).
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: (-12) failed to allocate kernel bo
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: (-12) create WB bo failed
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: amdgpu_device_wb_init failed -12
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: amdgpu_device_ip_init failed
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: Fatal error during GPU init
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: finishing device.
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: probe with driver amdgpu failed with error -12
> > bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: ttm finalized
> > ...
> >
> > After some hints from Alex and bisecting and other investigation I have
> > found that
> > https://github.com/torvalds/linux/commit/1471c517cf7dae1a6342fb821d8ed501af956dd0
> > is the culprit and reverting it makes amdgpu load (and work) again.
>
> Thanks for confirming this. Yes, this was recently added [1]
>
> [1]: https://lore.kernel.org/linuxppc-dev/[email protected]/
>
> @Gaurav,
>
> I am not too familiar with the area, however looking at the logs shared
> by Dan, it looks like we might always be going down the dma direct
> allocation path, and maybe the device doesn't support this address limit.
>
> bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: iommu: 64-bit OK but direct DMA is limited by 0
> bře 05 08:35:40 talos.danny.cz kernel: amdgpu 0000:01:00.0: dma_iommu_get_required_mask: returning bypass mask 0xfffffffffffffff

A complete kernel log is at
https://gitlab.freedesktop.org/-/project/4522/uploads/c4935bca6f37bbd06bb4045c07d00b5b/kernel.log

Please let me know if you need more info.

	Dan

> Looking at the code..
>
> diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
> index fe7472f13b10..d5743b3c3ab3 100644
> --- a/kernel/dma/mapping.c
> +++ b/kernel/dma/mapping.c
> @@ -654,7 +654,7 @@ void *dma_alloc_attrs(struct device *dev, size_t size, dma_addr_t *dma_handle,
> 	/* let the implementation decide on the zone to allocate from: */
> 	flag &= ~(__GFP_DMA | __GFP_DMA32 | __GFP_HIGHMEM);
>
> -	if (dma_alloc_direct(dev, ops)) {
> +	if (dma_alloc_direct(dev, ops) || arch_dma_alloc_direct(dev)) {
> 		cpu_addr = dma_direct_alloc(dev, size, dma_handle, flag, attrs);
> 	} else if (use_dma_iommu(dev)) {
> 		cpu_addr = iommu_dma_alloc(dev, size, dma_handle, flag, attrs);
>
> Now, do we need arch_dma_alloc_direct() here? It always returns true if
> dev->dma_ops_bypass is set to true, without the checks that
> dma_go_direct() has.
>
> whereas...
>
> /*
>  * Check if the devices uses a direct mapping for streaming DMA operations.
>  * This allows IOMMU drivers to set a bypass mode if the DMA mask is large
>  * enough.
>  */
> static inline bool
> dma_alloc_direct(struct device *dev, const struct dma_map_ops *ops)
> 	...dma_go_direct(dev, dev->coherent_dma_mask, ops);
> ...
> #ifdef CONFIG_DMA_OPS_BYPASS
> 	if (dev->dma_ops_bypass)
> 		return min_not_zero(mask, dev->bus_dma_limit) >=
> 			dma_direct_get_required_mask(dev);
> #endif
>
> dma_alloc_direct() already checks for dma_ops_bypass, and also whether
> dev->coherent_dma_mask >= dma_direct_get_required_mask(). So...
>
> ... do we really need the machinery of arch_dma_{alloc|free}_direct()?
> Aren't dma_alloc_direct()'s checks sufficient?
>
> Thoughts?
>
> -ritesh
>
> >
> > for the record, I have originally opened
> > https://gitlab.freedesktop.org/drm/amd/-/issues/5039
> >
> >
> > With regards,
> >
> > 	Dan
