On Mon, Nov 17, 2025 at 08:54:59PM +0800, Baolu Lu wrote:
> On 11/13/2025 6:32 AM, Jason Gunthorpe wrote:
> > On Mon, Nov 10, 2025 at 12:06:30PM +0530, Borah, Chaitanya Kumar wrote:
> > > Hello Jason,
> > >
> > > Hope you are doing well. I am Chaitanya from the linux graphics team in
> > > Intel.
> > >
> > > This mail is regarding a regression we are seeing in our CI runs[1] on
> > > the linux-next repository.
> > >
> > > Since the version next-20251106 [2], we are seeing our tests timing out,
> > > presumably caused by a GPU hang.
> >
> > Thank you for reporting this.
> >
> > I don't have any immediate theory, so I think it will need some
> > debug. Maybe Kevin or Lu have some idea?
> >
> > Some general thoughts to check:
> >
> > 1) Is there an iommu fault report? I did not see one in your dmesg,
> >    but maybe it was truncated? It is more puzzling to see an iommu
> >    related error and not see a fault report..
> >
> > 2) Could it be one of the special iommu behaviors to support iGPU that
> >    is not working? Maybe we missed one?
> >
> > 3) I seem to recall Lu tested the coherent cache flushing, but that
> >    would also be a good question: is this iGPU cache incoherent with
> >    the CPU? Could this be a cache flushing bug? It is very hard to
> >    test, so it would not be such a surprise if it has a bug..
>
> I had the chance to remotely access the test machine. The test device is
> 00:02.0. It has a dedicated IOMMU with the capabilities listed below:
>
> CAP  0x08 0xc9de008cee690462
> ECAP 0x10 0x0012ca9a00f0ef5e
>
> ECAP.SMPWC=0, which means this IOMMU unit's hardware has a non-coherent
> page walker. Kernel v6.18-rc5 works, but when the changes in the
> iommu/next branch are merged, the test case fails with a GPU hang.

Okay, so it probably is coherent walker related somehow..
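For reference, that SMPWC reading can be cross-checked straight from the
ECAP dump above. A minimal user-space sketch (assuming the usual bit-48
position for SMPWC, as in the driver's ecap_smpwc() macro) would be:

/*
 * Sketch only: decode ECAP.SMPWC (Scalable-Mode Page-Walk Coherency)
 * from the raw extended capability value dumped above. Bit 48 is
 * assumed from the VT-d spec / ecap_smpwc() in the intel-iommu driver.
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t ecap = 0x0012ca9a00f0ef5eULL;	/* ECAP value from the dump */
	int smpwc = (ecap >> 48) & 0x1;

	printf("ECAP.SMPWC = %d (%s page walker)\n",
	       smpwc, smpwc ? "coherent" : "non-coherent");
	return 0;
}

With the dumped value this prints SMPWC = 0, i.e. a non-coherent page
walker, matching the observation above.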
> The PASID table entry with the v6.18-rc5 kernel:
> 0x00000001067fc000:0x0000000000000002:0x0000000000000049
> The PASID table entry with the iommu-next kernel:
> 0x0000000105a86000:0x0000000000000002:0x0000000000000049
>
> They are the same, except for the page table pointer.

Ok, that's good.

> On another machine, I opted out of the ECAP.SMPWC capability and found
> that the clflush works for an idxd device, as shown below:
>
> # dmesg | grep clflush_cache_range | grep "idxd 0000:00:02.0"
> [ 45.199811] idxd 0000:00:02.0: DMAR: clflush_cache_range: 0xffff9ff04eaaf000, 1000
> [ 45.200923] idxd 0000:00:02.0: DMAR: clflush_cache_range: 0xffff9ff046352000, 8
> [ 45.202082] idxd 0000:00:02.0: DMAR: clflush_cache_range: 0xffff9ff052bd3000, 1000
> [ 45.203184] idxd 0000:00:02.0: DMAR: clflush_cache_range: 0xffff9ff04eaaf000, 8
> [ 45.204236] idxd 0000:00:02.0: DMAR: clflush_cache_range: 0xffff9ff052bd2000, 1000
> [ 45.205318] idxd 0000:00:02.0: DMAR: clflush_cache_range: 0xffff9ff052bd3018, 8
> [ 45.206370] idxd 0000:00:02.0: DMAR: clflush_cache_range: 0xffff9ff052bd1000, 1000
> [ 45.207451] idxd 0000:00:02.0: DMAR: clflush_cache_range: 0xffff9ff052bd2ff8, 8
> [ 45.208503] idxd 0000:00:02.0: DMAR: clflush_cache_range: 0xffff9ff052bd1ff0, 8
> [ 45.209555] idxd 0000:00:02.0: DMAR: clflush_cache_range: 0xffff9ff052bd1ff8, 8
>
> It appears that new page table allocations and page table entry
> modifications are all followed by a clflush_cache_range().

Yeah, but maybe something was missed? Or it is not working at all (like
scrambled values or???)

Is there any evidence that flushing is working at all on the iGPU? Is it
totally broken, or is there some rarer corner case that is not flushing?
This is the least testable part of it, so it is a good place to have a
bug..

Hopefully today I can write you a little patch that will force flush
everything. It will be really slow, but if it fixes the issue it would
prove the cache flushing is at fault???
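The rough shape of that debug hack would be something like the sketch
below. It is only an illustration of the idea, not the actual patch: the
helper and where it would be called from are hypothetical, and only
clflush_cache_range() is the real x86 interface.

/*
 * Debug sketch only: unconditionally write page-table memory back to
 * DRAM after every update, ignoring the coherency capability. The
 * helper name and its call sites are made up for illustration;
 * clflush_cache_range() is the real helper declared in
 * <asm/cacheflush.h> on x86.
 */
#include <linux/types.h>
#include <asm/cacheflush.h>

/* Call after allocating a new table page or writing any entry. */
static inline void force_flush_pt(void *addr, unsigned int size)
{
	/*
	 * Normally the flush is skipped when the page walker is
	 * coherent (ECAP.SMPWC, or the domain's coherency flag).
	 * Forcing it everywhere is very slow, but if the iGPU hang
	 * disappears it points squarely at the cache flushing path.
	 */
	clflush_cache_range(addr, size);
}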
Jason