> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, November 13, 2025 6:32 AM
>
> On Mon, Nov 10, 2025 at 12:06:30PM +0530, Borah, Chaitanya Kumar wrote:
> > Hello Jason,
> >
> > Hope you are doing well. I am Chaitanya from the linux graphics team in
> > Intel.
> >
> > This mail is regarding a regression we are seeing in our CI runs[1] on
> > linux-next repository.
> >
> > Since the version next-20251106 [2], we are seeing our tests timing out
> > presumably caused by a GPU Hang.
>
> Thank you for reporting this.
>
> I don't have any immediate theory, so I think it will need some
> debug. Maybe Kevin or Lu have some idea?
>
> Some general thoughts to check
>
> 1) Is there an iommu fault report? I did not see one in your dmesg,
>    but maybe it was truncated? It is more puzzling to see an iommu
>    related error and not see a fault report..
>
> 2) Could it be one of the special iommu behaviors to support iGPU that
>    is not working? Maybe we missed one?
>
> 3) I seem to recall Lu tested the coherent cache flushing, but that
>    would also be a good question, is this iGPU cache incoherent with
>    the CPU? Could this be a cache flushing bug? It is very hard to
>    test that so it would not be such a surprise if it has a bug..
>
> 4) Nobody has reported any other problems so far, so I'm inclined to
>    think the map/unmap is working - but maybe there is some edge case
>    the gpu driver is tripping up on?
>
> The lack of a fault report is very puzzling, even if it was #3 I would
> think a fault would be the most likely outcome of missing
> flushing.. The lack of a fault report suggests the wrong physical
> address was mapped as present which points to #4.
>
> Can you investigate a bit further and maybe see if we can get a bit
> more detail what that GPU thinks went wrong?
>

Below is probably the first error reported in that long log:

<7>[ 67.231149] [IGT] gem_exec_gttfill: starting subtest basic
<7>[ 67.232334] i915 0000:00:02.0: [drm:i915_gem_open [i915]]
<6>[ 67.233685] gem_exec_gttfil (1444): drop_caches: 4
<7>[ 67.233883] i915 0000:00:02.0: [drm:i915_drop_caches_set [i915]] Dropping caches: 0x00000070 [0x00000070]
<7>[ 67.316847] i915 0000:00:02.0: [drm:intel_power_well_disable [i915]] disabling always-on
<7>[ 67.793500] i915 0000:00:02.0: [drm:i915_drop_caches_set [i915]] Dropping caches: 0x00000070 [0x00000070]
<7>[ 67.826268] i915 0000:00:02.0: [drm:i915_drop_caches_set [i915]] Dropping caches: 0x0000005c [0x0000005c]
<7>[ 67.827484] i915 0000:00:02.0: [drm:intel_power_well_enable [i915]] enabling always-on
<7>[ 67.828791] [drm:eb_validate_vma [i915]] EINVAL at eb_validate_vma:509
<5>[ 68.824598] i915 0000:00:02.0: Using 46-bit DMA addresses
<3>[ 68.825482] i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: CT: Failed to process request 6000 (-EOPNOTSUPP)
<3>[ 68.825696] i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: CT: Failed to process CT message (-EOPNOTSUPP) 02 00 00 00 00 60 00 90 03 10 00 00
<3>[ 68.825790] i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: CT: Failed to process request 6000 (-EOPNOTSUPP)
<3>[ 68.825839] i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: CT: Failed to process CT message (-EOPNOTSUPP) 02 00 00 00 00 60 00 90 03 10 00 00
<6>[ 68.825974] i915 0000:00:02.0: [drm] GT0: GUC: CTB is dead - reason=0x40

There is a buffer holding messages from the device, and the latest
message contains an unsupported action number (0x6000) in
ct_process_request(). Likely the mapping for that buffer is incorrect,
either due to a stale IOTLB entry or a map/unmap corner case. I'm
inclined toward the former.

Chaitanya, could you check whether it's always the same test case
(gem_exec_gttfill) failing with the same type of error? If so, it would
be easier to insert some debug code around that buffer.
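
As a rough aid for reading the "Failed to process CT message" dump in the
log, here is a minimal sketch that splits the 12 bytes into little-endian
dwords and picks out the action field. The bit layout it uses is an
assumption on my side and should be checked against the GuC ABI headers in
the tree under test: CTB header with the dword count in bits 7:0 and the
fence in bits 31:16, followed by an HXG dword with origin in bit 31, type
in bits 30:28 and action in bits 15:0. decode_ct_message() is a
hypothetical helper name, not something from the driver.

```python
import struct

# Bytes exactly as printed in the "Failed to process CT message" log line.
DUMP = "02 00 00 00 00 60 00 90 03 10 00 00"


def decode_ct_message(dump):
    """Split a hex dump into dwords and decode the header fields.

    The field positions below are assumed, not taken from this thread.
    """
    raw = bytes.fromhex(dump)
    # The dump is a byte stream; interpret it as little-endian 32-bit words.
    dwords = struct.unpack("<%dI" % (len(raw) // 4), raw)
    hdr, hxg0 = dwords[0], dwords[1]
    return {
        "num_dwords": hdr & 0xFF,        # assumed: length of the HXG part
        "fence": hdr >> 16,              # assumed: fence in the top 16 bits
        "origin": hxg0 >> 31,            # assumed: 1 = sent by GuC
        "type": (hxg0 >> 28) & 0x7,      # assumed: HXG message type
        "action": hxg0 & 0xFFFF,         # assumed: action code
        "payload": list(dwords[2:]),     # remaining dwords, undecoded
    }


if __name__ == "__main__":
    msg = decode_ct_message(DUMP)
    print("action = 0x%04x" % msg["action"])
    print("payload =", [hex(d) for d in msg["payload"]])
```

With the bytes from the log this yields action 0x6000, which matches the
"request 6000" in the error line. So under the assumed layout the header
is at least self-consistent (two dwords announced, two present), and the
question is whether the content came from the right page.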
