RE: REGRESSION on linux-next (next-20251106)

Tian, Kevin Wed, 12 Nov 2025 18:00:38 -0800

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, November 13, 2025 6:32 AM
> 
> On Mon, Nov 10, 2025 at 12:06:30PM +0530, Borah, Chaitanya Kumar wrote:
> > Hello Jason,
> >
> > Hope you are doing well. I am Chaitanya from the linux graphics team in
> > Intel.
> >
> > This mail is regarding a regression we are seeing in our CI runs[1] on
> > linux-next repository.
> >
> > Since the version next-20251106 [2], we are seeing our tests timing out
> > presumably caused by a GPU Hang.
> 
> Thank you for reporting this.
> 
> I don't have any immediate theory, so I think it will need some
> debug. Maybe Kevin or Lu have some idea?
> 
> Some general thoughts to check
> 
> 1) Is there an iommu fault report? I did not see one in your dmesg,
>    but maybe it was truncated? It is more puzzling to see an iommu
>    related error and not see a fault report..
> 
> 2) Could it be one of the special iommu behaviors to support iGPU that
>    is not working? Maybe we missed one?
> 
> 3) I seem to recall Lu tested the coherent cache flushing, but that
>    would also be a good question, is this iGPU cache incoherent with
>    the CPU? Could this be a cache flushing bug? It is very hard to
>    test that so it would not be such a surprise if it has a bug..
> 
> 4) Nobody has reported any other problems so far, so I'm inclined to
>    think the map/unmap is working - but maybe there is some edge case
>    the gpu driver is tripping up on?
> 
> The lack of a fault report is very puzzling, even if it was #3 I would
> think a fault would be the most likely outcome of missing
> flushing.. The lack of a fault report suggests the wrong physical
> address was mapped as present which points to #4.
> 
> Can you investigate a bit further and maybe see if we can get a bit
> more detail what that GPU thinks went wrong?
>


Below is probably the first error reported in that long log:

<7>[   67.231149] [IGT] gem_exec_gttfill: starting subtest basic
<7>[   67.232334] i915 0000:00:02.0: [drm:i915_gem_open [i915]]
<6>[   67.233685] gem_exec_gttfil (1444): drop_caches: 4
<7>[   67.233883] i915 0000:00:02.0: [drm:i915_drop_caches_set [i915]] Dropping caches: 0x00000070 [0x00000070]
<7>[   67.316847] i915 0000:00:02.0: [drm:intel_power_well_disable [i915]] disabling always-on
<7>[   67.793500] i915 0000:00:02.0: [drm:i915_drop_caches_set [i915]] Dropping caches: 0x00000070 [0x00000070]
<7>[   67.826268] i915 0000:00:02.0: [drm:i915_drop_caches_set [i915]] Dropping caches: 0x0000005c [0x0000005c]
<7>[   67.827484] i915 0000:00:02.0: [drm:intel_power_well_enable [i915]] enabling always-on
<7>[   67.828791] [drm:eb_validate_vma [i915]] EINVAL at eb_validate_vma:509
<5>[   68.824598] i915 0000:00:02.0: Using 46-bit DMA addresses
<3>[   68.825482] i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: CT: Failed to process request 6000 (-EOPNOTSUPP)
<3>[   68.825696] i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: CT: Failed to process CT message (-EOPNOTSUPP) 02 00 00 00 00 60 00 90 03 10 00 00
<3>[   68.825790] i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: CT: Failed to process request 6000 (-EOPNOTSUPP)
<3>[   68.825839] i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: CT: Failed to process CT message (-EOPNOTSUPP) 02 00 00 00 00 60 00 90 03 10 00 00
<6>[   68.825974] i915 0000:00:02.0: [drm] GT0: GUC: CTB is dead - reason=0x40

There is a buffer holding messages from the device, and the latest message
contains an unsupported action number (0x6000), which ct_process_request()
rejects with -EOPNOTSUPP.

Most likely the mapping for that buffer is incorrect, due to either a stale
IOTLB entry or a map/unmap corner case. I'm inclined toward the former.
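
For reference, the 12 bytes dumped in the "Failed to process CT message"
line can be decoded by hand. The sketch below is only illustrative: the
bit layout (CTB header with the dword count in bits 7:0, format in bits
15:12 and fence in bits 31:16, followed by an HXG dword with origin in
bit 31, type in bits 30:28 and action in bits 15:0) is an assumption
about the i915 GuC ABI and should be checked against the headers in the
tree. decode_ct_message() is a hypothetical helper name, not a driver
function.

```python
import struct

# Payload bytes copied from the "Failed to process CT message" log line.
RAW = bytes.fromhex("02 00 00 00 00 60 00 90 03 10 00 00")


def decode_ct_message(raw: bytes) -> dict:
    """Split a dumped CT message into fields (layout is an assumption)."""
    # The dump is a sequence of little-endian 32-bit dwords.
    dwords = struct.unpack("<%dI" % (len(raw) // 4), raw)
    header, hxg = dwords[0], dwords[1]
    return {
        # Assumed CTB header layout.
        "num_dwords": header & 0xFF,
        "format": (header >> 12) & 0xF,
        "fence": (header >> 16) & 0xFFFF,
        # Assumed HXG message dword layout.
        "origin": (hxg >> 31) & 0x1,
        "type": (hxg >> 28) & 0x7,
        "data0": (hxg >> 16) & 0xFFF,
        "action": hxg & 0xFFFF,
        # Whatever follows is treated as opaque payload.
        "payload": list(dwords[2:]),
    }


if __name__ == "__main__":
    fields = decode_ct_message(RAW)
    print({name: (hex(v) if isinstance(v, int) else [hex(x) for x in v])
           for name, v in fields.items()})
```

Under that assumed layout the action field comes out as 0x6000, matching
the "request 6000" line, and the header's dword count (2) matches the
length of the dump, so the message looks well-formed apart from the
action number.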

Chaitanya, could you check whether it is always the same test case
(gem_exec_gttfill) that fails with the same type of error?

If so, it will be easier to insert some debug code around that
buffer.
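
One way to do that check across several CI logs is sketched below. It
assumes unwrapped dmesg lines in the format shown above ("[IGT] <test>:
starting subtest <name>" markers and "*ERROR*" lines); first_errors() and
the file name in the usage note are hypothetical, not part of IGT or the
CI tooling.

```python
import re

# Marker that IGT writes to the kernel log when a subtest starts.
SUBTEST_RE = re.compile(r"\[IGT\] (\S+): starting subtest (\S+)")
# Any drm error line; the text after the tag is captured.
ERROR_RE = re.compile(r"\*ERROR\* (.*)")


def first_errors(lines):
    """Map each (test, subtest) to the first *ERROR* text seen while it ran."""
    current = None
    result = {}
    for line in lines:
        match = SUBTEST_RE.search(line)
        if match:
            current = (match.group(1), match.group(2))
            continue
        match = ERROR_RE.search(line)
        # Record only the first error per subtest; later ones are ignored.
        if match and current is not None and current not in result:
            result[current] = match.group(1).strip()
    return result
```

Running it over each failing run, for example
first_errors(open("dmesg.txt", errors="replace")), and comparing the
resulting dictionaries shows whether the first error is always the GuC CT
failure in gem_exec_gttfill or whether it moves between tests.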
