On Mon, Nov 10, 2025 at 12:06:30PM +0530, Borah, Chaitanya Kumar wrote:
> Hello Jason,
> 
> Hope you are doing well. I am Chaitanya from the linux graphics team in
> Intel.
> 
> This mail is regarding a regression we are seeing in our CI runs[1] on
> linux-next repository.
> 
> Since the version next-20251106 [2], we are seeing our tests timing out
> presumably caused by a GPU Hang.
> 
> `````````````````````````````````````````````````````````````````````````````````
> <6> [490.872058] i915 0000:00:02.0: [drm] Got hung context on vcs0 with
> active request 939:2 [0x1004] not yet started
> <6> [490.875244] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:baffffff
> <7> [496.424189] i915 0000:00:02.0: [drm:intel_guc_context_reset_process_msg
> [i915]] GT1: GUC: Got context reset notification: 0x1004 on vcs0, exiting =
> no, banned = no
> <6> [496.921551] i915 0000:00:02.0: [drm] Got hung context on vcs0 with
> active request 939:2 [0x1004] not yet started
> <6> [496.924799] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:baffffff
> <4> [499.946641] [IGT] Per-test timeout exceeded. Killing the current test
> with SIGQUIT.
> `````````````````````````````````````````````````````````````````````````````````
> Detailed log can be found in [3].

Chaitanya, can you check these two debugging patches:

https://github.com/jgunthorpe/linux/commits/for-borah

10635ad3ff26a0 DEBUGGING: Force flush the whole cpu cache for the page table on 
every map operation
2789602b882499 DEBUGGING: Force flush the whole iotlb on every map operation

Please run a test with each of them applied *individually* and report
back what changes in the test. The "cpu cache" one may oops or
something; we are just looking to see whether it gets past the error
Kevin pointed to:

<7>[   67.231149] [IGT] gem_exec_gttfill: starting subtest basic
[..]
<5>[   68.824598] i915 0000:00:02.0: Using 46-bit DMA addresses
<3>[   68.825482] i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: CT: Failed to 
process request 6000 (-EOPNOTSUPP)

I could not test these patches, so they may not work at all.

Also, I'd like to know if this is happening 100% reproducibly or if it
is flaky. And just to confirm, this is 68s after boot and right after
starting the first test, so it looks like the test is just not working
at all?

I'm still interested to know if there is an iommu error that is
somehow not getting into the log?

It would also help to collect the trace points:

int iommu_map_nosync(struct iommu_domain *domain, unsigned long iova,
		phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
{
[..]
                trace_map(orig_iova, orig_paddr, orig_size);

And

static size_t __iommu_unmap(struct iommu_domain *domain,
                            unsigned long iova, size_t size,
                            struct iommu_iotlb_gather *iotlb_gather)
{
[..]
        trace_unmap(orig_iova, size, unmapped);

As well as some instrumentation to identify the IOVA involved in the
above error for request 6000.

Finally, it is interesting that this test prints this:

<5>[   68.824598] i915 0000:00:02.0: Using 46-bit DMA addresses

Which comes from here:

        if (dma_limit > DMA_BIT_MASK(32) && dev->iommu->pci_32bit_workaround) {
                iova = alloc_iova_fast(iovad, iova_len,
                                       DMA_BIT_MASK(32) >> shift, false);
                if (iova)
                        goto done;

                dev->iommu->pci_32bit_workaround = false;
                dev_notice(dev, "Using %d-bit DMA addresses\n", bits_per(dma_limit));
        }

Which means dma-iommu has exceeded the 32 bit pool and is allocating
high addresses now? 

It prints that and then immediately fails? Seems like a clue!

Is there a failing map call, perhaps due to the driver setting up the
wrong iova range for the table? iommupt is strict about enforcing the
IOVA limitation. A failing map call might produce this outcome (though
I would expect an iommu error log).

The map traces only log on success though, so please add a print on
the failure path too.
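
Something along these lines would do; an untested sketch assuming the
tail of iommu_map_nosync() still does the error unwind next to the
trace_map() quoted above:

        /* sketch only: report the failing IOVA range so a quiet map
         * failure shows up in dmesg next to the GuC error */
        if (ret) {
                pr_err("iommu: map failed iova 0x%lx paddr %pa size 0x%zx err %d\n",
                       orig_iova, &orig_paddr, orig_size, ret);
                /* existing error unwind stays as-is */
                iommu_unmap(domain, orig_iova, orig_size - size);
        } else {
                trace_map(orig_iova, orig_paddr, orig_size);
        }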

46 bits is not particularly big... Hmm, I wonder if we have some issue
with the sign-extend? iommupt does that properly and IIRC the old code
did not. Which of the page table formats is this using, second stage
or first stage?
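
For clarity, this is what I mean by sign-extend (the helper and
va_bits are only for illustration, not existing code): with a first
stage style table that has va_bits valid input bits, every bit above
bit (va_bits - 1) has to be a copy of that bit:

        #include <linux/bitops.h>

        /* illustrative only: canonicalize an IOVA for a table with
         * va_bits valid address bits by replicating the top valid bit */
        static u64 canonical_iova(u64 iova, unsigned int va_bits)
        {
                return sign_extend64(iova, va_bits - 1);
        }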

Kevin/Baolu, any thoughts on the above?

Jason
