On Mon, Nov 10, 2025 at 12:06:30PM +0530, Borah, Chaitanya Kumar wrote:
> Hello Jason,
>
> Hope you are doing well. I am Chaitanya from the linux graphics team in
> Intel.
>
> This mail is regarding a regression we are seeing in our CI runs[1] on
> the linux-next repository.
>
> Since the version next-20251106 [2], we are seeing our tests timing out,
> presumably caused by a GPU Hang.
>
> `````````````````````````````````````````````````````````````````````````````````
> <6> [490.872058] i915 0000:00:02.0: [drm] Got hung context on vcs0 with
> active request 939:2 [0x1004] not yet started
> <6> [490.875244] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:baffffff
> <7> [496.424189] i915 0000:00:02.0: [drm:intel_guc_context_reset_process_msg
> [i915]] GT1: GUC: Got context reset notification: 0x1004 on vcs0, exiting =
> no, banned = no
> <6> [496.921551] i915 0000:00:02.0: [drm] Got hung context on vcs0 with
> active request 939:2 [0x1004] not yet started
> <6> [496.924799] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:baffffff
> <4> [499.946641] [IGT] Per-test timeout exceeded. Killing the current test
> with SIGQUIT.
> `````````````````````````````````````````````````````````````````````````````````
> A detailed log can be found in [3].
Chaitanya, can you check these two debugging patches:

https://github.com/jgunthorpe/linux/commits/for-borah

10635ad3ff26a0 DEBUGGING: Force flush the whole cpu cache for the page table on every map operation
2789602b882499 DEBUGGING: Force flush the whole iotlb on every map operation

Please run a test with each of them applied *individually* and report
back what changes in the test.

The "cpu cache" one may oops or something; we are just looking to see if
it gets past the error Kevin pointed to:

<7>[   67.231149] [IGT] gem_exec_gttfill: starting subtest basic
[..]
<5>[   68.824598] i915 0000:00:02.0: Using 46-bit DMA addresses
<3>[   68.825482] i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: CT: Failed to process request 6000 (-EOPNOTSUPP)

I could not test these patches, so they may not work at all.

Also, I'd like to know if this is happening 100% reproducibly or if it
is flaky.

Also, just to confirm: this is 68s after boot and right after starting
the first test, so it looks like the test is just not working at all?

I'm still interested to know if there is an iommu error that is somehow
not getting into the log.

It would also help to collect the trace points:

int iommu_map_nosync(struct iommu_domain *domain, unsigned long iova,
{
[..]
	trace_map(orig_iova, orig_paddr, orig_size);

And:

static size_t __iommu_unmap(struct iommu_domain *domain, unsigned long iova,
			    size_t size, struct iommu_iotlb_gather *iotlb_gather)
{
[..]
	trace_unmap(orig_iova, size, unmapped);

As well as some instrumentation for the IOVA involved with the above
error for request 6000.
Finally, it is interesting that this test prints this:

<5>[   68.824598] i915 0000:00:02.0: Using 46-bit DMA addresses

Which comes from here:

	if (dma_limit > DMA_BIT_MASK(32) && dev->iommu->pci_32bit_workaround) {
		iova = alloc_iova_fast(iovad, iova_len,
				       DMA_BIT_MASK(32) >> shift, false);
		if (iova)
			goto done;

		dev->iommu->pci_32bit_workaround = false;
		dev_notice(dev, "Using %d-bit DMA addresses\n", bits_per(dma_limit));
	}

Which means dma-iommu has exhausted the 32-bit pool and is now
allocating high addresses? It prints that and then immediately fails?
Seems like a clue!

Is there a failing map call, perhaps because the driver set up the
wrong IOVA range for the table? iommupt is strict about enforcing the
IOVA limitation. A failing map call might produce this outcome (though
I would expect an iommu error log).

The map traces only log on success though, so please add a print on
failure too.

46 bits is not particularly big... Hmm, I wonder if we have some issue
with the sign-extend? iommupt does that properly and IIRC the old code
did not. Which of the page table formats is this using, second stage
or first stage?

Kevin/Baolu, any thoughts on the above?

Jason
