On 11/18/2025 9:29 AM, Jason Gunthorpe wrote:
On Mon, Nov 10, 2025 at 12:06:30PM +0530, Borah, Chaitanya Kumar wrote:
Hello Jason,
Hope you are doing well. I am Chaitanya from the Linux graphics team at
Intel.
This mail is regarding a regression we are seeing in our CI runs[1] on the
linux-next repository.
Since next-20251106 [2], our tests have been timing out, presumably because
of a GPU hang.
`````````````````````````````````````````````````````````````````````````````````
<6> [490.872058] i915 0000:00:02.0: [drm] Got hung context on vcs0 with
active request 939:2 [0x1004] not yet started
<6> [490.875244] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:baffffff
<7> [496.424189] i915 0000:00:02.0: [drm:intel_guc_context_reset_process_msg
[i915]] GT1: GUC: Got context reset notification: 0x1004 on vcs0, exiting =
no, banned = no
<6> [496.921551] i915 0000:00:02.0: [drm] Got hung context on vcs0 with
active request 939:2 [0x1004] not yet started
<6> [496.924799] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:baffffff
<4> [499.946641] [IGT] Per-test timeout exceeded. Killing the current test
with SIGQUIT.
`````````````````````````````````````````````````````````````````````````````````
Detailed logs can be found in [3].
Chaitanya, can you check these two debugging patches:
https://github.com/jgunthorpe/linux/commits/for-borah
10635ad3ff26a0 DEBUGGING: Force flush the whole cpu cache for the page table on
every map operation
2789602b882499 DEBUGGING: Force flush the whole iotlb on every map operation
Please run a test with each of them applied *individually* and report
back what changes in the test. The "cpu cache" one may oops or
something, we are just looking to see if it gets past the error Kevin
pointed to:
<7>[ 67.231149] [IGT] gem_exec_gttfill: starting subtest basic
[..]
<5>[ 68.824598] i915 0000:00:02.0: Using 46-bit DMA addresses
<3>[ 68.825482] i915 0000:00:02.0: [drm] *ERROR* GT0: GUC: CT: Failed to
process request 6000 (-EOPNOTSUPP)
I could not test these patches, so they may not work at all.
I applied and tested both debugging patches separately, but the failures
persist. I also tried flushing all TLBs by adding flush_tlb_all() to the
IOMMU mapping path. That does not help either.
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 2d2f64ce2bc6..59a00235032b 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -3484,6 +3484,8 @@ static int intel_iommu_iotlb_sync_map(struct iommu_domain *domain,
 {
 	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	flush_tlb_all();
+
 	if (dmar_domain->iotlb_sync_map)
 		cache_tag_flush_range_np(dmar_domain, iova, iova + size - 1);
Thanks,
baolu