Re: [PATCH v2] drm/i915: Increase FLR timeout from 3s to 9s
On 5/24/2024 1:58 AM, Andi Shyti wrote: Following the guidelines it takes 3 seconds to perform an FLR reset. Let's give it a bit more slack because this time can change depending on the platform and on the firmware Signed-off-by: Andi Shyti Reviewed-by: Nirmoy Das --- Hi, In this second version I removed patch 2 that was ignoring the FLR reset timeouts, until we develop a proper patch. This first patch is basically the same as v1. Thanks Nirmoy for your review. Andi drivers/gpu/drm/i915/intel_uncore.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c index 729409a4bada..2eba289d88ad 100644 --- a/drivers/gpu/drm/i915/intel_uncore.c +++ b/drivers/gpu/drm/i915/intel_uncore.c @@ -2614,11 +2614,18 @@ void intel_uncore_prune_engine_fw_domains(struct intel_uncore *uncore, static void driver_initiated_flr(struct intel_uncore *uncore) { struct drm_i915_private *i915 = uncore->i915; - const unsigned int flr_timeout_ms = 3000; /* specs recommend a 3s wait */ + unsigned int flr_timeout_ms; int ret; drm_dbg(>drm, "Triggering Driver-FLR\n"); + /* +* The specification recommends a 3 seconds FLR reset timeout. To be +* cautious, we will extend this to 9 seconds, three times the specified +* timeout. +*/ + flr_timeout_ms = 9000; + /* * Make sure any pending FLR requests have cleared by waiting for the * FLR trigger bit to go to zero. Also clear GU_DEBUG's DRIVERFLR_STATUS
Re: [PATCH 2/2] drm/i915: Don't treat FLR resets as errors
Hi Andi, On 5/21/2024 12:56 PM, Andi Shyti wrote: Hi Nirmoy, On Fri, May 17, 2024 at 10:13:37PM +0200, Nirmoy Das wrote: Hi Andi, On 5/17/2024 9:34 PM, Andi Shyti wrote: Hi Nirmoy, On Fri, May 17, 2024 at 04:00:02PM +0200, Nirmoy Das wrote: On 5/17/2024 1:25 PM, Andi Shyti wrote: If we timeout while waiting for an FLR reset, there is nothing we can do and i915 doesn't have any control on it. In any case the system is still perfectly usable If a FLR reset fails then we will have a dead GPU, I don't think the GPU is usable without a cold reboot. fact is that the GPU keeps going and even though the timeout has expired, the system moves to the next phase. The current test might look like it is has passed, but if you look into the subsequent tests you can see a dead GPU: <7>[ 369.168121] pci :00:02.0: [drm:intel_uncore_fini_mmio [i915]] Triggering Driver-FLR <3>[ 372.170189] pci :00:02.0: [drm] *ERROR* Driver-FLR-teardown wait completion failed! -110 <7>[ 372.437630] [IGT] i915_selftest: finished subtest requests, SUCCESS <7>[ 372.438356] [IGT] i915_selftest: starting dynamic subtest migrate <5>[ 373.110580] Setting dangerous option live_selftests - tainting kernel <3>[ 373.183499] i915 :00:02.0: Unable to change power state from D0 to D0, device inaccessible <3>[ 373.246921] i915 :00:02.0: [drm] *ERROR* Unrecognized display IP version 1023.255; disabling display. <7>[ 373.247130] i915 :00:02.0: [drm:intel_step_init [i915]] Using future steppings <7>[ 373.247716] i915 :00:02.0: [drm:intel_step_init [i915]] Using future steppings <7>[ 373.248263] i915 :00:02.0: [drm:intel_step_init [i915]] Using future display steppings <7>[ 373.251843] i915 :00:02.0: [drm:intel_gt_common_init_early [i915]] WOPCM: 2048K <7>[ 373.252505] i915 :00:02.0: [drm:intel_uc_init_early [i915]] GT0: enable_guc=3 (guc:yes submission:yes huc:no slpc:yes) <7>[ 373.253140] i915 :00:02.0: [drm:intel_gt_probe_all [i915]] GT0: Setting up Primary GT <7>[ 373.253556] i915 :00:02.0: [drm:intel_gt_probe_all [i915]] GT1: Setting up Standalone Media GT <7>[ 373.253941] i915 :00:02.0: [drm:intel_gt_common_init_early [i915]] WOPCM: 2048K <7>[ 373.254365] i915 :00:02.0: [drm:intel_uc_init_early [i915]] GT1: enable_guc=3 (guc:yes submission:yes huc:yes slpc:yes) <3>[ 375.256235] i915 :00:02.0: [drm] *ERROR* Device is non-operational; MMIO access returns 0x! <3>[ 375.259089] i915 :00:02.0: Device initialization failed (-5) <3>[ 375.260521] i915 :00:02.0: probe with driver i915 failed with error -5 <7>[ 375.392209] [IGT] i915_selftest: finished subtest migrate, FAIL https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_14724/bat-arls-3/dmesg0.txt Are we sure this is dependent on the FLR reset? Yes, while on FLR read into memory will return either 0/F. There are cases when the FLR reset doesn't make any difference and in any case this error is completely ignored by the driver. This happens at very late with no recovery possible and hope is module reload works. Perhaps we can change it to a warning? I think it should be error. CI will still complain even on warning. This is a serious issue and should be report as an error. I think we need to create a HW ticket to understand why is FLR reset fails. Maybe it takes longer and longer to reset. We've been sending several patches in the latest years to fix the timings. HW spec says 3 sec but we can try increasing it bit higher to try it out. We could go, then, with just patch 1 and see if it improves. Does it help ? If helps then we can go ahead with increased timeout. Also because, the FLR reset might also depend on the firmware. Possible. In that case we should wait for firmware fix ? Regards, Nirmoy Thanks, Nirmoy, Andi
Re: [PATCH 2/2] drm/i915: Don't treat FLR resets as errors
Hi Andi, On 5/17/2024 9:34 PM, Andi Shyti wrote: Hi Nirmoy, On Fri, May 17, 2024 at 04:00:02PM +0200, Nirmoy Das wrote: On 5/17/2024 1:25 PM, Andi Shyti wrote: If we timeout while waiting for an FLR reset, there is nothing we can do and i915 doesn't have any control on it. In any case the system is still perfectly usable If a FLR reset fails then we will have a dead GPU, I don't think the GPU is usable without a cold reboot. fact is that the GPU keeps going and even though the timeout has expired, the system moves to the next phase. The current test might look like it is has passed, but if you look into the subsequent tests you can see a dead GPU: <7>[ 369.168121] pci :00:02.0: [drm:intel_uncore_fini_mmio [i915]] Triggering Driver-FLR *<3>[ 372.170189] pci :00:02.0: [drm] *ERROR* Driver-FLR-teardown wait completion failed! -110* *<7>[ 372.437630] [IGT] i915_selftest: finished subtest requests, SUCCESS* <7>[ 372.438356] [IGT] i915_selftest: starting dynamic subtest migrate <5>[ 373.110580] Setting dangerous option live_selftests - tainting kernel <3>[ 373.183499] i915 :00:02.0: Unable to change power state from D0 to D0, device inaccessible <3>[ 373.246921] i915 :00:02.0: [drm] *ERROR* Unrecognized display IP version 1023.255; disabling display. <7>[ 373.247130] i915 :00:02.0: [drm:intel_step_init [i915]] Using future steppings <7>[ 373.247716] i915 :00:02.0: [drm:intel_step_init [i915]] Using future steppings <7>[ 373.248263] i915 :00:02.0: [drm:intel_step_init [i915]] Using future display steppings <7>[ 373.251843] i915 :00:02.0: [drm:intel_gt_common_init_early [i915]] WOPCM: 2048K <7>[ 373.252505] i915 :00:02.0: [drm:intel_uc_init_early [i915]] GT0: enable_guc=3 (guc:yes submission:yes huc:no slpc:yes) <7>[ 373.253140] i915 :00:02.0: [drm:intel_gt_probe_all [i915]] GT0: Setting up Primary GT <7>[ 373.253556] i915 :00:02.0: [drm:intel_gt_probe_all [i915]] GT1: Setting up Standalone Media GT <7>[ 373.253941] i915 :00:02.0: [drm:intel_gt_common_init_early [i915]] WOPCM: 2048K <7>[ 373.254365] i915 :00:02.0: [drm:intel_uc_init_early [i915]] GT1: enable_guc=3 (guc:yes submission:yes huc:yes slpc:yes) *<3>[ 375.256235] i915 :00:02.0: [drm] *ERROR* Device is non-operational; MMIO access returns 0x!* <3>[ 375.259089] i915 :00:02.0: Device initialization failed (-5) <3>[ 375.260521] i915 :00:02.0: probe with driver i915 failed with error -5 <7>[ 375.392209] [IGT] i915_selftest: finished subtest migrate, FAIL https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_14724/bat-arls-3/dmesg0.txt This is a serious issue and should be report as an error. I think we need to create a HW ticket to understand why is FLR reset fails. Maybe it takes longer and longer to reset. We've been sending several patches in the latest years to fix the timings. HW spec says 3 sec but we can try increasing it bit higher to try it out. Regards, Nirmoy Andi
Re: [PATCH 2/2] drm/i915: Don't treat FLR resets as errors
Hi Andi, On 5/17/2024 1:25 PM, Andi Shyti wrote: If we timeout while waiting for an FLR reset, there is nothing we can do and i915 doesn't have any control on it. In any case the system is still perfectly usable If a FLR reset fails then we will have a dead GPU, I don't think the GPU is usable without a cold reboot. This is a serious issue and should be report as an error. I think we need to create a HW ticket to understand why is FLR reset fails. Regards, Nirmoy and the function returns void. We don't need to be alarmed, therefore, print the timeout expiration as a debug message instead of an error. Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/10955 Signed-off-by: Andi Shyti --- drivers/gpu/drm/i915/intel_uncore.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c index 2eba289d88ad..a3fa2ed91aae 100644 --- a/drivers/gpu/drm/i915/intel_uncore.c +++ b/drivers/gpu/drm/i915/intel_uncore.c @@ -2637,7 +2637,7 @@ static void driver_initiated_flr(struct intel_uncore *uncore) */ ret = intel_wait_for_register_fw(uncore, GU_CNTL, DRIVERFLR, 0, flr_timeout_ms); if (ret) { - drm_err(>drm, + drm_dbg(>drm, "Failed to wait for Driver-FLR bit to clear! %d\n", ret); return; @@ -2652,7 +2652,7 @@ static void driver_initiated_flr(struct intel_uncore *uncore) DRIVERFLR, 0, flr_timeout_ms); if (ret) { - drm_err(>drm, "Driver-FLR-teardown wait completion failed! %d\n", ret); + drm_dbg(>drm, "Driver-FLR-teardown wait completion failed! %d\n", ret); return; } @@ -2661,7 +2661,7 @@ static void driver_initiated_flr(struct intel_uncore *uncore) DRIVERFLR_STATUS, DRIVERFLR_STATUS, flr_timeout_ms); if (ret) { - drm_err(>drm, "Driver-FLR-reinit wait completion failed! %d\n", ret); + drm_dbg(>drm, "Driver-FLR-reinit wait completion failed! %d\n", ret); return; }
Re: [PATCH] drm/i915/selftests: Set always_coherent to false when reading from CPU
On 5/17/2024 1:53 PM, Jani Nikula wrote: On Fri, 17 May 2024, Nirmoy Das wrote: Hi Jani, On 5/17/2024 9:39 AM, Jani Nikula wrote: On Thu, 16 May 2024, Nirmoy Das wrote: The previous commit 'commit 8d4ba9fc1c6c ("drm/i915/selftests: Pick "previous commit" is a fairly vague reference once this gets committed. It's not going to be "previous" in any meaningful sense. Please just start with: Commit 8d4ba9fc1c6c ("drm/i915/selftests: Pick correct caching mode.") was not complete... Will do that. And probably add: Fixes: 8d4ba9fc1c6c ("drm/i915/selftests: Pick correct caching mode.") Do we need Fixes for selftest ? I always assumed it is not required as this code is for debug/CI Maybe not for stuff that's already in stable, but we do run CI on drm-next and -rc kernels, and if this causes issues there, why not have them fixed? Not sure a commit with Fixes flows from drm-intel-next to drm-next/-rc but I see no issue adding Fixes without CC-ing to stable. Pushed it to drm-intel-next with above modifications. b4-shazam picked Fixes as well which was nice. Thanks, Nirmoy BR, Jani. Thanks, Nirmoy BR, Jani. correct caching mode.")' was not complete as for non LLC sharing platforms cpu read can happen from LLC which probably doesn't have the latest changes made by GPU. Cc: Andi Shyti Cc: Janusz Krzysztofik Cc: Jonathan Cavitt Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c index 65a931ea80e9..3527b8f446fe 100644 --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c @@ -196,7 +196,7 @@ static int verify_access(struct drm_i915_private *i915, if (err) goto out_file; - mode = intel_gt_coherent_map_type(to_gt(i915), native_obj, true); + mode = intel_gt_coherent_map_type(to_gt(i915), native_obj, false); vaddr = i915_gem_object_pin_map_unlocked(native_obj, mode); if (IS_ERR(vaddr)) { err = PTR_ERR(vaddr);
Re: [PATCH] drm/i915/selftests: Set always_coherent to false when reading from CPU
Hi Jani, On 5/17/2024 9:39 AM, Jani Nikula wrote: On Thu, 16 May 2024, Nirmoy Das wrote: The previous commit 'commit 8d4ba9fc1c6c ("drm/i915/selftests: Pick "previous commit" is a fairly vague reference once this gets committed. It's not going to be "previous" in any meaningful sense. Please just start with: Commit 8d4ba9fc1c6c ("drm/i915/selftests: Pick correct caching mode.") was not complete... Will do that. And probably add: Fixes: 8d4ba9fc1c6c ("drm/i915/selftests: Pick correct caching mode.") Do we need Fixes for selftest ? I always assumed it is not required as this code is for debug/CI Thanks, Nirmoy BR, Jani. correct caching mode.")' was not complete as for non LLC sharing platforms cpu read can happen from LLC which probably doesn't have the latest changes made by GPU. Cc: Andi Shyti Cc: Janusz Krzysztofik Cc: Jonathan Cavitt Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c index 65a931ea80e9..3527b8f446fe 100644 --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c @@ -196,7 +196,7 @@ static int verify_access(struct drm_i915_private *i915, if (err) goto out_file; - mode = intel_gt_coherent_map_type(to_gt(i915), native_obj, true); + mode = intel_gt_coherent_map_type(to_gt(i915), native_obj, false); vaddr = i915_gem_object_pin_map_unlocked(native_obj, mode); if (IS_ERR(vaddr)) { err = PTR_ERR(vaddr);
[PATCH] drm/i915/selftests: Set always_coherent to false when reading from CPU
The previous commit 'commit 8d4ba9fc1c6c ("drm/i915/selftests: Pick correct caching mode.")' was not complete as for non LLC sharing platforms cpu read can happen from LLC which probably doesn't have the latest changes made by GPU. Cc: Andi Shyti Cc: Janusz Krzysztofik Cc: Jonathan Cavitt Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c index 65a931ea80e9..3527b8f446fe 100644 --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c @@ -196,7 +196,7 @@ static int verify_access(struct drm_i915_private *i915, if (err) goto out_file; - mode = intel_gt_coherent_map_type(to_gt(i915), native_obj, true); + mode = intel_gt_coherent_map_type(to_gt(i915), native_obj, false); vaddr = i915_gem_object_pin_map_unlocked(native_obj, mode); if (IS_ERR(vaddr)) { err = PTR_ERR(vaddr); -- 2.42.0
[PATCH] drm/i915: Use for_each_child instead of manual for-loop
Simplify child iteration using for_each_child macro instead of using manual for loop. There is no functional change. Cc: John Harrison Cc: Tvrtko Ursulin Signed-off-by: Nirmoy Das --- .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 64 ++- 1 file changed, 33 insertions(+), 31 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index 0eaa1064242c..7e88d90e935b 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -1800,14 +1800,37 @@ __unwind_incomplete_requests(struct intel_context *ce) spin_unlock_irqrestore(_engine->lock, flags); } -static void __guc_reset_context(struct intel_context *ce, intel_engine_mask_t stalled) +static void guc_reset_context_state(struct intel_context *ce, intel_engine_mask_t stalled) { - bool guilty; struct i915_request *rq; - unsigned long flags; + bool guilty = false; u32 head; - int i, number_children = ce->parallel.number_children; - struct intel_context *parent = ce; + + if (!intel_context_is_pinned(ce)) + return; + + rq = intel_context_get_active_request(ce); + if (!rq) { + head = ce->ring->tail; + goto out_replay; + } + + if (i915_request_started(rq)) + guilty = stalled & ce->engine->mask; + + GEM_BUG_ON(i915_active_is_idle(>active)); + head = intel_ring_wrap(ce->ring, rq->head); + + __i915_request_reset(rq, guilty); + i915_request_put(rq); +out_replay: + guc_reset_state(ce, head, guilty); +} + +static void __guc_reset_context(struct intel_context *ce, intel_engine_mask_t stalled) +{ + struct intel_context *child; + unsigned long flags; GEM_BUG_ON(intel_context_is_child(ce)); @@ -1826,34 +1849,13 @@ static void __guc_reset_context(struct intel_context *ce, intel_engine_mask_t st * For each context in the relationship find the hanging request * resetting each context / request as needed */ - for (i = 0; i < number_children + 1; ++i) { - if (!intel_context_is_pinned(ce)) - goto next_context; - - guilty = false; - rq = intel_context_get_active_request(ce); - if (!rq) { - head = ce->ring->tail; - goto out_replay; - } - - if (i915_request_started(rq)) - guilty = stalled & ce->engine->mask; - - GEM_BUG_ON(i915_active_is_idle(>active)); - head = intel_ring_wrap(ce->ring, rq->head); - - __i915_request_reset(rq, guilty); - i915_request_put(rq); -out_replay: - guc_reset_state(ce, head, guilty); -next_context: - if (i != number_children) - ce = list_next_entry(ce, parallel.child_link); + guc_reset_context_state(ce, stalled); + for_each_child(ce, child) { + guc_reset_context_state(child, stalled); } - __unwind_incomplete_requests(parent); - intel_context_put(parent); + __unwind_incomplete_requests(ce); + intel_context_put(ce); } void wake_up_all_tlb_invalidate(struct intel_guc *guc) -- 2.42.0
Re: [PATCH] drm/i915: Correct error handler
On 5/11/2024 5:48 PM, Jiasheng Jiang wrote: Replace "slab_priorities" with "slab_dependencies" in the error handler to avoid memory leak. Nice catch. I would make the subject more like: drm/i915: Fix memory leak by correcting cache object name in error handler Fixes: 32eb6bcfdda9 ("drm/i915: Make request allocation caches global") Also need Cc: # v5.2+ With those: Reviewed-by: Nirmoy Das Nirmoy Signed-off-by: Jiasheng Jiang --- drivers/gpu/drm/i915/i915_scheduler.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c index 762127dd56c5..70a854557e6e 100644 --- a/drivers/gpu/drm/i915/i915_scheduler.c +++ b/drivers/gpu/drm/i915/i915_scheduler.c @@ -506,6 +506,6 @@ int __init i915_scheduler_module_init(void) return 0; err_priorities: - kmem_cache_destroy(slab_priorities); + kmem_cache_destroy(slab_dependencies); return -ENOMEM; }
Re: [PATCH] Revert "drm/i915: Remove extra multi-gt pm-references"
On 5/7/2024 7:10 PM, Rodrigo Vivi wrote: On Tue, May 07, 2024 at 10:54:11AM +0200, Janusz Krzysztofik wrote: On Tuesday, 7 May 2024 09:30:15 GMT+2 Nirmoy Das wrote: Hi Janusz, Just realized we need Fixes tag for this. Fixes: 1f33dc0c1189 ("drm/i915: Remove extra multi-gt pm-references") Whoever is going to push this patch, please feel free to add this tag. dim b4-shazam gets that automagically, now it was sent in reply ;) Nice! I just pushed the patch. thanks for the patch and reviews. Thanks, Nirmoy Thanks, Janusz Regards, Nirmoy On 5/6/2024 8:02 PM, Janusz Krzysztofik wrote: This reverts commit 1f33dc0c1189efb9ae19c6fc22b64dd3e26261fb. There was a patch supposed to fix an issue of illegal attempts to free a still active i915 VMA object when parking a GT believed to be idle, reported by CI on 2-GT Meteor Lake. As a solution, an extra wakeref for a Primary GT was acquired from i915_gem_do_execbuffer() -- see commit f56fe3e91787 ("drm/i915: Fix a VMA UAF for multi-gt platform"). However, that fix occurred insufficient -- the issue was still reported by CI. That wakeref was released on exit from i915_gem_do_execbuffer(), then potentially before completion of the request and deactivation of its associated VMAs. Moreover, CI reports indicated that single-GT platforms also suffered sporadically from the same race. Since that issue was fixed by another commit f3c71b2ded5c ("drm/i915/vma: Fix UAF on destroy against retire race"), the changes introduced by that insufficient fix were dropped as no longer useful. However, that series resulted in another VMA UAF scenario now being triggered in CI. <4> [260.290809] [ cut here ] <4> [260.290988] list_del corruption. prev->next should be 888118c5d990, but was 888118c5a510. (prev=888118c5a510) <4> [260.291004] WARNING: CPU: 2 PID: 1143 at lib/list_debug.c:62 __list_del_entry_valid_or_report+0xb7/0xe0 .. <4> [260.291055] CPU: 2 PID: 1143 Comm: kms_plane Not tainted 6.9.0-rc2-CI_DRM_14524-ga25d180c6853+ #1 <4> [260.291058] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P LP5x T3 RVP, BIOS MTLPFWI1.R00.3471.D91.2401310918 01/31/2024 <4> [260.291060] RIP: 0010:__list_del_entry_valid_or_report+0xb7/0xe0 ... <4> [260.291087] Call Trace: <4> [260.291089] <4> [260.291124] i915_vma_reopen+0x43/0x80 [i915] <4> [260.291298] eb_lookup_vmas+0x9cb/0xcc0 [i915] <4> [260.291579] i915_gem_do_execbuffer+0xc9a/0x26d0 [i915] <4> [260.291883] i915_gem_execbuffer2_ioctl+0x123/0x2a0 [i915] ... <4> [260.292301] ... <4> [260.292506] ---[ end trace ]--- <4> [260.292782] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6ca3: [#1] PREEMPT SMP NOPTI <4> [260.303575] CPU: 2 PID: 1143 Comm: kms_plane Tainted: GW 6.9.0-rc2-CI_DRM_14524-ga25d180c6853+ #1 <4> [260.313851] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P LP5x T3 RVP, BIOS MTLPFWI1.R00.3471.D91.2401310918 01/31/2024 <4> [260.326359] RIP: 0010:eb_validate_vmas+0x114/0xd80 [i915] ... <4> [260.428756] Call Trace: <4> [260.431192] <4> [639.283393] i915_gem_do_execbuffer+0xd05/0x26d0 [i915] <4> [639.305245] i915_gem_execbuffer2_ioctl+0x123/0x2a0 [i915] ... <4> [639.411134] ... <4> [639.449979] ---[ end trace ]--- We defer actually closing, unbinding and destroying a VMA until next idle point, or until the object is freed in the meantime. By postponing the unbind, we allow for the VMA to be reopened by the client, avoiding the work required to rebind the VMA. Starting from commit b0647a5e79b1 ("drm/i915: Avoid live-lock with i915_vma_parked()"), we assume that as long as a GT is held idle, no VMA would be reopened while we destroy them. That assumption is no longer true in multi-GT configurations, where a VMA we reopen may be handled by a GT different from the one that we already keep active via its engine while we set up an execbuf request. Restoring the extra GT0 PM wakeref removed from i915_gem_do_execbuffer() processing path seems to fix this issue. Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/10608 Signed-off-by: Janusz Krzysztofik Cc: Rodrigo Vivi Cc: Nirmoy Das --- drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c index 42619fc05de48..090724fa766c9 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c @@ -255,6 +255,7 @@ struct i915_execbuffer { struct intel_context *context; /* logical state for the request */ struct i915_gem_context *gem_context; /** caller's context */
Re: [PATCH] Revert "drm/i915: Remove extra multi-gt pm-references"
Hi Janusz, Just realized we need Fixes tag for this. Fixes: 1f33dc0c1189 ("drm/i915: Remove extra multi-gt pm-references") Regards, Nirmoy On 5/6/2024 8:02 PM, Janusz Krzysztofik wrote: This reverts commit 1f33dc0c1189efb9ae19c6fc22b64dd3e26261fb. There was a patch supposed to fix an issue of illegal attempts to free a still active i915 VMA object when parking a GT believed to be idle, reported by CI on 2-GT Meteor Lake. As a solution, an extra wakeref for a Primary GT was acquired from i915_gem_do_execbuffer() -- see commit f56fe3e91787 ("drm/i915: Fix a VMA UAF for multi-gt platform"). However, that fix occurred insufficient -- the issue was still reported by CI. That wakeref was released on exit from i915_gem_do_execbuffer(), then potentially before completion of the request and deactivation of its associated VMAs. Moreover, CI reports indicated that single-GT platforms also suffered sporadically from the same race. Since that issue was fixed by another commit f3c71b2ded5c ("drm/i915/vma: Fix UAF on destroy against retire race"), the changes introduced by that insufficient fix were dropped as no longer useful. However, that series resulted in another VMA UAF scenario now being triggered in CI. <4> [260.290809] [ cut here ] <4> [260.290988] list_del corruption. prev->next should be 888118c5d990, but was 888118c5a510. (prev=888118c5a510) <4> [260.291004] WARNING: CPU: 2 PID: 1143 at lib/list_debug.c:62 __list_del_entry_valid_or_report+0xb7/0xe0 .. <4> [260.291055] CPU: 2 PID: 1143 Comm: kms_plane Not tainted 6.9.0-rc2-CI_DRM_14524-ga25d180c6853+ #1 <4> [260.291058] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P LP5x T3 RVP, BIOS MTLPFWI1.R00.3471.D91.2401310918 01/31/2024 <4> [260.291060] RIP: 0010:__list_del_entry_valid_or_report+0xb7/0xe0 ... <4> [260.291087] Call Trace: <4> [260.291089] <4> [260.291124] i915_vma_reopen+0x43/0x80 [i915] <4> [260.291298] eb_lookup_vmas+0x9cb/0xcc0 [i915] <4> [260.291579] i915_gem_do_execbuffer+0xc9a/0x26d0 [i915] <4> [260.291883] i915_gem_execbuffer2_ioctl+0x123/0x2a0 [i915] ... <4> [260.292301] ... <4> [260.292506] ---[ end trace ]--- <4> [260.292782] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6ca3: [#1] PREEMPT SMP NOPTI <4> [260.303575] CPU: 2 PID: 1143 Comm: kms_plane Tainted: GW 6.9.0-rc2-CI_DRM_14524-ga25d180c6853+ #1 <4> [260.313851] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P LP5x T3 RVP, BIOS MTLPFWI1.R00.3471.D91.2401310918 01/31/2024 <4> [260.326359] RIP: 0010:eb_validate_vmas+0x114/0xd80 [i915] ... <4> [260.428756] Call Trace: <4> [260.431192] <4> [639.283393] i915_gem_do_execbuffer+0xd05/0x26d0 [i915] <4> [639.305245] i915_gem_execbuffer2_ioctl+0x123/0x2a0 [i915] ... <4> [639.411134] ... <4> [639.449979] ---[ end trace ]--- We defer actually closing, unbinding and destroying a VMA until next idle point, or until the object is freed in the meantime. By postponing the unbind, we allow for the VMA to be reopened by the client, avoiding the work required to rebind the VMA. Starting from commit b0647a5e79b1 ("drm/i915: Avoid live-lock with i915_vma_parked()"), we assume that as long as a GT is held idle, no VMA would be reopened while we destroy them. That assumption is no longer true in multi-GT configurations, where a VMA we reopen may be handled by a GT different from the one that we already keep active via its engine while we set up an execbuf request. Restoring the extra GT0 PM wakeref removed from i915_gem_do_execbuffer() processing path seems to fix this issue. Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/10608 Signed-off-by: Janusz Krzysztofik Cc: Rodrigo Vivi Cc: Nirmoy Das --- drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c index 42619fc05de48..090724fa766c9 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c @@ -255,6 +255,7 @@ struct i915_execbuffer { struct intel_context *context; /* logical state for the request */ struct i915_gem_context *gem_context; /** caller's context */ intel_wakeref_t wakeref; + intel_wakeref_t wakeref_gt0; /** our requests to build */ struct i915_request *requests[MAX_ENGINE_INSTANCE + 1]; @@ -2685,6 +2686,7 @@ static int eb_select_engine(struct i915_execbuffer *eb) { struct intel_context *ce, *child; + struct intel_gt *gt; unsigned int idx; int err; @@ -2708,10 +2710,17 @@ eb_sel
Re: [PATCH] Revert "drm/i915: Remove extra multi-gt pm-references"
On 5/6/2024 8:02 PM, Janusz Krzysztofik wrote: This reverts commit 1f33dc0c1189efb9ae19c6fc22b64dd3e26261fb. There was a patch supposed to fix an issue of illegal attempts to free a still active i915 VMA object when parking a GT believed to be idle, reported by CI on 2-GT Meteor Lake. As a solution, an extra wakeref for a Primary GT was acquired from i915_gem_do_execbuffer() -- see commit f56fe3e91787 ("drm/i915: Fix a VMA UAF for multi-gt platform"). However, that fix occurred insufficient -- the issue was still reported by CI. That wakeref was released on exit from i915_gem_do_execbuffer(), then potentially before completion of the request and deactivation of its associated VMAs. Moreover, CI reports indicated that single-GT platforms also suffered sporadically from the same race. Since that issue was fixed by another commit f3c71b2ded5c ("drm/i915/vma: Fix UAF on destroy against retire race"), the changes introduced by that insufficient fix were dropped as no longer useful. However, that series resulted in another VMA UAF scenario now being triggered in CI. <4> [260.290809] [ cut here ] <4> [260.290988] list_del corruption. prev->next should be 888118c5d990, but was 888118c5a510. (prev=888118c5a510) <4> [260.291004] WARNING: CPU: 2 PID: 1143 at lib/list_debug.c:62 __list_del_entry_valid_or_report+0xb7/0xe0 .. <4> [260.291055] CPU: 2 PID: 1143 Comm: kms_plane Not tainted 6.9.0-rc2-CI_DRM_14524-ga25d180c6853+ #1 <4> [260.291058] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P LP5x T3 RVP, BIOS MTLPFWI1.R00.3471.D91.2401310918 01/31/2024 <4> [260.291060] RIP: 0010:__list_del_entry_valid_or_report+0xb7/0xe0 ... <4> [260.291087] Call Trace: <4> [260.291089] <4> [260.291124] i915_vma_reopen+0x43/0x80 [i915] <4> [260.291298] eb_lookup_vmas+0x9cb/0xcc0 [i915] <4> [260.291579] i915_gem_do_execbuffer+0xc9a/0x26d0 [i915] <4> [260.291883] i915_gem_execbuffer2_ioctl+0x123/0x2a0 [i915] ... <4> [260.292301] ... <4> [260.292506] ---[ end trace ]--- <4> [260.292782] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6ca3: [#1] PREEMPT SMP NOPTI <4> [260.303575] CPU: 2 PID: 1143 Comm: kms_plane Tainted: GW 6.9.0-rc2-CI_DRM_14524-ga25d180c6853+ #1 <4> [260.313851] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P LP5x T3 RVP, BIOS MTLPFWI1.R00.3471.D91.2401310918 01/31/2024 <4> [260.326359] RIP: 0010:eb_validate_vmas+0x114/0xd80 [i915] ... <4> [260.428756] Call Trace: <4> [260.431192] <4> [639.283393] i915_gem_do_execbuffer+0xd05/0x26d0 [i915] <4> [639.305245] i915_gem_execbuffer2_ioctl+0x123/0x2a0 [i915] ... <4> [639.411134] ... <4> [639.449979] ---[ end trace ]--- We defer actually closing, unbinding and destroying a VMA until next idle point, or until the object is freed in the meantime. By postponing the unbind, we allow for the VMA to be reopened by the client, avoiding the work required to rebind the VMA. Starting from commit b0647a5e79b1 ("drm/i915: Avoid live-lock with i915_vma_parked()"), we assume that as long as a GT is held idle, no VMA would be reopened while we destroy them. That assumption is no longer true in multi-GT configurations, where a VMA we reopen may be handled by a GT different from the one that we already keep active via its engine while we set up an execbuf request. Restoring the extra GT0 PM wakeref removed from i915_gem_do_execbuffer() processing path seems to fix this issue. Closes:https://gitlab.freedesktop.org/drm/intel/-/issues/10608 Signed-off-by: Janusz Krzysztofik Cc: Rodrigo Vivi Cc: Nirmoy Das Reviewed-by: Nirmoy Das --- drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c index 42619fc05de48..090724fa766c9 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c @@ -255,6 +255,7 @@ struct i915_execbuffer { struct intel_context *context; /* logical state for the request */ struct i915_gem_context *gem_context; /** caller's context */ intel_wakeref_t wakeref; + intel_wakeref_t wakeref_gt0; /** our requests to build */ struct i915_request *requests[MAX_ENGINE_INSTANCE + 1]; @@ -2685,6 +2686,7 @@ static int eb_select_engine(struct i915_execbuffer *eb) { struct intel_context *ce, *child; + struct intel_gt *gt; unsigned int idx; int err; @@ -2708,10 +2710,17 @@ eb_select_engine(struct i915_execbuffer *eb) } } eb->num_batches = ce->parallel.number_children + 1; + gt = ce->eng
Re: [PATCH v3] drm/i915/vma: Fix UAF on reopen vs destroy race
aring after I reverted that workaround. However, its effectiveness is limited to MTL topology. perhaps the safer path for this case indeed. something that could be really limited to a single platform would be better. I agree with Rodrigo here. it would be safe revert the mentioned patch now and think about more robust solution later on as the issue is effecting current user. Regards, Nirmoy But I confess that I don't have other better suggestions. If we need to go with this patch as a quick solution, it is apparently better than leaving the bug there as is. +Thomas. any good thoughts there or advices? Thanks, Rodrigo. Thanks, Janusz Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/10608 Signed-off-by: Janusz Krzysztofik Cc: Chris Wilson Cc: Tvrtko Ursulin Cc: sta...@vger.kernel.org # v6.0+ --- .../gpu/drm/i915/gem/i915_gem_execbuffer.c| 10 -- drivers/gpu/drm/i915/i915_vma.c | 32 +++ drivers/gpu/drm/i915/i915_vma.h | 2 +- drivers/gpu/drm/i915/i915_vma_types.h | 3 ++ 4 files changed, 37 insertions(+), 10 deletions(-) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c index 42619fc05de48..97e014f94002e 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c @@ -847,9 +847,12 @@ static int __eb_add_lut(struct i915_execbuffer *eb, if (unlikely(!lut)) return -ENOMEM; + if (!i915_vma_open(vma)) { + err = -EEXIST; /* let eb_vma_lookup() retry */ + goto err_lut_free; + } + i915_vma_get(vma); - if (!atomic_fetch_inc(>open_count)) - i915_vma_reopen(vma); lut->handle = handle; lut->ctx = ctx; @@ -880,8 +883,9 @@ static int __eb_add_lut(struct i915_execbuffer *eb, return 0; err: - i915_vma_close(vma); i915_vma_put(vma); + i915_vma_close(vma); +err_lut_free: i915_lut_handle_free(lut); return err; } diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c index d2f064d2525cc..4435c76f28c8c 100644 --- a/drivers/gpu/drm/i915/i915_vma.c +++ b/drivers/gpu/drm/i915/i915_vma.c @@ -1735,14 +1735,33 @@ static void __i915_vma_remove_closed(struct i915_vma *vma) list_del_init(>closed_link); } -void i915_vma_reopen(struct i915_vma *vma) +static struct i915_vma *i915_vma_reopen(struct i915_vma *vma) +{ + if (atomic_read(>flags) & I915_VMA_PARKED) + return NULL; + + __i915_vma_remove_closed(vma); + return vma; +} + +struct i915_vma *i915_vma_open(struct i915_vma *vma) { struct intel_gt *gt = vma->vm->gt; + if (atomic_inc_not_zero(>open_count)) + return vma; + spin_lock_irq(>closed_lock); - if (i915_vma_is_closed(vma)) - __i915_vma_remove_closed(vma); + if (!atomic_inc_not_zero(>open_count)) { + if (i915_vma_is_closed(vma)) + vma = i915_vma_reopen(vma); + + if (vma) + atomic_inc(>open_count); + } spin_unlock_irq(>closed_lock); + + return vma; } static void force_unbind(struct i915_vma *vma) @@ -1770,7 +1789,8 @@ static void release_references(struct i915_vma *vma, struct intel_gt *gt, spin_unlock(>vma.lock); spin_lock_irq(>closed_lock); - __i915_vma_remove_closed(vma); + if (!(atomic_read(>flags) & I915_VMA_PARKED)) + __i915_vma_remove_closed(vma); spin_unlock_irq(>closed_lock); if (vm_ddestroy) @@ -1854,22 +1874,22 @@ void i915_vma_parked(struct intel_gt *gt) } list_move(>closed_link, ); + atomic_or(I915_VMA_PARKED, >flags); } spin_unlock_irq(>closed_lock); - /* As the GT is held idle, no vma can be reopened as we destroy them */ list_for_each_entry_safe(vma, next, , closed_link) { struct drm_i915_gem_object *obj = vma->obj; struct i915_address_space *vm = vma->vm; if (i915_gem_object_trylock(obj, NULL)) { - INIT_LIST_HEAD(>closed_link); i915_vma_destroy(vma); i915_gem_object_unlock(obj); } else { /* back you go.. */ spin_lock_irq(>closed_lock); list_add(>closed_link, >closed_vma); + atomic_andnot(I915_VMA_PARKED, >flags); spin_unlock_irq(>closed_lock); } diff --git a/drivers/gpu/drm/i915/i915_vma.h b/drivers/gpu/drm/i915/i915_vma.h index e356dfb883d34..331d19672c764 100644 --- a/drivers/gpu/drm/i915/i915_vma.h +++ b/drivers/gpu/drm/i915/i915_vma.h @@ -268,7 +268,7 @
Re: [PATCH] drm/i915/gt: Disarm breadcrumbs if engines are already idle
On 4/23/2024 6:23 PM, Janusz Krzysztofik wrote: From: Chris Wilson The breadcrumbs use a GT wakeref for guarding the interrupt, but are disarmed during release of the engine wakeref. This leaves a hole where we may attach a breadcrumb just as the engine is parking (after it has parked its breadcrumbs), execute the irq worker with some signalers still attached, but never be woken again. That issue manifests itself in CI with IGT runner timeouts while tests are waiting indefinitely for release of all GT wakerefs. <6> [209.151778] i915: Running live_engine_pm_selftests/live_engine_busy_stats <7> [209.231628] i915 :00:02.0: [drm:intel_power_well_disable [i915]] disabling PW_5 <7> [209.231816] i915 :00:02.0: [drm:intel_power_well_disable [i915]] disabling PW_4 <7> [209.231944] i915 :00:02.0: [drm:intel_power_well_disable [i915]] disabling PW_3 <7> [209.232056] i915 :00:02.0: [drm:intel_power_well_disable [i915]] disabling PW_2 <7> [209.232166] i915 :00:02.0: [drm:intel_power_well_disable [i915]] disabling DC_off <7> [209.232270] i915 :00:02.0: [drm:skl_enable_dc6 [i915]] Enabling DC6 <7> [209.232368] i915 :00:02.0: [drm:gen9_set_dc_state.part.0 [i915]] Setting DC state from 00 to 02 <4> [299.356116] [IGT] Inactivity timeout exceeded. Killing the current test with SIGQUIT. ... <6> [299.356526] sysrq: Show State ... <6> [299.373964] task:i915_selftest state:D stack:11784 pid:5578 tgid:5578 ppid:873flags:0x4002 <6> [299.373967] Call Trace: <6> [299.373968] <6> [299.373970] __schedule+0x3bb/0xda0 <6> [299.373974] schedule+0x41/0x110 <6> [299.373976] intel_wakeref_wait_for_idle+0x82/0x100 [i915] <6> [299.374083] ? __pfx_var_wake_function+0x10/0x10 <6> [299.374087] live_engine_busy_stats+0x9b/0x500 [i915] <6> [299.374173] __i915_subtests+0xbe/0x240 [i915] <6> [299.374277] ? __pfx___intel_gt_live_setup+0x10/0x10 [i915] <6> [299.374369] ? __pfx___intel_gt_live_teardown+0x10/0x10 [i915] <6> [299.374456] intel_engine_live_selftests+0x1c/0x30 [i915] <6> [299.374547] __run_selftests+0xbb/0x190 [i915] <6> [299.374635] i915_live_selftests+0x4b/0x90 [i915] <6> [299.374717] i915_pci_probe+0x10d/0x210 [i915] At the end of the interrupt worker, if there are no more engines awake, disarm the breadcrumb and go to sleep. Fixes: 9d5612ca165a ("drm/i915/gt: Defer enabling the breadcrumb interrupt to after submission") Closes: https://gitlab.freedesktop.org/drm/intel/issues/10026 Signed-off-by: Chris Wilson Cc: Andrzej Hajda Cc: # v5.12+ Signed-off-by: Janusz Krzysztofik Acked-by: Nirmoy Das I will let others/Andrzej r-b this as I am not very familiar with the code. Thanks, Nirmoy --- drivers/gpu/drm/i915/gt/intel_breadcrumbs.c | 15 +++ 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c b/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c index d650beb8ed22f..20b9b04ec1e0b 100644 --- a/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c +++ b/drivers/gpu/drm/i915/gt/intel_breadcrumbs.c @@ -263,8 +263,13 @@ static void signal_irq_work(struct irq_work *work) i915_request_put(rq); } + /* Lazy irq enabling after HW submission */ if (!READ_ONCE(b->irq_armed) && !list_empty(>signalers)) intel_breadcrumbs_arm_irq(b); + + /* And confirm that we still want irqs enabled before we yield */ + if (READ_ONCE(b->irq_armed) && !atomic_read(>active)) + intel_breadcrumbs_disarm_irq(b); } struct intel_breadcrumbs * @@ -315,13 +320,7 @@ void __intel_breadcrumbs_park(struct intel_breadcrumbs *b) return; /* Kick the work once more to drain the signalers, and disarm the irq */ - irq_work_sync(>irq_work); - while (READ_ONCE(b->irq_armed) && !atomic_read(>active)) { - local_irq_disable(); - signal_irq_work(>irq_work); - local_irq_enable(); - cond_resched(); - } + irq_work_queue(>irq_work); } void intel_breadcrumbs_free(struct kref *kref) @@ -404,7 +403,7 @@ static void insert_breadcrumb(struct i915_request *rq) * the request as it may have completed and raised the interrupt as * we were attaching it into the lists. */ - if (!b->irq_armed || __i915_request_is_complete(rq)) + if (!READ_ONCE(b->irq_armed) || __i915_request_is_complete(rq)) irq_work_queue(>irq_work); }
Re: [PATCH v2 2/2] drm/i915: Fix gt reset with GuC submission is disabled
Hi Andi, On 4/23/2024 11:32 AM, Andi Shyti wrote: Hi Nirmoy, On Mon, Apr 22, 2024 at 10:19:51PM +0200, Nirmoy Das wrote: Currently intel_gt_reset() kills the GuC and then resets requested engines. This is problematic because there is a dedicated CSB FIFO which only GuC can access and if that FIFO fills up, the hardware will block on the next context switch until there is space that means the system is effectively hung. If an engine is reset whilst actively executing a context, a CSB entry will be sent to say that the context has gone idle. Thus if reset happens on a very busy system then killing GuC before killing the engines will lead to deadlock because of filled up CSB FIFO. is this a fix? I went quite far back in the commit logs, and it appears to me that we've always been using the current reset flow. I believe we don't perform a GT reset immediately after sending a number of requests, which is what the current failed test is doing. So, I don't think there will be any visible impact on the user with the current flow. To address this issue, the GuC should be killed only after resetting the requested engines and before calling intel_gt_init_hw(). v2: Improve commit message(John) Cc: John Harrison Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_reset.c | 16 ++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c index b1393863ca9b..6161f7a3ff70 100644 --- a/drivers/gpu/drm/i915/gt/intel_reset.c +++ b/drivers/gpu/drm/i915/gt/intel_reset.c @@ -879,8 +879,17 @@ static intel_engine_mask_t reset_prepare(struct intel_gt *gt) intel_engine_mask_t awake = 0; enum intel_engine_id id; - /* For GuC mode, ensure submission is disabled before stopping ring */ - intel_uc_reset_prepare(>uc); + /** +* For GuC mode with submission enabled, ensure submission +* is disabled before stopping ring. nit: "stopping *the* ring" Will fix it while merging if I don't have to resend this again. +* +* For GuC mode with submission disabled, ensure that GuC is not +* sanitized, do that after engine reset. reset_prepare() +* is followed by engine reset which in this mode requires GuC to +* process any CSB FIFO entries generated by the resets. +*/ + if (intel_uc_uses_guc_submission(>uc)) + intel_uc_reset_prepare(>uc); for_each_engine(engine, gt, id) { if (intel_engine_pm_get_if_awake(engine)) @@ -1227,6 +1236,9 @@ void intel_gt_reset(struct intel_gt *gt, intel_overlay_reset(gt->i915); + /* sanitize uC after engine reset */ + if (!intel_uc_uses_guc_submission(>uc)) + intel_uc_reset_prepare(>uc); Reviewed-by: Andi Shyti Thanks, Nirmoy Thanks, Andi
Re: [PATCH] drm/i915/gt: Refactor uabi engine class/instance list creation
Hi Andi, On 4/17/2024 12:49 AM, Andi Shyti wrote: For the upcoming changes we need a cleaner way to build the list of uabi engines. Suggested-by: Tvrtko Ursulin Signed-off-by: Andi Shyti --- Hi, just sending this patch to unburden the coming series from this single patch inherited from a previously sent series. Andi drivers/gpu/drm/i915/gt/intel_engine_user.c | 29 - 1 file changed, 17 insertions(+), 12 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_engine_user.c b/drivers/gpu/drm/i915/gt/intel_engine_user.c index 833987015b8b..11cc06c0c785 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_user.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_user.c @@ -203,7 +203,7 @@ static void engine_rename(struct intel_engine_cs *engine, const char *name, u16 void intel_engines_driver_register(struct drm_i915_private *i915) { - u16 name_instance, other_instance = 0; + u16 class_instance[I915_LAST_UABI_ENGINE_CLASS + 2] = { }; +2 is confusing here. I think we need a better macro. struct legacy_ring ring = {}; struct list_head *it, *next; struct rb_node **p, *prev; @@ -214,6 +214,8 @@ void intel_engines_driver_register(struct drm_i915_private *i915) prev = NULL; p = >uabi_engines.rb_node; list_for_each_safe(it, next, ) { + u16 uabi_class; + struct intel_engine_cs *engine = container_of(it, typeof(*engine), uabi_list); @@ -222,15 +224,14 @@ void intel_engines_driver_register(struct drm_i915_private *i915) GEM_BUG_ON(engine->class >= ARRAY_SIZE(uabi_classes)); engine->uabi_class = uabi_classes[engine->class]; - if (engine->uabi_class == I915_NO_UABI_CLASS) { - name_instance = other_instance++; - } else { - GEM_BUG_ON(engine->uabi_class >= - ARRAY_SIZE(i915->engine_uabi_class_count)); - name_instance = - i915->engine_uabi_class_count[engine->uabi_class]++; - } - engine->uabi_instance = name_instance; + + if (engine->uabi_class == I915_NO_UABI_CLASS) + uabi_class = I915_LAST_UABI_ENGINE_CLASS + 1; + else + uabi_class = engine->uabi_class; + + GEM_BUG_ON(uabi_class >= ARRAY_SIZE(class_instance)); + engine->uabi_instance = class_instance[uabi_class]++; /* * Replace the internal name with the final user and log facing @@ -238,11 +239,15 @@ void intel_engines_driver_register(struct drm_i915_private *i915) */ engine_rename(engine, intel_engine_class_repr(engine->class), - name_instance); + engine->uabi_instance); - if (engine->uabi_class == I915_NO_UABI_CLASS) + if (uabi_class > I915_LAST_UABI_ENGINE_CLASS) continue; + GEM_BUG_ON(uabi_class >= + ARRAY_SIZE(i915->engine_uabi_class_count)); + i915->engine_uabi_class_count[uabi_class]++; Shouldn't this be i915->engine_uabi_class_count[uabi_class] = class_instance[uabi_class]; ? What I see is that this patch mainly adding this class_instance array and rest looks the same. May be it make sense to add other upcoming patches to better understand why we need this patch. Regards, Nirmoy + rb_link_node(>uabi_node, prev, p); rb_insert_color(>uabi_node, >uabi_engines);
[PATCH v2 1/2] drm/i915: Refactor confusing __intel_gt_reset()
__intel_gt_reset() is really for resetting engines though the name might suggest something else. So add a helper function to remove confusions with no functional changes. v2: Move intel_gt_reset_all_engines() next to intel_gt_reset_engine() to make diff simple(John) Cc: John Harrison Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_engine_cs.c | 2 +- .../drm/i915/gt/intel_execlists_submission.c | 2 +- drivers/gpu/drm/i915/gt/intel_gt.c| 2 +- drivers/gpu/drm/i915/gt/intel_gt_pm.c | 2 +- drivers/gpu/drm/i915/gt/intel_reset.c | 35 +++ drivers/gpu/drm/i915/gt/intel_reset.h | 3 +- drivers/gpu/drm/i915/gt/selftest_reset.c | 2 +- drivers/gpu/drm/i915/i915_driver.c| 2 +- 8 files changed, 37 insertions(+), 13 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c index 8c44af1c3451..5c8e9ee3b008 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c @@ -678,7 +678,7 @@ void intel_engines_release(struct intel_gt *gt) */ GEM_BUG_ON(intel_gt_pm_is_awake(gt)); if (!INTEL_INFO(gt->i915)->gpu_reset_clobbers_display) - __intel_gt_reset(gt, ALL_ENGINES); + intel_gt_reset_all_engines(gt); /* Decouple the backend; but keep the layout for late GPU resets */ for_each_engine(engine, gt, id) { diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c index 355aab5b38ba..21829439e686 100644 --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c @@ -2898,7 +2898,7 @@ static void enable_error_interrupt(struct intel_engine_cs *engine) drm_err(>i915->drm, "engine '%s' resumed still in error: %08x\n", engine->name, status); - __intel_gt_reset(engine->gt, engine->mask); + intel_gt_reset_engine(engine); } /* diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c index 580b5141ce1e..626b166e67ef 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.c +++ b/drivers/gpu/drm/i915/gt/intel_gt.c @@ -832,7 +832,7 @@ void intel_gt_driver_unregister(struct intel_gt *gt) /* Scrub all HW state upon release */ with_intel_runtime_pm(gt->uncore->rpm, wakeref) - __intel_gt_reset(gt, ALL_ENGINES); + intel_gt_reset_all_engines(gt); } void intel_gt_driver_release(struct intel_gt *gt) diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c index 220ac4f92edf..c08fdb65cc69 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c @@ -159,7 +159,7 @@ static bool reset_engines(struct intel_gt *gt) if (INTEL_INFO(gt->i915)->gpu_reset_clobbers_display) return false; - return __intel_gt_reset(gt, ALL_ENGINES) == 0; + return intel_gt_reset_all_engines(gt) == 0; } static void gt_sanitize(struct intel_gt *gt, bool force) diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c index c8e9aa41fdea..b1393863ca9b 100644 --- a/drivers/gpu/drm/i915/gt/intel_reset.c +++ b/drivers/gpu/drm/i915/gt/intel_reset.c @@ -764,7 +764,7 @@ wa_14015076503_end(struct intel_gt *gt, intel_engine_mask_t engine_mask) HECI_H_GS1_ER_PREP, 0); } -int __intel_gt_reset(struct intel_gt *gt, intel_engine_mask_t engine_mask) +static int __intel_gt_reset(struct intel_gt *gt, intel_engine_mask_t engine_mask) { const int retries = engine_mask == ALL_ENGINES ? RESET_MAX_RETRIES : 1; reset_func reset; @@ -978,7 +978,7 @@ static void __intel_gt_set_wedged(struct intel_gt *gt) /* Even if the GPU reset fails, it should still stop the engines */ if (!INTEL_INFO(gt->i915)->gpu_reset_clobbers_display) - __intel_gt_reset(gt, ALL_ENGINES); + intel_gt_reset_all_engines(gt); for_each_engine(engine, gt, id) engine->submit_request = nop_submit_request; @@ -1089,7 +1089,7 @@ static bool __intel_gt_unset_wedged(struct intel_gt *gt) /* We must reset pending GPU events before restoring our submission */ ok = !HAS_EXECLISTS(gt->i915); /* XXX better agnosticism desired */ if (!INTEL_INFO(gt->i915)->gpu_reset_clobbers_display) - ok = __intel_gt_reset(gt, ALL_ENGINES) == 0; + ok = intel_gt_reset_all_engines(gt) == 0; if (!ok) { /* * Warn CI about the unrecoverable wedged condition. @@ -1133,10 +1133,10 @@ static int do_reset(struct intel_gt *gt, intel_engine_mask_t stalled_mask) { int err, i;
[PATCH v2 2/2] drm/i915: Fix gt reset with GuC submission is disabled
Currently intel_gt_reset() kills the GuC and then resets requested engines. This is problematic because there is a dedicated CSB FIFO which only GuC can access and if that FIFO fills up, the hardware will block on the next context switch until there is space that means the system is effectively hung. If an engine is reset whilst actively executing a context, a CSB entry will be sent to say that the context has gone idle. Thus if reset happens on a very busy system then killing GuC before killing the engines will lead to deadlock because of filled up CSB FIFO. To address this issue, the GuC should be killed only after resetting the requested engines and before calling intel_gt_init_hw(). v2: Improve commit message(John) Cc: John Harrison Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_reset.c | 16 ++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c index b1393863ca9b..6161f7a3ff70 100644 --- a/drivers/gpu/drm/i915/gt/intel_reset.c +++ b/drivers/gpu/drm/i915/gt/intel_reset.c @@ -879,8 +879,17 @@ static intel_engine_mask_t reset_prepare(struct intel_gt *gt) intel_engine_mask_t awake = 0; enum intel_engine_id id; - /* For GuC mode, ensure submission is disabled before stopping ring */ - intel_uc_reset_prepare(>uc); + /** +* For GuC mode with submission enabled, ensure submission +* is disabled before stopping ring. +* +* For GuC mode with submission disabled, ensure that GuC is not +* sanitized, do that after engine reset. reset_prepare() +* is followed by engine reset which in this mode requires GuC to +* process any CSB FIFO entries generated by the resets. +*/ + if (intel_uc_uses_guc_submission(>uc)) + intel_uc_reset_prepare(>uc); for_each_engine(engine, gt, id) { if (intel_engine_pm_get_if_awake(engine)) @@ -1227,6 +1236,9 @@ void intel_gt_reset(struct intel_gt *gt, intel_overlay_reset(gt->i915); + /* sanitize uC after engine reset */ + if (!intel_uc_uses_guc_submission(>uc)) + intel_uc_reset_prepare(>uc); /* * Next we need to restore the context, but we don't use those * yet either... -- 2.42.0
Re: [PATCH 3/3] drm/i915: Fix gt reset with GuC submission disabled
Hi John, On 4/19/2024 1:38 AM, John Harrison wrote: On 4/18/2024 10:10, Nirmoy Das wrote: Currently intel_gt_reset() happens as follows: reset_prepare() ---> Sends GDRST to GuC, GuC is in GS_MIA_IN_RESET do_reset() intel_gt_reset_all_engines() *_engine_reset_prepare() -->RESET_CTL expects running GuC Not technically correct. There is no direct connection between RESET_CTL and GuC. *_reset_engines() intel_gt_init_hw() --> GuC comes out of GS_MIA_IN_RESET with FW loaded. Fix the issue by sanitizing the GuC only after resetting requested engines and before intel_gt_init_hw(). You never actually state what the issue is. The problem is that there is a dedicated CSB FIFO going to GuC (and nothing else has access to it). If that FIFO fills up, the hardware will block on the next context switch until there is space. If no-one (i.e. GuC) is draining it, that means the system is effectively hung. If an engine is reset whilst actively executing a context, a CSB entry will be sent to say that the context has gone idle. Thus if you reset a very busy system and start with killing GuC before killing the engines and only then re-enabling GuC, you run the risk of generating more CSB entries than will fit in the FIFO and deadlocking. Whereas, if the system is idle then you can reset the engines as much as you like while GuC is dead and it won't be a problem. I wasn't sure if I could talk about internal details so kept it to minimal. I will borrow above explanation and resend :) Note intel_uc_reset_finish() and intel_uc_reset() are nop when guc submission is disabled. Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_reset.c | 16 ++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c index 6504e8ba9c58..bd166f5aca4b 100644 --- a/drivers/gpu/drm/i915/gt/intel_reset.c +++ b/drivers/gpu/drm/i915/gt/intel_reset.c @@ -907,8 +907,17 @@ static intel_engine_mask_t reset_prepare(struct intel_gt *gt) intel_engine_mask_t awake = 0; enum intel_engine_id id; - /* For GuC mode, ensure submission is disabled before stopping ring */ - intel_uc_reset_prepare(>uc); + /** + * For GuC mode with submission enabled, ensure submission + * is disabled before stopping ring. + * + * For GuC mode with submission disabled, ensure that GuC is not + * sanitized, do that at the end in reset_finish(). reset_prepare() + * is followed by engine reset which in this mode requires GuC to + * be functional to process engine reset events. -> to process any CSB FIFO entries generated by the resets. I will add this. Thanks, Nirmoy John. + */ + if (intel_uc_uses_guc_submission(>uc)) + intel_uc_reset_prepare(>uc); for_each_engine(engine, gt, id) { if (intel_engine_pm_get_if_awake(engine)) @@ -1255,6 +1264,9 @@ void intel_gt_reset(struct intel_gt *gt, intel_overlay_reset(gt->i915); + /* sanitize uC after engine reset */ + if (!intel_uc_uses_guc_submission(>uc)) + intel_uc_reset_prepare(>uc); /* * Next we need to restore the context, but we don't use those * yet either...
Re: [PATCH 2/3] drm/i915 Rename intel_engine_reset to intel_gt_engine_recover
Hi John, On 4/19/2024 1:27 AM, John Harrison wrote: On 4/18/2024 10:10, Nirmoy Das wrote: intel_engine_reset() not only reset a engine but also tries to recover it so give it a proper name without any functional changes. Not seeing what the difference is. If this was a super low level function (with an __ prefix for example) then one might expect it to literally just poke the reset register and leave the engine in a dead state. But as a high level function, I think it is reasonable to expect a reset function to 'recover' the entity being reset. Also, many of the callers are tests that are explicitly testing reset. So now the tests all talk about attempting resets, resets failing, etc. but around a call to 'recover' instead of 'reset', which seems confusing. Didn't think about it, I will drop it. Thanks, Nirmoy John. Signed-off-by: Nirmoy Das --- .../drm/i915/gem/selftests/i915_gem_context.c | 2 +- .../drm/i915/gt/intel_execlists_submission.c | 2 +- drivers/gpu/drm/i915/gt/intel_reset.c | 4 ++-- drivers/gpu/drm/i915/gt/intel_reset.h | 4 ++-- drivers/gpu/drm/i915/gt/selftest_hangcheck.c | 20 +-- drivers/gpu/drm/i915/gt/selftest_mocs.c | 4 ++-- drivers/gpu/drm/i915/gt/selftest_reset.c | 2 +- .../gpu/drm/i915/gt/selftest_workarounds.c | 6 +++--- 8 files changed, 22 insertions(+), 22 deletions(-) diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c index 89d4dc8b60c6..4f4cde55f621 100644 --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c @@ -1171,7 +1171,7 @@ __sseu_finish(const char *name, int ret = 0; if (flags & TEST_RESET) { - ret = intel_engine_reset(ce->engine, "sseu"); + ret = intel_gt_engine_recover(ce->engine, "sseu"); if (ret) goto out; } diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c index 21829439e686..9485a622a704 100644 --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c @@ -2404,7 +2404,7 @@ static void execlists_reset(struct intel_engine_cs *engine, const char *msg) ring_set_paused(engine, 1); /* Freeze the current request in place */ execlists_capture(engine); - intel_engine_reset(engine, msg); + intel_gt_engine_recover(engine, msg); tasklet_enable(>sched_engine->tasklet); clear_and_wake_up_bit(bit, lock); diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c index b825daace58e..6504e8ba9c58 100644 --- a/drivers/gpu/drm/i915/gt/intel_reset.c +++ b/drivers/gpu/drm/i915/gt/intel_reset.c @@ -1348,7 +1348,7 @@ int __intel_engine_reset_bh(struct intel_engine_cs *engine, const char *msg) } /** - * intel_engine_reset - reset GPU engine to recover from a hang + * intel_gt_engine_recover - reset GPU engine to recover from a hang * @engine: engine to reset * @msg: reason for GPU reset; or NULL for no drm_notice() * @@ -1360,7 +1360,7 @@ int __intel_engine_reset_bh(struct intel_engine_cs *engine, const char *msg) * - reset engine (which will force the engine to idle) * - re-init/configure engine */ -int intel_engine_reset(struct intel_engine_cs *engine, const char *msg) +int intel_gt_engine_recover(struct intel_engine_cs *engine, const char *msg) { int err; diff --git a/drivers/gpu/drm/i915/gt/intel_reset.h b/drivers/gpu/drm/i915/gt/intel_reset.h index c00de353075c..be984357bf27 100644 --- a/drivers/gpu/drm/i915/gt/intel_reset.h +++ b/drivers/gpu/drm/i915/gt/intel_reset.h @@ -31,8 +31,8 @@ void intel_gt_handle_error(struct intel_gt *gt, void intel_gt_reset(struct intel_gt *gt, intel_engine_mask_t stalled_mask, const char *reason); -int intel_engine_reset(struct intel_engine_cs *engine, - const char *reason); +int intel_gt_engine_recover(struct intel_engine_cs *engine, + const char *reason); int __intel_engine_reset_bh(struct intel_engine_cs *engine, const char *reason); diff --git a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c index 9ce8ff1c04fe..9bfda3f2bd24 100644 --- a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c +++ b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c @@ -495,9 +495,9 @@ static int igt_reset_nop_engine(void *arg) i915_request_add(rq); } - err = intel_engine_reset(engine, NULL); + err = intel_gt_engine_recover(engine, NULL); if (err) { - pr_err("intel_engine_reset(%s) failed, err:%d\n", + pr_err("intel_gt_engine_recover(%s) failed, err:%d\n",
Re: [PATCH 1/3] drm/i915: Refactor confusing __intel_gt_reset()
Hi John. On 4/19/2024 1:27 AM, John Harrison wrote: On 4/18/2024 10:10, Nirmoy Das wrote: __intel_gt_reset() is really for resetting engines though the name might suggest something else. So add two helper functions to remove confusions with no functional changes. Technically you only added one and just moved the other :). It already existed, it just wasn't being used everywhere that it could be! I did have one more helper functions but I removed it in favor of intel_gt_reset_engine() but didn't change the commit message :/ Thanks for catching it. I will fix it. Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_engine_cs.c | 2 +- .../drm/i915/gt/intel_execlists_submission.c | 2 +- drivers/gpu/drm/i915/gt/intel_gt.c | 2 +- drivers/gpu/drm/i915/gt/intel_gt_pm.c | 2 +- drivers/gpu/drm/i915/gt/intel_reset.c | 43 ++- drivers/gpu/drm/i915/gt/intel_reset.h | 3 +- drivers/gpu/drm/i915/gt/selftest_reset.c | 2 +- drivers/gpu/drm/i915/i915_driver.c | 2 +- 8 files changed, 41 insertions(+), 17 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c index 8c44af1c3451..5c8e9ee3b008 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c @@ -678,7 +678,7 @@ void intel_engines_release(struct intel_gt *gt) */ GEM_BUG_ON(intel_gt_pm_is_awake(gt)); if (!INTEL_INFO(gt->i915)->gpu_reset_clobbers_display) - __intel_gt_reset(gt, ALL_ENGINES); + intel_gt_reset_all_engines(gt); /* Decouple the backend; but keep the layout for late GPU resets */ for_each_engine(engine, gt, id) { diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c index 355aab5b38ba..21829439e686 100644 --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c @@ -2898,7 +2898,7 @@ static void enable_error_interrupt(struct intel_engine_cs *engine) drm_err(>i915->drm, "engine '%s' resumed still in error: %08x\n", engine->name, status); - __intel_gt_reset(engine->gt, engine->mask); + intel_gt_reset_engine(engine); } /* diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c index 580b5141ce1e..626b166e67ef 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.c +++ b/drivers/gpu/drm/i915/gt/intel_gt.c @@ -832,7 +832,7 @@ void intel_gt_driver_unregister(struct intel_gt *gt) /* Scrub all HW state upon release */ with_intel_runtime_pm(gt->uncore->rpm, wakeref) - __intel_gt_reset(gt, ALL_ENGINES); + intel_gt_reset_all_engines(gt); } void intel_gt_driver_release(struct intel_gt *gt) diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c index 220ac4f92edf..c08fdb65cc69 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c @@ -159,7 +159,7 @@ static bool reset_engines(struct intel_gt *gt) if (INTEL_INFO(gt->i915)->gpu_reset_clobbers_display) return false; - return __intel_gt_reset(gt, ALL_ENGINES) == 0; + return intel_gt_reset_all_engines(gt) == 0; } static void gt_sanitize(struct intel_gt *gt, bool force) diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c index c8e9aa41fdea..b825daace58e 100644 --- a/drivers/gpu/drm/i915/gt/intel_reset.c +++ b/drivers/gpu/drm/i915/gt/intel_reset.c @@ -764,7 +764,7 @@ wa_14015076503_end(struct intel_gt *gt, intel_engine_mask_t engine_mask) HECI_H_GS1_ER_PREP, 0); } -int __intel_gt_reset(struct intel_gt *gt, intel_engine_mask_t engine_mask) +static int __intel_gt_reset(struct intel_gt *gt, intel_engine_mask_t engine_mask) { const int retries = engine_mask == ALL_ENGINES ? RESET_MAX_RETRIES : 1; reset_func reset; @@ -795,6 +795,34 @@ int __intel_gt_reset(struct intel_gt *gt, intel_engine_mask_t engine_mask) return ret; } +/** + * intel_gt_reset_all_engines() - Reset all engines in the given gt. + * @gt: the GT to reset all engines for. + * + * This function resets all engines within the given gt. + * + * Returns: + * Zero on success, negative error code on failure. + */ +int intel_gt_reset_all_engines(struct intel_gt *gt) +{ + return __intel_gt_reset(gt, ALL_ENGINES); +} + +/** + * intel_gt_reset_engine() - Reset a specific engine within a gt. + * @engine: engine to be reset. + * + * This function resets the specified engine within a gt. + * + * Returns: + * Zero on success, negative error code on failure. + */ +int intel_gt_reset_engine(struct intel_engine_cs *engine) +{ + return __intel_gt_reset(engine->gt, engine->mask); +} + You could have just dropped the
[PATCH 3/3] drm/i915: Fix gt reset with GuC submission disabled
Currently intel_gt_reset() happens as follows: reset_prepare() ---> Sends GDRST to GuC, GuC is in GS_MIA_IN_RESET do_reset() intel_gt_reset_all_engines() *_engine_reset_prepare() -->RESET_CTL expects running GuC *_reset_engines() intel_gt_init_hw() --> GuC comes out of GS_MIA_IN_RESET with FW loaded. Fix the issue by sanitizing the GuC only after resetting requested engines and before intel_gt_init_hw(). Note intel_uc_reset_finish() and intel_uc_reset() are nop when guc submission is disabled. Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_reset.c | 16 ++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c index 6504e8ba9c58..bd166f5aca4b 100644 --- a/drivers/gpu/drm/i915/gt/intel_reset.c +++ b/drivers/gpu/drm/i915/gt/intel_reset.c @@ -907,8 +907,17 @@ static intel_engine_mask_t reset_prepare(struct intel_gt *gt) intel_engine_mask_t awake = 0; enum intel_engine_id id; - /* For GuC mode, ensure submission is disabled before stopping ring */ - intel_uc_reset_prepare(>uc); + /** +* For GuC mode with submission enabled, ensure submission +* is disabled before stopping ring. +* +* For GuC mode with submission disabled, ensure that GuC is not +* sanitized, do that at the end in reset_finish(). reset_prepare() +* is followed by engine reset which in this mode requires GuC to +* be functional to process engine reset events. +*/ + if (intel_uc_uses_guc_submission(>uc)) + intel_uc_reset_prepare(>uc); for_each_engine(engine, gt, id) { if (intel_engine_pm_get_if_awake(engine)) @@ -1255,6 +1264,9 @@ void intel_gt_reset(struct intel_gt *gt, intel_overlay_reset(gt->i915); + /* sanitize uC after engine reset */ + if (!intel_uc_uses_guc_submission(>uc)) + intel_uc_reset_prepare(>uc); /* * Next we need to restore the context, but we don't use those * yet either... -- 2.42.0
[PATCH 2/3] drm/i915 Rename intel_engine_reset to intel_gt_engine_recover
intel_engine_reset() not only reset a engine but also tries to recover it so give it a proper name without any functional changes. Signed-off-by: Nirmoy Das --- .../drm/i915/gem/selftests/i915_gem_context.c | 2 +- .../drm/i915/gt/intel_execlists_submission.c | 2 +- drivers/gpu/drm/i915/gt/intel_reset.c | 4 ++-- drivers/gpu/drm/i915/gt/intel_reset.h | 4 ++-- drivers/gpu/drm/i915/gt/selftest_hangcheck.c | 20 +-- drivers/gpu/drm/i915/gt/selftest_mocs.c | 4 ++-- drivers/gpu/drm/i915/gt/selftest_reset.c | 2 +- .../gpu/drm/i915/gt/selftest_workarounds.c| 6 +++--- 8 files changed, 22 insertions(+), 22 deletions(-) diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c index 89d4dc8b60c6..4f4cde55f621 100644 --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c @@ -1171,7 +1171,7 @@ __sseu_finish(const char *name, int ret = 0; if (flags & TEST_RESET) { - ret = intel_engine_reset(ce->engine, "sseu"); + ret = intel_gt_engine_recover(ce->engine, "sseu"); if (ret) goto out; } diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c index 21829439e686..9485a622a704 100644 --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c @@ -2404,7 +2404,7 @@ static void execlists_reset(struct intel_engine_cs *engine, const char *msg) ring_set_paused(engine, 1); /* Freeze the current request in place */ execlists_capture(engine); - intel_engine_reset(engine, msg); + intel_gt_engine_recover(engine, msg); tasklet_enable(>sched_engine->tasklet); clear_and_wake_up_bit(bit, lock); diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c index b825daace58e..6504e8ba9c58 100644 --- a/drivers/gpu/drm/i915/gt/intel_reset.c +++ b/drivers/gpu/drm/i915/gt/intel_reset.c @@ -1348,7 +1348,7 @@ int __intel_engine_reset_bh(struct intel_engine_cs *engine, const char *msg) } /** - * intel_engine_reset - reset GPU engine to recover from a hang + * intel_gt_engine_recover - reset GPU engine to recover from a hang * @engine: engine to reset * @msg: reason for GPU reset; or NULL for no drm_notice() * @@ -1360,7 +1360,7 @@ int __intel_engine_reset_bh(struct intel_engine_cs *engine, const char *msg) * - reset engine (which will force the engine to idle) * - re-init/configure engine */ -int intel_engine_reset(struct intel_engine_cs *engine, const char *msg) +int intel_gt_engine_recover(struct intel_engine_cs *engine, const char *msg) { int err; diff --git a/drivers/gpu/drm/i915/gt/intel_reset.h b/drivers/gpu/drm/i915/gt/intel_reset.h index c00de353075c..be984357bf27 100644 --- a/drivers/gpu/drm/i915/gt/intel_reset.h +++ b/drivers/gpu/drm/i915/gt/intel_reset.h @@ -31,8 +31,8 @@ void intel_gt_handle_error(struct intel_gt *gt, void intel_gt_reset(struct intel_gt *gt, intel_engine_mask_t stalled_mask, const char *reason); -int intel_engine_reset(struct intel_engine_cs *engine, - const char *reason); +int intel_gt_engine_recover(struct intel_engine_cs *engine, + const char *reason); int __intel_engine_reset_bh(struct intel_engine_cs *engine, const char *reason); diff --git a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c index 9ce8ff1c04fe..9bfda3f2bd24 100644 --- a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c +++ b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c @@ -495,9 +495,9 @@ static int igt_reset_nop_engine(void *arg) i915_request_add(rq); } - err = intel_engine_reset(engine, NULL); + err = intel_gt_engine_recover(engine, NULL); if (err) { - pr_err("intel_engine_reset(%s) failed, err:%d\n", + pr_err("intel_gt_engine_recover(%s) failed, err:%d\n", engine->name, err); break; } @@ -574,7 +574,7 @@ static int igt_reset_fail_engine(void *arg) >reset.flags)); force_reset_timeout(engine); - err = intel_engine_reset(engine, NULL); + err = intel_gt_engine_recover(engine, NULL); cancel_reset_timeout(engine); if (err == 0) /* timeouts only generated on gen8+ */ goto skip;
[PATCH 1/3] drm/i915: Refactor confusing __intel_gt_reset()
__intel_gt_reset() is really for resetting engines though the name might suggest something else. So add two helper functions to remove confusions with no functional changes. Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_engine_cs.c | 2 +- .../drm/i915/gt/intel_execlists_submission.c | 2 +- drivers/gpu/drm/i915/gt/intel_gt.c| 2 +- drivers/gpu/drm/i915/gt/intel_gt_pm.c | 2 +- drivers/gpu/drm/i915/gt/intel_reset.c | 43 ++- drivers/gpu/drm/i915/gt/intel_reset.h | 3 +- drivers/gpu/drm/i915/gt/selftest_reset.c | 2 +- drivers/gpu/drm/i915/i915_driver.c| 2 +- 8 files changed, 41 insertions(+), 17 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c index 8c44af1c3451..5c8e9ee3b008 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c @@ -678,7 +678,7 @@ void intel_engines_release(struct intel_gt *gt) */ GEM_BUG_ON(intel_gt_pm_is_awake(gt)); if (!INTEL_INFO(gt->i915)->gpu_reset_clobbers_display) - __intel_gt_reset(gt, ALL_ENGINES); + intel_gt_reset_all_engines(gt); /* Decouple the backend; but keep the layout for late GPU resets */ for_each_engine(engine, gt, id) { diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c index 355aab5b38ba..21829439e686 100644 --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c @@ -2898,7 +2898,7 @@ static void enable_error_interrupt(struct intel_engine_cs *engine) drm_err(>i915->drm, "engine '%s' resumed still in error: %08x\n", engine->name, status); - __intel_gt_reset(engine->gt, engine->mask); + intel_gt_reset_engine(engine); } /* diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c index 580b5141ce1e..626b166e67ef 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.c +++ b/drivers/gpu/drm/i915/gt/intel_gt.c @@ -832,7 +832,7 @@ void intel_gt_driver_unregister(struct intel_gt *gt) /* Scrub all HW state upon release */ with_intel_runtime_pm(gt->uncore->rpm, wakeref) - __intel_gt_reset(gt, ALL_ENGINES); + intel_gt_reset_all_engines(gt); } void intel_gt_driver_release(struct intel_gt *gt) diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c index 220ac4f92edf..c08fdb65cc69 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c @@ -159,7 +159,7 @@ static bool reset_engines(struct intel_gt *gt) if (INTEL_INFO(gt->i915)->gpu_reset_clobbers_display) return false; - return __intel_gt_reset(gt, ALL_ENGINES) == 0; + return intel_gt_reset_all_engines(gt) == 0; } static void gt_sanitize(struct intel_gt *gt, bool force) diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c index c8e9aa41fdea..b825daace58e 100644 --- a/drivers/gpu/drm/i915/gt/intel_reset.c +++ b/drivers/gpu/drm/i915/gt/intel_reset.c @@ -764,7 +764,7 @@ wa_14015076503_end(struct intel_gt *gt, intel_engine_mask_t engine_mask) HECI_H_GS1_ER_PREP, 0); } -int __intel_gt_reset(struct intel_gt *gt, intel_engine_mask_t engine_mask) +static int __intel_gt_reset(struct intel_gt *gt, intel_engine_mask_t engine_mask) { const int retries = engine_mask == ALL_ENGINES ? RESET_MAX_RETRIES : 1; reset_func reset; @@ -795,6 +795,34 @@ int __intel_gt_reset(struct intel_gt *gt, intel_engine_mask_t engine_mask) return ret; } +/** + * intel_gt_reset_all_engines() - Reset all engines in the given gt. + * @gt: the GT to reset all engines for. + * + * This function resets all engines within the given gt. + * + * Returns: + * Zero on success, negative error code on failure. + */ +int intel_gt_reset_all_engines(struct intel_gt *gt) +{ + return __intel_gt_reset(gt, ALL_ENGINES); +} + +/** + * intel_gt_reset_engine() - Reset a specific engine within a gt. + * @engine: engine to be reset. + * + * This function resets the specified engine within a gt. + * + * Returns: + * Zero on success, negative error code on failure. + */ +int intel_gt_reset_engine(struct intel_engine_cs *engine) +{ + return __intel_gt_reset(engine->gt, engine->mask); +} + bool intel_has_gpu_reset(const struct intel_gt *gt) { if (!gt->i915->params.reset) @@ -978,7 +1006,7 @@ static void __intel_gt_set_wedged(struct intel_gt *gt) /* Even if the GPU reset fails, it should still stop the engines */ if (!INTEL_INFO(gt->i915)->gpu_reset_clobb
Re: [PATCH v11 02/10] drm/ttm/tests: Delete unnecessary config option
On 4/17/2024 3:03 PM, Karolina Stolarek wrote: DRM KUnit helpers are selected automatically when TTM tests are enabled, so there's no need to do it directly in the .kunitconfig file. Signed-off-by: Karolina Stolarek Reviewed-by: Nirmoy Das --- drivers/gpu/drm/ttm/tests/.kunitconfig | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/gpu/drm/ttm/tests/.kunitconfig b/drivers/gpu/drm/ttm/tests/.kunitconfig index 75fdce0cd98e..1ae1ffabd51e 100644 --- a/drivers/gpu/drm/ttm/tests/.kunitconfig +++ b/drivers/gpu/drm/ttm/tests/.kunitconfig @@ -1,4 +1,3 @@ CONFIG_KUNIT=y CONFIG_DRM=y -CONFIG_DRM_KUNIT_TEST_HELPERS=y CONFIG_DRM_TTM_KUNIT_TEST=y
Re: [PATCH v11 03/10] drm/ttm/tests: Set DMA mask in KUnit device
On 4/17/2024 3:03 PM, Karolina Stolarek wrote: In commit d393acce7b3f ("drm/tests: Switch to kunit devices"), DRM test helpers migrated away from using a dummy platform driver in favour of KUnit device. This means that DMA masks for the device are not set but are required by ttm_pool_alloc tests. Set the DMA mask for coherent mappings to unblock testing. Signed-off-by: Karolina Stolarek --- drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.c b/drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.c index 7b7c1fa805fc..cb1cd676f8ae 100644 --- a/drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.c +++ b/drivers/gpu/drm/ttm/tests/ttm_kunit_helpers.c @@ -98,6 +98,9 @@ struct ttm_test_devices *ttm_test_devices_basic(struct kunit *test) devs->dev = drm_kunit_helper_alloc_device(test); KUNIT_ASSERT_NOT_ERR_OR_NULL(test, devs->dev); + /* Set mask for alloc_coherent mappings to enable ttm_pool_alloc testing */ + devs->dev->coherent_dma_mask = -1; DMA_BIT_MASK() would be nice here. I wonder if it make sense to move that to kunit device related calls, anyway this is: Reviewed-by: Nirmoy Das + devs->drm = __drm_kunit_helper_alloc_drm_device(test, devs->dev, sizeof(*devs->drm), 0, DRIVER_GEM);
Re: [PATCH] drm/i915/guc: Fix the fix for reset lock confusion
On 3/30/2024 12:53 AM, john.c.harri...@intel.com wrote: From: John Harrison The previous fix for the circlular lock splat about the busyness worker wasn't quite complete. Even though the reset-in-progress flag is cleared at the start of intel_uc_reset_finish, the entire function is still inside the reset mutex lock. Not sure why the patch appeared to fix the issue both locally and in CI. However, it is now back again. There is a further complication the wedge code path within intel_gt_reset() jumps around so much it results in nested reset_prepare/_finish calls. That is, the call sequence is: intel_gt_reset | reset_prepare | __intel_gt_set_wedged | | reset_prepare | | reset_finish | reset_finish The nested finish means that even if the clear of the in-progress flag was moved to the end of _finish, it would still be clear for the entire second call. Surprisingly, this does not seem to be causing any other problems at present. As an aside, a wedge on fini does not call the finish functions at all. The reset_in_progress flag is left set (twice). So instead of trying to cancel the worker anywhere at all in the reset path, just add a cancel to intel_guc_submission_fini instead. Note that it is not a problem if the worker is still active during a reset. Either it will run before the reset path starts locking things and will simply block the reset code for a tiny amount of time. Or it will run after the locks have been acquired and will early exit due to the try-lock. Also, do not use the reset-in-progress flag to decide whether a synchronous cancel is safe (from a lockdep perspective) or not. Instead, use the actual reset mutex state (both the genuine one and the custom rolled BACKOFF one). Fixes: 0e00a8814eec ("drm/i915/guc: Avoid circular locking issue on busyness flush") Signed-off-by: John Harrison Cc: Zhanjun Dong Cc: John Harrison Cc: Andi Shyti Cc: Daniel Vetter Cc: Daniel Vetter Cc: Rodrigo Vivi Cc: Nirmoy Das Cc: Tvrtko Ursulin Cc: Umesh Nerlige Ramappa Cc: Andrzej Hajda Cc: Matt Roper Cc: Jonathan Cavitt Cc: Prathap Kumar Valsan Cc: Alan Previn Cc: Madhumitha Tolakanahalli Pradeep Cc: Daniele Ceraolo Spurio Cc: Ashutosh Dixit Cc: Dnyaneshwar Bhadane Thanks for the details, looks good to me: Reviewed-by: Nirmoy Das --- .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 23 --- drivers/gpu/drm/i915/gt/uc/intel_uc.c | 4 2 files changed, 13 insertions(+), 14 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index 16640d6dd0589..00757d6333e88 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -1403,14 +1403,17 @@ static void guc_cancel_busyness_worker(struct intel_guc *guc) * Trying to pass a 'need_sync' or 'in_reset' flag all the way down through * every possible call stack is unfeasible. It would be too intrusive to many * areas that really don't care about the GuC backend. However, there is the -* 'reset_in_progress' flag available, so just use that. +* I915_RESET_BACKOFF flag and the gt->reset.mutex can be tested for is_locked. +* So just use those. Note that testing both is required due to the hideously +* complex nature of the i915 driver's reset code paths. * * And note that in the case of a reset occurring during driver unload -* (wedge_on_fini), skipping the cancel in _prepare (when the reset flag is set -* is fine because there is another cancel in _finish (when the reset flag is -* not). +* (wedged_on_fini), skipping the cancel in reset_prepare/reset_fini (when the +* reset flag/mutex are set) is fine because there is another explicit cancel in +* intel_guc_submission_fini (when the reset flag/mutex are not). */ - if (guc_to_gt(guc)->uc.reset_in_progress) + if (mutex_is_locked(_to_gt(guc)->reset.mutex) || + test_bit(I915_RESET_BACKOFF, _to_gt(guc)->reset.flags)) cancel_delayed_work(>timestamp.work); else cancel_delayed_work_sync(>timestamp.work); @@ -1424,8 +1427,6 @@ static void __reset_guc_busyness_stats(struct intel_guc *guc) unsigned long flags; ktime_t unused; - guc_cancel_busyness_worker(guc); - spin_lock_irqsave(>timestamp.lock, flags); guc_update_pm_timestamp(guc, ); @@ -2004,13 +2005,6 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc) void intel_guc_submission_reset_finish(struct intel_guc *guc) { - /* -* Ensure the busyness worker gets cancelled even on a fatal wedge. -* Note that reset_prepare is not allowed to because it confuses lockdep. -*/ - if (guc_submission_initialized(guc)) - guc_cancel_busyness_
Re: [PATCH] drm/i915/gt: Limit the reserved VM space to only the platforms that need it
Hi Andi, On 3/27/2024 9:05 PM, Andi Shyti wrote: Commit 9bb66c179f50 ("drm/i915: Reserve some kernel space per vm") reduces the available VM space of one page in order to apply Wa_16018031267 and Wa_16018063123. This page was reserved indiscrimitely in all platforms even when not needed. Limit it to DG2 onwards. I would use "Limit it to platforms that need WAs" as those WA are only needed till 12.71, otherwise Reviewed-by: Nirmoy Das Fixes: 9bb66c179f50 ("drm/i915: Reserve some kernel space per vm") Signed-off-by: Andi Shyti Cc: Andrzej Hajda Cc: Chris Wilson Cc: Jonathan Cavitt Cc: Nirmoy Das --- drivers/gpu/drm/i915/gt/gen8_ppgtt.c | 3 +++ drivers/gpu/drm/i915/gt/intel_gt.c | 6 ++ drivers/gpu/drm/i915/gt/intel_gt.h | 9 + 3 files changed, 14 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c index 1bd0e041e15c..398d60a66410 100644 --- a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c +++ b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c @@ -961,6 +961,9 @@ static int gen8_init_rsvd(struct i915_address_space *vm) struct i915_vma *vma; int ret; + if (!intel_gt_needs_wa_16018031267(vm->gt)) + return 0; + /* The memory will be used only by GPU. */ obj = i915_gem_object_create_lmem(i915, PAGE_SIZE, I915_BO_ALLOC_VOLATILE | diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c index 2c6d31b8fc1a..580b5141ce1e 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.c +++ b/drivers/gpu/drm/i915/gt/intel_gt.c @@ -1024,6 +1024,12 @@ enum i915_map_type intel_gt_coherent_map_type(struct intel_gt *gt, return I915_MAP_WC; } +bool intel_gt_needs_wa_16018031267(struct intel_gt *gt) +{ + /* Wa_16018031267, Wa_16018063123 */ + return IS_GFX_GT_IP_RANGE(gt, IP_VER(12, 55), IP_VER(12, 71)); +} + bool intel_gt_needs_wa_22016122933(struct intel_gt *gt) { return MEDIA_VER_FULL(gt->i915) == IP_VER(13, 0) && gt->type == GT_MEDIA; diff --git a/drivers/gpu/drm/i915/gt/intel_gt.h b/drivers/gpu/drm/i915/gt/intel_gt.h index 6e7cab60834c..b5e114d284ad 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.h +++ b/drivers/gpu/drm/i915/gt/intel_gt.h @@ -82,17 +82,18 @@ struct drm_printer; ##__VA_ARGS__); \ } while (0) -#define NEEDS_FASTCOLOR_BLT_WABB(engine) ( \ - IS_GFX_GT_IP_RANGE(engine->gt, IP_VER(12, 55), IP_VER(12, 71)) && \ - engine->class == COPY_ENGINE_CLASS && engine->instance == 0) - static inline bool gt_is_root(struct intel_gt *gt) { return !gt->info.id; } +bool intel_gt_needs_wa_16018031267(struct intel_gt *gt); bool intel_gt_needs_wa_22016122933(struct intel_gt *gt); +#define NEEDS_FASTCOLOR_BLT_WABB(engine) ( \ + intel_gt_needs_wa_16018031267(engine->gt) && \ + engine->class == COPY_ENGINE_CLASS && engine->instance == 0) + static inline struct intel_gt *uc_to_gt(struct intel_uc *uc) { return container_of(uc, struct intel_gt, uc);
Re: [PATCH] drm/i915/gem: Calculate object page offset for partial memory mapping
Hi Andi, On 3/26/2024 12:12 PM, Andi Shyti wrote: Hi Nirmoy, ... diff --git a/drivers/gpu/drm/i915/gem/i915_gem_mman.c b/drivers/gpu/drm/i915/gem/i915_gem_mman.c index a2195e28b625..57a2dda2c3cc 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_mman.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_mman.c @@ -276,7 +276,7 @@ static vm_fault_t vm_fault_cpu(struct vm_fault *vmf) /* PTEs are revoked in obj->ops->put_pages() */ err = remap_io_sg(area, area->vm_start, area->vm_end - area->vm_start, - obj->mm.pages->sgl, iomap); + obj->mm.pages->sgl, 0, iomap); Why don't we need partial mmap for CPU but only for GTT ? As far as I understood we don't. I have a version with the CPU offset as well in trybot[*] But without support for segmented buffer objects, I don't know how much this has any effect. You confused me more :) Why segmented buffer object is needed for partial CPU mmap but not for GTT ? From high level, GTT and CPU both should support partial mmap unless I missing something here. Sounds like this also need to be cover by a IGT tests. Yes, I it does need some igt work, working on it. Don't we need "Fixes" tag for this? Why should we? I'm not fixing anything here, If userspace expects partial mmap to work then this is a bug/gap in i915 so we need to backport this as far as possible. Need some information about the requirement about why we need this patch suddenly? Regards, Nirmoy I'm just recalculating the mapping not starting from the beginning of the scatter page. Andi [*] https://patchwork.freedesktop.org/patch/584474/?series=131539=2
Re: [PATCH] drm/i915/gem: Calculate object page offset for partial memory mapping
Hi Andi, I have too many questions :) I think the patch makes sense but need more context, see below: On 3/25/2024 2:40 PM, Andi Shyti wrote: To enable partial memory mapping of GPU virtual memory, it's necessary to introduce an offset to the object's memory (obj->mm.pages) scatterlist. This adjustment compensates for instances when userspace mappings do not start from the beginning of the object. Based on a patch by Chris Wilson . Signed-off-by: Andi Shyti Cc: Chris Wilson Cc: Lionel Landwerlin --- drivers/gpu/drm/i915/gem/i915_gem_mman.c | 8 +--- drivers/gpu/drm/i915/i915_mm.c | 12 +++- drivers/gpu/drm/i915/i915_mm.h | 3 ++- 3 files changed, 18 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_mman.c b/drivers/gpu/drm/i915/gem/i915_gem_mman.c index a2195e28b625..57a2dda2c3cc 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_mman.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_mman.c @@ -276,7 +276,7 @@ static vm_fault_t vm_fault_cpu(struct vm_fault *vmf) /* PTEs are revoked in obj->ops->put_pages() */ err = remap_io_sg(area, area->vm_start, area->vm_end - area->vm_start, - obj->mm.pages->sgl, iomap); + obj->mm.pages->sgl, 0, iomap); Why don't we need partial mmap for CPU but only for GTT ? Sounds like this also need to be cover by a IGT tests. Don't we need "Fixes" tag for this? Regards, Nirmoy if (area->vm_flags & VM_WRITE) { GEM_BUG_ON(!i915_gem_object_has_pinned_pages(obj)); @@ -302,14 +302,16 @@ static vm_fault_t vm_fault_gtt(struct vm_fault *vmf) struct i915_ggtt *ggtt = to_gt(i915)->ggtt; bool write = area->vm_flags & VM_WRITE; struct i915_gem_ww_ctx ww; + unsigned long obj_offset; intel_wakeref_t wakeref; struct i915_vma *vma; pgoff_t page_offset; int srcu; int ret; - /* We don't use vmf->pgoff since that has the fake offset */ + obj_offset = area->vm_pgoff - drm_vma_node_start(>vma_node); page_offset = (vmf->address - area->vm_start) >> PAGE_SHIFT; + page_offset += obj_offset; trace_i915_gem_object_fault(obj, page_offset, true, write); @@ -404,7 +406,7 @@ static vm_fault_t vm_fault_gtt(struct vm_fault *vmf) /* Finally, remap it using the new GTT offset */ ret = remap_io_mapping(area, - area->vm_start + (vma->gtt_view.partial.offset << PAGE_SHIFT), + area->vm_start + ((vma->gtt_view.partial.offset - obj_offset) << PAGE_SHIFT), (ggtt->gmadr.start + i915_ggtt_offset(vma)) >> PAGE_SHIFT, min_t(u64, vma->size, area->vm_end - area->vm_start), >iomap); diff --git a/drivers/gpu/drm/i915/i915_mm.c b/drivers/gpu/drm/i915/i915_mm.c index 7998bc74ab49..f5c97a620962 100644 --- a/drivers/gpu/drm/i915/i915_mm.c +++ b/drivers/gpu/drm/i915/i915_mm.c @@ -122,13 +122,15 @@ int remap_io_mapping(struct vm_area_struct *vma, * @addr: target user address to start at * @size: size of map area * @sgl: Start sg entry + * @offset: offset from the start of the page * @iobase: Use stored dma address offset by this address or pfn if -1 * * Note: this is only safe if the mm semaphore is held when called. */ int remap_io_sg(struct vm_area_struct *vma, unsigned long addr, unsigned long size, - struct scatterlist *sgl, resource_size_t iobase) + struct scatterlist *sgl, unsigned long offset, + resource_size_t iobase) { struct remap_pfn r = { .mm = vma->vm_mm, @@ -141,6 +143,14 @@ int remap_io_sg(struct vm_area_struct *vma, /* We rely on prevalidation of the io-mapping to skip track_pfn(). */ GEM_BUG_ON((vma->vm_flags & EXPECTED_FLAGS) != EXPECTED_FLAGS); + while (offset >= sg_dma_len(r.sgt.sgp) >> PAGE_SHIFT) { + offset -= sg_dma_len(r.sgt.sgp) >> PAGE_SHIFT; + r.sgt = __sgt_iter(__sg_next(r.sgt.sgp), use_dma(iobase)); + if (!r.sgt.sgp) + return -EINVAL; + } + r.sgt.curr = offset << PAGE_SHIFT; + if (!use_dma(iobase)) flush_cache_range(vma, addr, size); diff --git a/drivers/gpu/drm/i915/i915_mm.h b/drivers/gpu/drm/i915/i915_mm.h index 04c8974d822b..69f9351b1a1c 100644 --- a/drivers/gpu/drm/i915/i915_mm.h +++ b/drivers/gpu/drm/i915/i915_mm.h @@ -30,6 +30,7 @@ int remap_io_mapping(struct vm_area_struct *vma, int remap_io_sg(struct vm_area_struct *vma, unsigned long addr, unsigned long size, - struct scatterlist *sgl, resource_size_t iobase); + struct scatterlist *sgl, unsigned long offset, + resource_size_t iobase); #endif /* __I915_MM_H__ */
Re: [PATCH v2] drm/i915/gt: Report full vm address range
Hi Andi, On 3/21/2024 4:17 PM, Andi Shyti wrote: Commit 9bb66c179f50 ("drm/i915: Reserve some kernel space per vm") has reserved an object for kernel space usage. Userspace, though, needs to know the full address range. In the former patch the reserved space was substructed from the total amount of the VM space. Add it back when the user requests the GTT size through ioctl (I915_CONTEXT_PARAM_GTT_SIZE). Fixes: 9bb66c179f50 ("drm/i915: Reserve some kernel space per vm") Signed-off-by: Andi Shyti Cc: Andrzej Hajda Cc: Chris Wilson Cc: Lionel Landwerlin Cc: Michal Mrozek Cc: Nirmoy Das Cc: # v6.2+ Acked-by: Michal Mrozek Acked-by: Lionel Landwerlin --- Hi, Just proposing a different implementation that doesn't affect i915 internally but provides the same result. Instead of not substracting the space during the reservation, I add it back during the ioctl call. All the "vm->rsvd.vma->node.size" looks a bit ugly, Yes, this need document and also vm->total should be vm->total and may be we should have vm->usable which will be used by kernel internal and return vm->total. For me, I am fine with the kernel change as long as UMD is aware/fine of side-effect if UMD ended up using the reserved page. Basically we need to document this well :) Also may be we should limit this reserving page only on platform where it is required ? Regards, Nirmoy but that's how it is. Maybe a comment can help to understand better why there is this addition. I kept the Ack from Michal and Lionel, because the outcome from userspace perspactive doesn't really change. Andi drivers/gpu/drm/i915/gem/i915_gem_context.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c index 81f65cab1330..60d9e7fe33b3 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c @@ -2454,7 +2454,7 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data, case I915_CONTEXT_PARAM_GTT_SIZE: args->size = 0; vm = i915_gem_context_get_eb_vm(ctx); - args->value = vm->total; + args->value = vm->total + vm->rsvd.vma->node.size; i915_vm_put(vm); break;
Re: [PATCH] drm/i915/gt: Report full vm address range
On 3/14/2024 3:04 PM, Lionel Landwerlin wrote: Hi Andi, In Mesa we've been relying on I915_CONTEXT_PARAM_GTT_SIZE so as long as that is adjusted by the kernel What do you mean by adjusted by, should it be a aligned size? I915_CONTEXT_PARAM_GTT_SIZE ioctl is returning vm->total which is adjusted(reduced by a page). This patch might cause silent error as it is not removing WABB which is using the reserved page to add dummy blt and if userspace is using that page then it will be overwritten. Regards, Nirmoy , we should be able to continue working without issues. Acked-by: Lionel Landwerlin Thanks, -Lionel On 13/03/2024 21:39, Andi Shyti wrote: Commit 9bb66c179f50 ("drm/i915: Reserve some kernel space per vm") has reserved an object for kernel space usage. Userspace, though, needs to know the full address range. Fixes: 9bb66c179f50 ("drm/i915: Reserve some kernel space per vm") Signed-off-by: Andi Shyti Cc: Andrzej Hajda Cc: Chris Wilson Cc: Lionel Landwerlin Cc: Michal Mrozek Cc: Nirmoy Das Cc: # v6.2+ --- drivers/gpu/drm/i915/gt/gen8_ppgtt.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c index fa46d2308b0e..d76831f50106 100644 --- a/drivers/gpu/drm/i915/gt/gen8_ppgtt.c +++ b/drivers/gpu/drm/i915/gt/gen8_ppgtt.c @@ -982,8 +982,9 @@ static int gen8_init_rsvd(struct i915_address_space *vm) vm->rsvd.vma = i915_vma_make_unshrinkable(vma); vm->rsvd.obj = obj; - vm->total -= vma->node.size; + return 0; + unref: i915_gem_object_put(obj); return ret;
Re: [PATCH] drm/i915/selftests: Pick correct caching mode.
On 3/12/2024 3:28 PM, Andi Shyti wrote: Hi Nirmoy, On Tue, Mar 12, 2024 at 12:18:15PM +0100, Nirmoy Das wrote: Caching mode is HW dependent so pick a correct one using intel_gt_coherent_map_type(). Cc: Andi Shyti Cc: Janusz Krzysztofik Cc: Jonathan Cavitt Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/10249 Signed-off-by: Nirmoy Das I think it's a good choice not to have the Fixes tag here either. Yes, fixes tag isn't needed for selftests Reviewed-by: Andi Shyti Thanks, Nirmoy Thanks, Andi
[PATCH] drm/i915/selftests: Pick correct caching mode.
Caching mode is HW dependent so pick a correct one using intel_gt_coherent_map_type(). Cc: Andi Shyti Cc: Janusz Krzysztofik Cc: Jonathan Cavitt Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/10249 Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c index d684a70f2c04..65a931ea80e9 100644 --- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c +++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_dmabuf.c @@ -7,6 +7,7 @@ #include "i915_drv.h" #include "i915_selftest.h" #include "gem/i915_gem_context.h" +#include "gt/intel_gt.h" #include "mock_context.h" #include "mock_dmabuf.h" @@ -155,6 +156,7 @@ static int verify_access(struct drm_i915_private *i915, struct file *file; u32 *vaddr; int err = 0, i; + unsigned int mode; file = mock_file(i915); if (IS_ERR(file)) @@ -194,7 +196,8 @@ static int verify_access(struct drm_i915_private *i915, if (err) goto out_file; - vaddr = i915_gem_object_pin_map_unlocked(native_obj, I915_MAP_WB); + mode = intel_gt_coherent_map_type(to_gt(i915), native_obj, true); + vaddr = i915_gem_object_pin_map_unlocked(native_obj, mode); if (IS_ERR(vaddr)) { err = PTR_ERR(vaddr); goto out_file; -- 2.42.0
Re: [PATCH v7 2/3] drm/i915: Remove extra multi-gt pm-references
On 3/5/2024 3:35 PM, Janusz Krzysztofik wrote: There was an attempt to fix an issue of illegal attempts to free a still active i915 VMA object when parking a GT believed to be idle, reported by CI on 2-GT Meteor Lake. As a solution, an extra wakeref for a Primary GT was acquired from i915_gem_do_execbuffer() -- see commit f56fe3e91787 ("drm/i915: Fix a VMA UAF for multi-gt platform"). However, that fix occurred insufficient -- the issue was still reported by CI. That wakeref was released on exit from i915_gem_do_execbuffer(), then potentially before completion of the request and deactivation of its associated VMAs. Moreover, CI reports indicated that single-GT platforms also suffered sporadically from the same race. Since the issue has now been fixed by a preceding patch "drm/i915/vma: Fix UAF on destroy against retire race", drop the no longer useful changes introduced by that insufficient fix. v3: Also drop the no longer used .wakeref_gt0 field from struct i915_execbuffer. v2: Avoid the word "revert" in commit message (Rodrigo), - update commit description reusing relevant chunks dropped from the description of the proper fix (Rodrigo). Signed-off-by: Janusz Krzysztofik Cc: Nirmoy Das Cc: Rodrigo Vivi Reviewed-by: Nirmoy Das --- drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c | 18 -- 1 file changed, 18 deletions(-) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c index d3a771afb083e..3f20fe3811999 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c @@ -255,7 +255,6 @@ struct i915_execbuffer { struct intel_context *context; /* logical state for the request */ struct i915_gem_context *gem_context; /** caller's context */ intel_wakeref_t wakeref; - intel_wakeref_t wakeref_gt0; /** our requests to build */ struct i915_request *requests[MAX_ENGINE_INSTANCE + 1]; @@ -2686,7 +2685,6 @@ static int eb_select_engine(struct i915_execbuffer *eb) { struct intel_context *ce, *child; - struct intel_gt *gt; unsigned int idx; int err; @@ -2710,17 +2708,10 @@ eb_select_engine(struct i915_execbuffer *eb) } } eb->num_batches = ce->parallel.number_children + 1; - gt = ce->engine->gt; for_each_child(ce, child) intel_context_get(child); eb->wakeref = intel_gt_pm_get(ce->engine->gt); - /* -* Keep GT0 active on MTL so that i915_vma_parked() doesn't -* free VMAs while execbuf ioctl is validating VMAs. -*/ - if (gt->info.id) - eb->wakeref_gt0 = intel_gt_pm_get(to_gt(gt->i915)); if (!test_bit(CONTEXT_ALLOC_BIT, >flags)) { err = intel_context_alloc_state(ce); @@ -2759,9 +2750,6 @@ eb_select_engine(struct i915_execbuffer *eb) return err; err: - if (gt->info.id) - intel_gt_pm_put(to_gt(gt->i915), eb->wakeref_gt0); - intel_gt_pm_put(ce->engine->gt, eb->wakeref); for_each_child(ce, child) intel_context_put(child); @@ -2775,12 +2763,6 @@ eb_put_engine(struct i915_execbuffer *eb) struct intel_context *child; i915_vm_put(eb->context->vm); - /* -* This works in conjunction with eb_select_engine() to prevent -* i915_vma_parked() from interfering while execbuf validates vmas. -*/ - if (eb->gt->info.id) - intel_gt_pm_put(to_gt(eb->gt->i915), eb->wakeref_gt0); intel_gt_pm_put(eb->context->engine->gt, eb->wakeref); for_each_child(eb->context, child) intel_context_put(child);
Re: [PATCH v7 1/3] drm/i915/vma: Fix UAF on destroy against retire race
On 3/5/2024 3:35 PM, Janusz Krzysztofik wrote: Object debugging tools were sporadically reporting illegal attempts to free a still active i915 VMA object when parking a GT believed to be idle. [161.359441] ODEBUG: free active (active state 0) object: 88811643b958 object type: i915_active hint: __i915_vma_active+0x0/0x50 [i915] [161.360082] WARNING: CPU: 5 PID: 276 at lib/debugobjects.c:514 debug_print_object+0x80/0xb0 ... [161.360304] CPU: 5 PID: 276 Comm: kworker/5:2 Not tainted 6.5.0-rc1-CI_DRM_13375-g003f860e5577+ #1 [161.360314] Hardware name: Intel Corporation Rocket Lake Client Platform/RocketLake S UDIMM 6L RVP, BIOS RKLSFWI1.R00.3173.A03.2204210138 04/21/2022 [161.360322] Workqueue: i915-unordered __intel_wakeref_put_work [i915] [161.360592] RIP: 0010:debug_print_object+0x80/0xb0 ... [161.361347] debug_object_free+0xeb/0x110 [161.361362] i915_active_fini+0x14/0x130 [i915] [161.361866] release_references+0xfe/0x1f0 [i915] [161.362543] i915_vma_parked+0x1db/0x380 [i915] [161.363129] __gt_park+0x121/0x230 [i915] [161.363515] intel_wakeref_put_last+0x1f/0x70 [i915] That has been tracked down to be happening when another thread is deactivating the VMA inside __active_retire() helper, after the VMA's active counter has been already decremented to 0, but before deactivation of the VMA's object is reported to the object debugging tool. We could prevent from that race by serializing i915_active_fini() with __active_retire() via ref->tree_lock, but that wouldn't stop the VMA from being used, e.g. from __i915_vma_retire() called at the end of __active_retire(), after that VMA has been already freed by a concurrent i915_vma_destroy() on return from the i915_active_fini(). Then, we should rather fix the issue at the VMA level, not in i915_active. Since __i915_vma_parked() is called from __gt_park() on last put of the GT's wakeref, the issue could be addressed by holding the GT wakeref long enough for __active_retire() to complete before that wakeref is released and the GT parked. I believe the issue was introduced by commit d93939730347 ("drm/i915: Remove the vma refcount") which moved a call to i915_active_fini() from a dropped i915_vma_release(), called on last put of the removed VMA kref, to i915_vma_parked() processing path called on last put of a GT wakeref. However, its visibility to the object debugging tool was suppressed by a bug in i915_active that was fixed two weeks later with commit e92eb246feb9 ("drm/i915/active: Fix missing debug object activation"). A VMA associated with a request doesn't acquire a GT wakeref by itself. Instead, it depends on a wakeref held directly by the request's active intel_context for a GT associated with its VM, and indirectly on that intel_context's engine wakeref if the engine belongs to the same GT as the VMA's VM. Those wakerefs are released asynchronously to VMA deactivation. Fix the issue by getting a wakeref for the VMA's GT when activating it, and putting that wakeref only after the VMA is deactivated. However, exclude global GTT from that processing path, otherwise the GPU never goes idle. Since __i915_vma_retire() may be called from atomic contexts, use async variant of wakeref put. Also, to avoid circular locking dependency, take care of acquiring the wakeref before VM mutex when both are needed. v7: Add inline comments with justifications for: - using untracked variants of intel_gt_pm_get/put() (Nirmoy), - using async variant of _put(), - not getting the wakeref in case of a global GTT, - always getting the first wakeref outside vm->mutex. v6: Since __i915_vma_active/retire() callbacks are not serialized, storing a wakeref tracking handle inside struct i915_vma is not safe, and there is no other good place for that. Use untracked variants of intel_gt_pm_get/put_async(). v5: Replace "tile" with "GT" across commit description (Rodrigo), - avoid mentioning multi-GT case in commit description (Rodrigo), - explain why we need to take a temporary wakeref unconditionally inside i915_vma_pin_ww() (Rodrigo). v4: Refresh on top of commit 5e4e06e4087e ("drm/i915: Track gt pm wakerefs") (Andi), - for more easy backporting, split out removal of former insufficient workarounds and move them to separate patches (Nirmoy). - clean up commit message and description a bit. v3: Identify root cause more precisely, and a commit to blame, - identify and drop former workarounds, - update commit message and description. v2: Get the wakeref before VM mutex to avoid circular locking dependency, - drop questionable Fixes: tag. Fixes: d93939730347 ("drm/i915: Remove the vma refcount") Closes:https://gitlab.freedesktop.org/drm/intel/issues/8875 Signed-off-by: Janusz Krzysztofik Cc: Thomas Hellström Cc: Nirmoy Das Cc: Andi Shyti Cc: Rodrigo Vivi Cc:sta...@vger.kernel.org # v5.19+ Reviewed-by: Nirmoy Das --- drivers/
Re: [PATCH v7 3/3] Revert "drm/i915: Wait for active retire before i915_active_fini()"
On 3/5/2024 3:35 PM, Janusz Krzysztofik wrote: This reverts commit 7a2280e8dcd2f1f436db9631287c0b21cf6a92b0, obsoleted by "drm/i915/vma: Fix UAF on destroy against retire race". Signed-off-by: Janusz Krzysztofik Cc: Nirmoy Das Reviewed-by: Nirmoy Das --- drivers/gpu/drm/i915/i915_vma.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c index b70715b1411d6..d2f064d2525cc 100644 --- a/drivers/gpu/drm/i915/i915_vma.c +++ b/drivers/gpu/drm/i915/i915_vma.c @@ -1776,8 +1776,6 @@ static void release_references(struct i915_vma *vma, struct intel_gt *gt, if (vm_ddestroy) i915_vm_resv_put(vma->vm); - /* Wait for async active retire */ - i915_active_wait(>active); i915_active_fini(>active); GEM_WARN_ON(vma->resource); i915_vma_free(vma);
Re: [PATCH v6 1/3] drm/i915/vma: Fix UAF on destroy against retire race
On 3/1/2024 8:29 AM, Janusz Krzysztofik wrote: Object debugging tools were sporadically reporting illegal attempts to free a still active i915 VMA object when parking a GT believed to be idle. [161.359441] ODEBUG: free active (active state 0) object: 88811643b958 object type: i915_active hint: __i915_vma_active+0x0/0x50 [i915] [161.360082] WARNING: CPU: 5 PID: 276 at lib/debugobjects.c:514 debug_print_object+0x80/0xb0 ... [161.360304] CPU: 5 PID: 276 Comm: kworker/5:2 Not tainted 6.5.0-rc1-CI_DRM_13375-g003f860e5577+ #1 [161.360314] Hardware name: Intel Corporation Rocket Lake Client Platform/RocketLake S UDIMM 6L RVP, BIOS RKLSFWI1.R00.3173.A03.2204210138 04/21/2022 [161.360322] Workqueue: i915-unordered __intel_wakeref_put_work [i915] [161.360592] RIP: 0010:debug_print_object+0x80/0xb0 ... [161.361347] debug_object_free+0xeb/0x110 [161.361362] i915_active_fini+0x14/0x130 [i915] [161.361866] release_references+0xfe/0x1f0 [i915] [161.362543] i915_vma_parked+0x1db/0x380 [i915] [161.363129] __gt_park+0x121/0x230 [i915] [161.363515] intel_wakeref_put_last+0x1f/0x70 [i915] That has been tracked down to be happening when another thread is deactivating the VMA inside __active_retire() helper, after the VMA's active counter has been already decremented to 0, but before deactivation of the VMA's object is reported to the object debugging tool. We could prevent from that race by serializing i915_active_fini() with __active_retire() via ref->tree_lock, but that wouldn't stop the VMA from being used, e.g. from __i915_vma_retire() called at the end of __active_retire(), after that VMA has been already freed by a concurrent i915_vma_destroy() on return from the i915_active_fini(). Then, we should rather fix the issue at the VMA level, not in i915_active. Since __i915_vma_parked() is called from __gt_park() on last put of the GT's wakeref, the issue could be addressed by holding the GT wakeref long enough for __active_retire() to complete before that wakeref is released and the GT parked. I believe the issue was introduced by commit d93939730347 ("drm/i915: Remove the vma refcount") which moved a call to i915_active_fini() from a dropped i915_vma_release(), called on last put of the removed VMA kref, to i915_vma_parked() processing path called on last put of a GT wakeref. However, its visibility to the object debugging tool was suppressed by a bug in i915_active that was fixed two weeks later with commit e92eb246feb9 ("drm/i915/active: Fix missing debug object activation"). A VMA associated with a request doesn't acquire a GT wakeref by itself. Instead, it depends on a wakeref held directly by the request's active intel_context for a GT associated with its VM, and indirectly on that intel_context's engine wakeref if the engine belongs to the same GT as the VMA's VM. Those wakerefs are released asynchronously to VMA deactivation. Fix the issue by getting a wakeref for the VMA's GT when activating it, and putting that wakeref only after the VMA is deactivated. However, exclude global GTT from that processing path, otherwise the GPU never goes idle. Since __i915_vma_retire() may be called from atomic contexts, use async variant of wakeref put. Also, to avoid circular locking dependency, take care of acquiring the wakeref before VM mutex when both are needed. v6: Since __i915_vma_active/retire() callbacks are not serialized, storing a wakeref tracking handle inside struct i915_vma is not safe, and there is no other good place for that. Use untracked variants of intel_gt_pm_get/put_async(). v5: Replace "tile" with "GT" across commit description (Rodrigo), - avoid mentioning multi-GT case in commit description (Rodrigo), - explain why we need to take a temporary wakeref unconditionally inside i915_vma_pin_ww() (Rodrigo). v4: Refresh on top of commit 5e4e06e4087e ("drm/i915: Track gt pm wakerefs") (Andi), - for more easy backporting, split out removal of former insufficient workarounds and move them to separate patches (Nirmoy). - clean up commit message and description a bit. v3: Identify root cause more precisely, and a commit to blame, - identify and drop former workarounds, - update commit message and description. v2: Get the wakeref before VM mutex to avoid circular locking dependency, - drop questionable Fixes: tag. Fixes: d93939730347 ("drm/i915: Remove the vma refcount") Closes: https://gitlab.freedesktop.org/drm/intel/issues/8875 Signed-off-by: Janusz Krzysztofik Cc: Thomas Hellström Cc: Nirmoy Das Cc: Andi Shyti Cc: Rodrigo Vivi Cc: sta...@vger.kernel.org # v5.19+ --- drivers/gpu/drm/i915/i915_vma.c | 26 +++--- 1 file changed, 19 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c index d09aad34ba37f..ffe81fe338f7e 100644 --- a/drivers/gpu/drm/i915/i915_vma.c +++ b/drivers/gpu/drm/
Re: [PATCH] drm/i915: check before removing mm notifier
On 2/28/2024 2:24 PM, Tvrtko Ursulin wrote: On 27/02/2024 09:26, Nirmoy Das wrote: Hi Tvrtko, On 2/27/2024 10:04 AM, Tvrtko Ursulin wrote: On 21/02/2024 11:52, Nirmoy Das wrote: Merged it to drm-intel-gt-next with s/check/Check Shouldn't this have had: Fixes: ed29c2691188 ("drm/i915: Fix userptr so we do not have to worry about obj->mm.lock, v7.") Cc: # v5.13+ ? Yes. Sorry, I missed that. Can we still the tag ? I've added them and force pushed the branch since commit was still at the top. Thanks a lot, Tvrtko! FYI + Jani, Joonas and Rodrigo Regards, Tvrtko Thanks, Nirmoy Regards, Tvrtko On 2/19/2024 1:50 PM, Nirmoy Das wrote: Error in mmu_interval_notifier_insert() can leave a NULL notifier.mm pointer. Catch that and return early. Cc: Andi Shyti Cc: Shawn Lee Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c index 0e21ce9d3e5a..61abfb505766 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c @@ -349,6 +349,9 @@ i915_gem_userptr_release(struct drm_i915_gem_object *obj) { GEM_WARN_ON(obj->userptr.page_ref); + if (!obj->userptr.notifier.mm) + return; + mmu_interval_notifier_remove(>userptr.notifier); obj->userptr.notifier.mm = NULL; }
Re: [PATCH] drm/i915: check before removing mm notifier
Hi Tvrtko, On 2/27/2024 10:04 AM, Tvrtko Ursulin wrote: On 21/02/2024 11:52, Nirmoy Das wrote: Merged it to drm-intel-gt-next with s/check/Check Shouldn't this have had: Fixes: ed29c2691188 ("drm/i915: Fix userptr so we do not have to worry about obj->mm.lock, v7.") Cc: # v5.13+ ? Yes. Sorry, I missed that. Can we still the tag ? Thanks, Nirmoy Regards, Tvrtko On 2/19/2024 1:50 PM, Nirmoy Das wrote: Error in mmu_interval_notifier_insert() can leave a NULL notifier.mm pointer. Catch that and return early. Cc: Andi Shyti Cc: Shawn Lee Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c index 0e21ce9d3e5a..61abfb505766 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c @@ -349,6 +349,9 @@ i915_gem_userptr_release(struct drm_i915_gem_object *obj) { GEM_WARN_ON(obj->userptr.page_ref); + if (!obj->userptr.notifier.mm) + return; + mmu_interval_notifier_remove(>userptr.notifier); obj->userptr.notifier.mm = NULL; }
Re: [PATCH] drm/i915: check before removing mm notifier
Merged it to drm-intel-gt-next with s/check/Check On 2/19/2024 1:50 PM, Nirmoy Das wrote: Error in mmu_interval_notifier_insert() can leave a NULL notifier.mm pointer. Catch that and return early. Cc: Andi Shyti Cc: Shawn Lee Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c index 0e21ce9d3e5a..61abfb505766 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c @@ -349,6 +349,9 @@ i915_gem_userptr_release(struct drm_i915_gem_object *obj) { GEM_WARN_ON(obj->userptr.page_ref); + if (!obj->userptr.notifier.mm) + return; + mmu_interval_notifier_remove(>userptr.notifier); obj->userptr.notifier.mm = NULL; }
Re: [PATCH] drm/i915: check before removing mm notifier
Hi Rodrigo, On 2/19/2024 9:12 PM, Rodrigo Vivi wrote: On Mon, Feb 19, 2024 at 01:50:47PM +0100, Nirmoy Das wrote: Error in mmu_interval_notifier_insert() can leave a NULL notifier.mm pointer. Catch that and return early. Cc: Andi Shyti Cc: Shawn Lee Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c index 0e21ce9d3e5a..61abfb505766 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c @@ -349,6 +349,9 @@ i915_gem_userptr_release(struct drm_i915_gem_object *obj) { GEM_WARN_ON(obj->userptr.page_ref); + if (!obj->userptr.notifier.mm) + return; + hmmm... right, it looks that we need this protection. But... I mean, feel free to use Reviewed-by: Rodrigo Vivi for this patch, but I believe that if this mmu insert failed we might have other deeper problems like when checking i915_gem_object_is_userptr() ? No?! We are returning an error if mmu insert fails while creating a userptr object so the obj struct is only available to obj cleanup methods. As far as I see, i915_gem_object_is_userptr() should not happen on such obj struct. Thanks, Nirmoy mmu_interval_notifier_remove(>userptr.notifier); obj->userptr.notifier.mm = NULL; } -- 2.42.0
[PATCH] drm/i915: check before removing mm notifier
Error in mmu_interval_notifier_insert() can leave a NULL notifier.mm pointer. Catch that and return early. Cc: Andi Shyti Cc: Shawn Lee Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c index 0e21ce9d3e5a..61abfb505766 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c @@ -349,6 +349,9 @@ i915_gem_userptr_release(struct drm_i915_gem_object *obj) { GEM_WARN_ON(obj->userptr.page_ref); + if (!obj->userptr.notifier.mm) + return; + mmu_interval_notifier_remove(>userptr.notifier); obj->userptr.notifier.mm = NULL; } -- 2.42.0
Re: [PATCH v5 0/3] drm/i915: Fix VMA UAF on destroy against deactivate race
Hi Janusz, There seems to be a regression in CI related to this: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_129026v2/bat-dg1-7/igt@gem_lmem_swapping@random-engi...@lmem0.html#dmesg-warnings1053 Please have a look. Regards, Nirmoy On 1/24/2024 6:13 PM, Janusz Krzysztofik wrote: Object debugging tools were sporadically reporting illegal attempts to free a still active i915 VMA object when parking a GT believed to be idle. [161.359441] ODEBUG: free active (active state 0) object: 88811643b958 object type: i915_active hint: __i915_vma_active+0x0/0x50 [i915] [161.360082] WARNING: CPU: 5 PID: 276 at lib/debugobjects.c:514 debug_print_object+0x80/0xb0 ... [161.360304] CPU: 5 PID: 276 Comm: kworker/5:2 Not tainted 6.5.0-rc1-CI_DRM_13375-g003f860e5577+ #1 [161.360314] Hardware name: Intel Corporation Rocket Lake Client Platform/RocketLake S UDIMM 6L RVP, BIOS RKLSFWI1.R00.3173.A03.2204210138 04/21/2022 [161.360322] Workqueue: i915-unordered __intel_wakeref_put_work [i915] [161.360592] RIP: 0010:debug_print_object+0x80/0xb0 ... [161.361347] debug_object_free+0xeb/0x110 [161.361362] i915_active_fini+0x14/0x130 [i915] [161.361866] release_references+0xfe/0x1f0 [i915] [161.362543] i915_vma_parked+0x1db/0x380 [i915] [161.363129] __gt_park+0x121/0x230 [i915] [161.363515] intel_wakeref_put_last+0x1f/0x70 [i915] That has been tracked down to be happening when another thread is deactivating the VMA inside __active_retire() helper, after the VMA's active counter has been already decremented to 0, but before deactivation of the VMA's object is reported to the object debugging tool. We could prevent from that race by serializing i915_active_fini() with __active_retire() via ref->tree_lock, but that wouldn't stop the VMA from being used, e.g. from __i915_vma_retire() called at the end of __active_retire(), after that VMA has been already freed by a concurrent i915_vma_destroy() on return from the i915_active_fini(). Then, we should rather fix the issue at the VMA level, not in i915_active. Since __i915_vma_parked() is called from __gt_park() on last put of the GT's wakeref, the issue could be addressed by holding the GT wakeref long enough for __active_retire() to complete before that wakeref is released and the GT parked. A VMA associated with a request doesn't acquire a GT wakeref by itself. Instead, it depends on a wakeref held directly by the request's active intel_context for a GT associated with its VM, and indirectly on that intel_context's engine wakeref if the engine belongs to the same GT as the VMA's VM. Those wakerefs are released asynchronously to VMA deactivation. In case of single-GT platforms, at least one of those wakerefs is usually held long enough for the request's VMA to be deactivated on time, before it is destroyed on last put of its VM GT wakeref. However, on multi-GT platforms, a request may use a VMA from a GT other than the one that hosts the request's engine, then it is protected only with the intel_context's VM GT wakeref. There was an attempt to fix the issue on 2-GT Meteor Lake by acquiring an extra wakeref for a Primary GT from i915_gem_do_execbuffer() -- see commit f56fe3e91787 ("drm/i915: Fix a VMA UAF for multi-gt platform"). However, that fix occurred insufficient -- the issue was still reported by CI. That wakeref was released on exit from i915_gem_do_execbuffer(), then potentially before completion of the request and deactivation of its associated VMAs. Moreover, CI reports indicated that single-GT platforms also suffered sporadically from the same race. I believe the issue was introduced by commit d93939730347 ("drm/i915: Remove the vma refcount") which moved a call to i915_active_fini() from a dropped i915_vma_release(), called on last put of the removed VMA kref, to i915_vma_parked() processing path called on last put of a GT wakeref. However, its visibility to the object debugging tool was suppressed by a bug in i915_active that was fixed two weeks later with commit e92eb246feb9 ("drm/i915/active: Fix missing debug object activation"). Fix the issue by getting a wakeref for the VMA's GT when activating it, and putting that wakeref only after the VMA is deactivated. However, exclude global GTT from that processing path, otherwise the GPU never goes idle. Since __i915_vma_retire() may be called from atomic contexts, use async variant of wakeref put. Also, to avoid circular locking dependency, take care of acquiring the wakeref before VM mutex when both are needed. Having that fixed, stop explicitly acquiring the extra GT0 wakeref from inside i915_gem_do_execbuffer(), and also drop an extra call to i915_active_wait(), introduced by commit 7a2280e8dcd2 ("drm/i915: Wait for active retire before i915_active_fini()") as another insufficient fix for this UAF race. v5: Replace "tile" with "GT" across commit descriptions (Rodrigo), - reword commit message and description
Re: [PATCH v3 4/4] drm/i915/guc: Use the ce_to_guc() wrapper whenever possible
On 12/6/2023 9:46 PM, Andi Shyti wrote: Get the guc reference from the ce using the ce_to_guc() helper. Just a leftover from previous cleanups. Signed-off-by: Andi Shyti Reviewed-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index 4f51cc5f1604..3c7821ae9f0d 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -3513,7 +3513,7 @@ static inline void sub_context_inflight_prio(struct intel_context *ce, static inline void update_context_prio(struct intel_context *ce) { - struct intel_guc *guc = >engine->gt->uc.guc; + struct intel_guc *guc = ce_to_guc(ce); int i; BUILD_BUG_ON(GUC_CLIENT_PRIORITY_KMD_HIGH != 0);
Re: [PATCH v3 3/4] drm/i915: Use the new gt_to_guc() wrapper
On 12/6/2023 9:46 PM, Andi Shyti wrote: Get the guc reference from the gt using the gt_to_guc() helper. Signed-off-by: Andi Shyti Reviewed-by: Nirmoy Das --- drivers/gpu/drm/i915/i915_debugfs_params.c | 2 +- drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_debugfs_params.c b/drivers/gpu/drm/i915/i915_debugfs_params.c index 8bca02025e09..74b7f2fd8b57 100644 --- a/drivers/gpu/drm/i915/i915_debugfs_params.c +++ b/drivers/gpu/drm/i915/i915_debugfs_params.c @@ -43,7 +43,7 @@ static int notify_guc(struct drm_i915_private *i915) for_each_gt(gt, i915, i) { if (intel_uc_uses_guc_submission(>uc)) - ret = intel_guc_global_policies_update(>uc.guc); + ret = intel_guc_global_policies_update(gt_to_guc(gt)); } return ret; diff --git a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c index 2990dd4d4a0d..d9d8f0336702 100644 --- a/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c +++ b/drivers/gpu/drm/i915/selftests/intel_scheduler_helpers.c @@ -65,7 +65,7 @@ int intel_selftest_modify_policy(struct intel_engine_cs *engine, if (!intel_engine_uses_guc(engine)) return 0; - err = intel_guc_global_policies_update(>gt->uc.guc); + err = intel_guc_global_policies_update(gt_to_guc(engine->gt)); if (err) intel_selftest_restore_policy(engine, saved); @@ -84,7 +84,7 @@ int intel_selftest_restore_policy(struct intel_engine_cs *engine, if (!intel_engine_uses_guc(engine)) return 0; - return intel_guc_global_policies_update(>gt->uc.guc); + return intel_guc_global_policies_update(gt_to_guc(engine->gt)); } int intel_selftest_wait_for_rq(struct i915_request *rq)
Re: [PATCH v3 2/4] drm/i915/guc: Use the new gt_to_guc() wrapper
On 12/6/2023 9:46 PM, Andi Shyti wrote: Get the guc reference from the gt using the gt_to_guc() helper. Signed-off-by: Andi Shyti Reviewed-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c | 4 +-- drivers/gpu/drm/i915/gt/uc/intel_gsc_proxy.c | 3 +- drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c| 2 +- .../gpu/drm/i915/gt/uc/intel_guc_capture.c| 6 ++-- .../gpu/drm/i915/gt/uc/intel_guc_hwconfig.c | 2 +- .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 28 +-- drivers/gpu/drm/i915/gt/uc/intel_huc.c| 4 +-- drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c | 4 +-- drivers/gpu/drm/i915/gt/uc/selftest_guc.c | 2 +- 9 files changed, 28 insertions(+), 27 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c index e2e42b3e0d5d..3b69bc6616bd 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_fw.c @@ -298,7 +298,7 @@ static int gsc_fw_load_prepare(struct intel_gsc_uc *gsc) memcpy_toio(gsc->local_vaddr, src, gsc->fw.size); memset_io(gsc->local_vaddr + gsc->fw.size, 0, gsc->local->size - gsc->fw.size); - intel_guc_write_barrier(>uc.guc); + intel_guc_write_barrier(gt_to_guc(gt)); i915_gem_object_unpin_map(gsc->fw.obj); @@ -351,7 +351,7 @@ static int gsc_fw_query_compatibility_version(struct intel_gsc_uc *gsc) void *vaddr; int err; - err = intel_guc_allocate_and_map_vma(>uc.guc, GSC_VER_PKT_SZ * 2, + err = intel_guc_allocate_and_map_vma(gt_to_guc(gt), GSC_VER_PKT_SZ * 2, , ); if (err) { gt_err(gt, "failed to allocate vma for GSC version query\n"); diff --git a/drivers/gpu/drm/i915/gt/uc/intel_gsc_proxy.c b/drivers/gpu/drm/i915/gt/uc/intel_gsc_proxy.c index 40817ebcca71..a7d5465655f9 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_gsc_proxy.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_gsc_proxy.c @@ -358,7 +358,8 @@ static int proxy_channel_alloc(struct intel_gsc_uc *gsc) void *vaddr; int err; - err = intel_guc_allocate_and_map_vma(>uc.guc, GSC_PROXY_CHANNEL_SIZE, + err = intel_guc_allocate_and_map_vma(gt_to_guc(gt), +GSC_PROXY_CHANNEL_SIZE, , ); if (err) return err; diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c index 63724e17829a..1ef470e64604 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ads.c @@ -956,7 +956,7 @@ u32 intel_guc_engine_usage_offset(struct intel_guc *guc) struct iosys_map intel_guc_engine_usage_record_map(struct intel_engine_cs *engine) { - struct intel_guc *guc = >gt->uc.guc; + struct intel_guc *guc = gt_to_guc(engine->gt); u8 guc_class = engine_class_to_guc_class(engine->class); size_t offset = offsetof(struct __guc_ads_blob, engine_usage.engines[guc_class][ilog2(engine->logical_mask)]); diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c index a4da0208c883..84a8807391c5 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c @@ -1441,7 +1441,7 @@ int intel_guc_capture_print_engine_node(struct drm_i915_error_state_buf *ebuf, if (!cap || !ee->engine) return -ENODEV; - guc = >engine->gt->uc.guc; + guc = gt_to_guc(ee->engine->gt); i915_error_printf(ebuf, "global --- GuC Error Capture on %s command stream:\n", ee->engine->name); @@ -1543,7 +1543,7 @@ bool intel_guc_capture_is_matching_engine(struct intel_gt *gt, if (!gt || !ce || !engine) return false; - guc = >uc.guc; + guc = gt_to_guc(gt); if (!guc->capture) return false; @@ -1573,7 +1573,7 @@ void intel_guc_capture_get_matching_node(struct intel_gt *gt, if (!gt || !ee || !ce) return; - guc = >uc.guc; + guc = gt_to_guc(gt); if (!guc->capture) return; diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_hwconfig.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_hwconfig.c index cc9569af7f0c..b67a15f74276 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_hwconfig.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_hwconfig.c @@ -111,7 +111,7 @@ static bool has_table(struct drm_i915_private *i915) static int guc_hwconfig_init(struct intel_gt *gt) { struct intel_hwconfig *hwconfig = >info.hwconfig; - struct intel_guc *guc = >uc.guc; + struct intel_guc *guc = gt_to_guc(gt); int ret; if (!has_
Re: [PATCH v3 1/4] drm/i915/gt: Create the gt_to_guc() wrapper
On 12/6/2023 9:46 PM, Andi Shyti wrote: We already have guc_to_gt() and getting to guc from the GT it requires some mental effort. Add the gt_to_guc(). Given the reference to the "gt", the gt_to_guc() will return the pinter to the "guc". Update all the files under the gt/ directory. Signed-off-by: Andi Shyti Reviewed-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_engine_cs.c | 4 ++-- drivers/gpu/drm/i915/gt/intel_ggtt.c | 9 +++-- drivers/gpu/drm/i915/gt/intel_gt.h| 5 + drivers/gpu/drm/i915/gt/intel_gt_irq.c| 6 +++--- drivers/gpu/drm/i915/gt/intel_gt_pm_debugfs.c | 2 +- drivers/gpu/drm/i915/gt/intel_gt_sysfs_pm.c | 8 drivers/gpu/drm/i915/gt/intel_rc6.c | 4 ++-- drivers/gpu/drm/i915/gt/intel_rps.c | 2 +- drivers/gpu/drm/i915/gt/intel_tlb.c | 2 +- drivers/gpu/drm/i915/gt/selftest_slpc.c | 6 +++--- 10 files changed, 25 insertions(+), 23 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c index 40687806d22a..bede7f09d4af 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c @@ -589,7 +589,7 @@ u64 intel_clamp_preempt_timeout_ms(struct intel_engine_cs *engine, u64 value) * NB: The GuC API only supports 32bit values. However, the limit is further * reduced due to internal calculations which would otherwise overflow. */ - if (intel_guc_submission_is_wanted(>gt->uc.guc)) + if (intel_guc_submission_is_wanted(gt_to_guc(engine->gt))) value = min_t(u64, value, guc_policy_max_preempt_timeout_ms()); value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT)); @@ -610,7 +610,7 @@ u64 intel_clamp_timeslice_duration_ms(struct intel_engine_cs *engine, u64 value) * NB: The GuC API only supports 32bit values. However, the limit is further * reduced due to internal calculations which would otherwise overflow. */ - if (intel_guc_submission_is_wanted(>gt->uc.guc)) + if (intel_guc_submission_is_wanted(gt_to_guc(engine->gt))) value = min_t(u64, value, guc_policy_max_exec_quantum_ms()); value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT)); diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c b/drivers/gpu/drm/i915/gt/intel_ggtt.c index 21a7e3191c18..aa1e9249d393 100644 --- a/drivers/gpu/drm/i915/gt/intel_ggtt.c +++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c @@ -230,11 +230,8 @@ static void guc_ggtt_ct_invalidate(struct intel_gt *gt) struct intel_uncore *uncore = gt->uncore; intel_wakeref_t wakeref; - with_intel_runtime_pm_if_active(uncore->rpm, wakeref) { - struct intel_guc *guc = >uc.guc; - - intel_guc_invalidate_tlb_guc(guc); - } + with_intel_runtime_pm_if_active(uncore->rpm, wakeref) + intel_guc_invalidate_tlb_guc(gt_to_guc(gt)); } static void guc_ggtt_invalidate(struct i915_ggtt *ggtt) @@ -245,7 +242,7 @@ static void guc_ggtt_invalidate(struct i915_ggtt *ggtt) gen8_ggtt_invalidate(ggtt); list_for_each_entry(gt, >gt_list, ggtt_link) { - if (intel_guc_tlb_invalidation_is_available(>uc.guc)) + if (intel_guc_tlb_invalidation_is_available(gt_to_guc(gt))) guc_ggtt_ct_invalidate(gt); else if (GRAPHICS_VER(i915) >= 12) intel_uncore_write_fw(gt->uncore, diff --git a/drivers/gpu/drm/i915/gt/intel_gt.h b/drivers/gpu/drm/i915/gt/intel_gt.h index b0e453e27ea8..d7c859039828 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.h +++ b/drivers/gpu/drm/i915/gt/intel_gt.h @@ -118,6 +118,11 @@ static inline struct intel_gt *gsc_to_gt(struct intel_gsc *gsc) return container_of(gsc, struct intel_gt, gsc); } +static inline struct intel_guc *gt_to_guc(struct intel_gt *gt) +{ + return >uc.guc; +} + void intel_gt_common_init_early(struct intel_gt *gt); int intel_root_gt_init_early(struct drm_i915_private *i915); int intel_gt_assign_ggtt(struct intel_gt *gt); diff --git a/drivers/gpu/drm/i915/gt/intel_gt_irq.c b/drivers/gpu/drm/i915/gt/intel_gt_irq.c index 77fb57223465..ad4c51f18d3a 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_irq.c +++ b/drivers/gpu/drm/i915/gt/intel_gt_irq.c @@ -68,9 +68,9 @@ gen11_other_irq_handler(struct intel_gt *gt, const u8 instance, struct intel_gt *media_gt = gt->i915->media_gt; if (instance == OTHER_GUC_INSTANCE) - return guc_irq_handler(>uc.guc, iir); + return guc_irq_handler(gt_to_guc(gt), iir); if (instance == OTHER_MEDIA_GUC_INSTANCE && media_gt) - return guc_irq_handler(_gt->uc.guc, iir); + return guc_irq_handler(gt_to_guc(media_gt), iir); if (instance == OTHER_GTPM_INSTANC
Re: [PATCH v2] drm/i915/gt: Convert reset prepare failure log to trace
Hi John, On 12/5/2023 8:50 PM, John Harrison wrote: On 12/5/2023 02:39, Nirmoy Das wrote: Hi John, On 12/5/2023 10:10 AM, John Harrison wrote: On 12/5/2023 00:52, Nirmoy Das wrote: gen8_engine_reset_prepare() can fail when HW fails to set RESET_CTL_READY_TO_RESET bit. In some cases this is not fatal error as driver will retry. Convert the log to a trace log for debugging without triggering unnecessary concerns in CI or for end-users during non-fatal scenarios. I strongly disagree with this change. The hardware spec for the RESET_CTL and GDRST registers are that they will self clear within a matter of microseconds. If something is so badly wrong with the hardware that it can't even manage to reset This message is for reset readiness poll timeout not that the reset is failed which doesn't sound so serious if the subsequent attempt managed reset the engine. Not sure what the distinction is. The reset procedure is poke RESET_CTL wait for it to clear, poke GDRST and wait for it to clear. Just because step one is failing rather than step 2 does not mean that the reset as a whole has not failed. Note that the purpose of RESET_CTL is to pause a bunch of stuff like the command streamers to prevent them from issuing new memory requests while the reset is in progress. If it fails, it likely means that a CS is refusing to stop. Most probably because it can't reach a stopping point because it is stuck waiting on a lost memory request or similar. And the point of stopping further memory requests during reset is that if the memory channel gets out of sync (because only the GT side is reset during a GT reset) then that can result in total system failure. As in potentially even the CPU can no longer get to memory if it is an integrated platform. So yes, it can be quite a serious failure indeed. Thanks bspec didn't explain those details. My intention was to acknowledge that engine reset is a complicated process which why the driver retries and don't spook CI/user if subsequent reset works but I get your objection on this. I couldn't get enough details when this can happen that HW takes very long time to set the readiness bit. Is it simply 'taking a long time' or is never clearing at all? If it is just that the timeout is too short then the proper fix would be to increase the timeout. But if it is taking seconds or longer or just never succeeding at all, then something is very bad. I tried with 10x timeout without any help so I think the CS is stuck though re-try works. I will try to get more details from HW team on this issue. then that is something that very much warrants more than a completely silent trace event. It most certainly should be flagged as a failure in CI. Just because the driver will retry does not mean that this is not a serious error. And if the first attempt failed, why would a subsequent attempt succeed? The patch is not ignoring the failure. If the subsequent attempt fails then driver load will fail or it will be wedged if that happens after driver load. One thing I really hate about our driver is the total lack of information when something goes wrong during load. The driver wedges in total silence. There are many error paths that have no reporting at all. Which means you are left with a totally useless bug report. Escalating to FLR may have more success, but that is not something that i915 currently does. Do we still need to do FLR if a subsequent engine reset failure ? Assuming that we are talking about modern(ish) platforms, an engine reset failure would be hit by GuC rather than i915, but that would be escalated to an i915 based full GT reset. Generally speaking though, if the engine reset fails the GT reset isn't going to do much better. It would fix a dead GuC problem but it can't help with memory related issues. If the full GT reset fails then we are out of escalation routes as there is no FLR path at present (I think we have that at driver unload on MTL but not for general reset?). The FLR resets a lot more than just the GT, so it does have a chance to fix some issues that a GT reset can't. After driver-level FLR, there is PCI level FLR. Not sure if that involves a full power down and restart, but if not then that would be the last escalation possible. A power cycle really should fix any issues, if it doesn't then it's time to return the system as being totally dead! My recollection is that the vast majority of engine reset failures I've looked at have been completely catastrophic and the system only recovered after a reboot. I.e. after the card was power cycled. Such issues were generally caused by bad memory. Once the path to memory has died, there really is not much of the GPU that can do anything at all and there isn't much that can be done to recover it. Thanks, Nirmoy John. Regards, Nirmoy John. v2: Improve commit message(Tvrtko) Cc: Tvrtko Ursulin Cc: John Harrison
Re: [Intel-gfx] [PATCH] drm/i915/gt: Reduce log severity on reset prepare.
Hi Tvrtko, On 12/5/2023 11:05 AM, Tvrtko Ursulin wrote: On 05/12/2023 08:50, Nirmoy Das wrote: Hi Tvrtko, On 12/5/2023 9:34 AM, Tvrtko Ursulin wrote: On 01/12/2023 15:44, Nirmoy Das wrote: gen8_engine_reset_prepare() can fail when HW fails to set RESET_CTL_READY_TO_RESET bit. In some cases this is not fatal error as driver will retry. Let the caller of gen8_engine_reset_prepare() decide if a failure in gen8_engine_reset_prepare is an error or not. No complaints per se but I don't see the caller deciding and it is not really reducing log level but converting to trace. So commit message and patch do not align for me which I think should be improved. I meant the return value is checked by the caller, gen8_reset_engines(). I will resend with a improved commit message. Ah okay, maybe my bad for not figuring out that possibility. I guess it might be passable as is, but yes, clearer commit text would be better. I sent a v2 already :) Trace is good enough - we are not usually interested in seeing those as dbg/info/notice? Idea is that all the GT related events are recorded in trace and dmesg could be noisy some times. Regards, Nirmoy Regards, Tvrtko Thanks, Nirmoy Regards, Tvrtko Cc: Tvrtko Ursulin Cc: John Harrison Cc: Andi Shyti Cc: Andrzej Hajda Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5591 Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_reset.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c index d5ed904f355d..e6fbc6202c80 100644 --- a/drivers/gpu/drm/i915/gt/intel_reset.c +++ b/drivers/gpu/drm/i915/gt/intel_reset.c @@ -593,10 +593,10 @@ static int gen8_engine_reset_prepare(struct intel_engine_cs *engine) ret = __intel_wait_for_register_fw(uncore, reg, mask, ack, 700, 0, NULL); if (ret) - gt_err(engine->gt, - "%s reset request timed out: {request: %08x, RESET_CTL: %08x}\n", - engine->name, request, - intel_uncore_read_fw(uncore, reg)); + GT_TRACE(engine->gt, + "%s reset request timed out: {request: %08x, RESET_CTL: %08x}\n", + engine->name, request, + intel_uncore_read_fw(uncore, reg)); return ret; }
Re: [PATCH v2] drm/i915/gt: Convert reset prepare failure log to trace
Hi John, On 12/5/2023 10:10 AM, John Harrison wrote: On 12/5/2023 00:52, Nirmoy Das wrote: gen8_engine_reset_prepare() can fail when HW fails to set RESET_CTL_READY_TO_RESET bit. In some cases this is not fatal error as driver will retry. Convert the log to a trace log for debugging without triggering unnecessary concerns in CI or for end-users during non-fatal scenarios. I strongly disagree with this change. The hardware spec for the RESET_CTL and GDRST registers are that they will self clear within a matter of microseconds. If something is so badly wrong with the hardware that it can't even manage to reset This message is for reset readiness poll timeout not that the reset is failed which doesn't sound so serious if the subsequent attempt managed reset the engine. I couldn't get enough details when this can happen that HW takes very long time to set the readiness bit. then that is something that very much warrants more than a completely silent trace event. It most certainly should be flagged as a failure in CI. Just because the driver will retry does not mean that this is not a serious error. And if the first attempt failed, why would a subsequent attempt succeed? The patch is not ignoring the failure. If the subsequent attempt fails then driver load will fail or it will be wedged if that happens after driver load. Escalating to FLR may have more success, but that is not something that i915 currently does. Do we still need to do FLR if a subsequent engine reset failure ? Regards, Nirmoy John. v2: Improve commit message(Tvrtko) Cc: Tvrtko Ursulin Cc: John Harrison Cc: Andi Shyti Cc: Andrzej Hajda Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5591 Signed-off-by: Nirmoy Das Reviewed-by: Andi Shyti Reviewed-by: Andrzej Hajda --- drivers/gpu/drm/i915/gt/intel_reset.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c index d5ed904f355d..e6fbc6202c80 100644 --- a/drivers/gpu/drm/i915/gt/intel_reset.c +++ b/drivers/gpu/drm/i915/gt/intel_reset.c @@ -593,10 +593,10 @@ static int gen8_engine_reset_prepare(struct intel_engine_cs *engine) ret = __intel_wait_for_register_fw(uncore, reg, mask, ack, 700, 0, NULL); if (ret) - gt_err(engine->gt, - "%s reset request timed out: {request: %08x, RESET_CTL: %08x}\n", - engine->name, request, - intel_uncore_read_fw(uncore, reg)); + GT_TRACE(engine->gt, + "%s reset request timed out: {request: %08x, RESET_CTL: %08x}\n", + engine->name, request, + intel_uncore_read_fw(uncore, reg)); return ret; }
[PATCH v2] drm/i915/gt: Convert reset prepare failure log to trace
gen8_engine_reset_prepare() can fail when HW fails to set RESET_CTL_READY_TO_RESET bit. In some cases this is not fatal error as driver will retry. Convert the log to a trace log for debugging without triggering unnecessary concerns in CI or for end-users during non-fatal scenarios. v2: Improve commit message(Tvrtko) Cc: Tvrtko Ursulin Cc: John Harrison Cc: Andi Shyti Cc: Andrzej Hajda Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5591 Signed-off-by: Nirmoy Das Reviewed-by: Andi Shyti Reviewed-by: Andrzej Hajda --- drivers/gpu/drm/i915/gt/intel_reset.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c index d5ed904f355d..e6fbc6202c80 100644 --- a/drivers/gpu/drm/i915/gt/intel_reset.c +++ b/drivers/gpu/drm/i915/gt/intel_reset.c @@ -593,10 +593,10 @@ static int gen8_engine_reset_prepare(struct intel_engine_cs *engine) ret = __intel_wait_for_register_fw(uncore, reg, mask, ack, 700, 0, NULL); if (ret) - gt_err(engine->gt, - "%s reset request timed out: {request: %08x, RESET_CTL: %08x}\n", - engine->name, request, - intel_uncore_read_fw(uncore, reg)); + GT_TRACE(engine->gt, +"%s reset request timed out: {request: %08x, RESET_CTL: %08x}\n", +engine->name, request, +intel_uncore_read_fw(uncore, reg)); return ret; } -- 2.42.0
Re: [Intel-gfx] [PATCH] drm/i915/gt: Reduce log severity on reset prepare.
Hi Tvrtko, On 12/5/2023 9:34 AM, Tvrtko Ursulin wrote: On 01/12/2023 15:44, Nirmoy Das wrote: gen8_engine_reset_prepare() can fail when HW fails to set RESET_CTL_READY_TO_RESET bit. In some cases this is not fatal error as driver will retry. Let the caller of gen8_engine_reset_prepare() decide if a failure in gen8_engine_reset_prepare is an error or not. No complaints per se but I don't see the caller deciding and it is not really reducing log level but converting to trace. So commit message and patch do not align for me which I think should be improved. I meant the return value is checked by the caller, gen8_reset_engines(). I will resend with a improved commit message. Thanks, Nirmoy Regards, Tvrtko Cc: Tvrtko Ursulin Cc: John Harrison Cc: Andi Shyti Cc: Andrzej Hajda Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5591 Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_reset.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c index d5ed904f355d..e6fbc6202c80 100644 --- a/drivers/gpu/drm/i915/gt/intel_reset.c +++ b/drivers/gpu/drm/i915/gt/intel_reset.c @@ -593,10 +593,10 @@ static int gen8_engine_reset_prepare(struct intel_engine_cs *engine) ret = __intel_wait_for_register_fw(uncore, reg, mask, ack, 700, 0, NULL); if (ret) - gt_err(engine->gt, - "%s reset request timed out: {request: %08x, RESET_CTL: %08x}\n", - engine->name, request, - intel_uncore_read_fw(uncore, reg)); + GT_TRACE(engine->gt, + "%s reset request timed out: {request: %08x, RESET_CTL: %08x}\n", + engine->name, request, + intel_uncore_read_fw(uncore, reg)); return ret; }
[PATCH] drm/i915/gt: Reduce log severity on reset prepare.
gen8_engine_reset_prepare() can fail when HW fails to set RESET_CTL_READY_TO_RESET bit. In some cases this is not fatal error as driver will retry. Let the caller of gen8_engine_reset_prepare() decide if a failure in gen8_engine_reset_prepare is an error or not. Cc: Tvrtko Ursulin Cc: John Harrison Cc: Andi Shyti Cc: Andrzej Hajda Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5591 Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_reset.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c index d5ed904f355d..e6fbc6202c80 100644 --- a/drivers/gpu/drm/i915/gt/intel_reset.c +++ b/drivers/gpu/drm/i915/gt/intel_reset.c @@ -593,10 +593,10 @@ static int gen8_engine_reset_prepare(struct intel_engine_cs *engine) ret = __intel_wait_for_register_fw(uncore, reg, mask, ack, 700, 0, NULL); if (ret) - gt_err(engine->gt, - "%s reset request timed out: {request: %08x, RESET_CTL: %08x}\n", - engine->name, request, - intel_uncore_read_fw(uncore, reg)); + GT_TRACE(engine->gt, +"%s reset request timed out: {request: %08x, RESET_CTL: %08x}\n", +engine->name, request, +intel_uncore_read_fw(uncore, reg)); return ret; } -- 2.42.0
Re: [PATCH v3] drm/i915: Flush WC GGTT only on required platforms
This is now merged to gt-next Thanks, Nirmoy On 10/18/2023 4:04 PM, Nirmoy Das wrote: On 10/18/2023 3:00 PM, Andi Shyti wrote: Hi Nirmoy, gen8_ggtt_invalidate() is only needed for limited set of platforms where GGTT is mapped as WC. This was added as way to fix WC based GGTT in commit 0f9b91c754b7 ("drm/i915: flush system agent TLBs on SNB") and there are no reference in HW docs that forces us to use this on non-WC backed GGTT. This can also cause unwanted side-effects on XE_HP platforms where GFX_FLSH_CNTL_GEN6 is not valid anymore. v2: Add a func to detect wc ggtt detection (Ville) v3: Improve commit log and add reference commit (Daniel) Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement") I'm wondering if this is the right Fixes, though. Should this rather be: Fixes: 6266992cf105 ("drm/i915/gt: remove GRAPHICS_VER == 10") Hard to find a real Fixes for this. I just want to backport this to dg2 where we can have unwanted side-effects. yes, this piece of code has moved around enough so to make it diffuclt to track its origin. I think the one I found should be the correct one, That just removes a graphics ver, not related to WC GGTT map or XE_HP. but the dg2 force probe removeal can also become a placeholder for DG2 fixes. Yes, I have no better ideas too. Regards, Nirmoy I won't complain. Andi
Re: [Intel-gfx] [PATCH v2 1/5] drm/i915/guc: Create the guc_to_i915() wrapper
On 10/25/2023 4:53 PM, Andi Shyti wrote: Hi Nirmoy, +static inline struct drm_i915_private *guc_to_i915(struct intel_guc *guc) +{ + return guc_to_gt(guc)->i915; +} + We don't want inline functions in headers files[1]. Otherwise the series looks fine to me: the reason for that patch is that we were including the i915_drv.h just for that inline function and we were doing it inside the gt/. In this patch I am not changing any header dependency. Sounds fair, thanks, Nirmoy I guess the original idea from Matt was to have a generic network of intel_gt_needs_wa_xxx(), but it didn't develop further. Reviewed-by: Nirmoy Das Thanks, Andi
Re: [Intel-gfx] [PATCH v2 1/5] drm/i915/guc: Create the guc_to_i915() wrapper
Hi Andi, On 10/25/2023 4:35 PM, Andi Shyti wrote: Given a reference to "guc", the guc_to_i915() returns the pointer to "i915" private data. Signed-off-by: Andi Shyti --- drivers/gpu/drm/i915/gt/intel_gt.h| 5 + drivers/gpu/drm/i915/gt/uc/intel_guc.c| 2 +- drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c| 2 +- drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c | 2 +- drivers/gpu/drm/i915/gt/uc/intel_guc_log.c| 10 +- drivers/gpu/drm/i915/gt/uc/intel_guc_rc.c | 2 +- drivers/gpu/drm/i915/gt/uc/intel_guc_slpc.c | 2 +- drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ++-- 8 files changed, 17 insertions(+), 12 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_gt.h b/drivers/gpu/drm/i915/gt/intel_gt.h index 970bedf6b78a..12a638f05d63 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.h +++ b/drivers/gpu/drm/i915/gt/intel_gt.h @@ -114,6 +114,11 @@ static inline struct intel_gt *gsc_to_gt(struct intel_gsc *gsc) return container_of(gsc, struct intel_gt, gsc); } +static inline struct drm_i915_private *guc_to_i915(struct intel_guc *guc) +{ + return guc_to_gt(guc)->i915; +} + We don't want inline functions in headers files[1]. Otherwise the series looks fine to me: Reviewed-by: Nirmoy Das [1] https://patchwork.freedesktop.org/series/124069/ Regards, Nirmoy void intel_gt_common_init_early(struct intel_gt *gt); int intel_root_gt_init_early(struct drm_i915_private *i915); int intel_gt_assign_ggtt(struct intel_gt *gt); diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc.c b/drivers/gpu/drm/i915/gt/uc/intel_guc.c index 3f3df1166b86..2b450c43bbd7 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc.c @@ -330,7 +330,7 @@ static u32 guc_ctl_wa_flags(struct intel_guc *guc) static u32 guc_ctl_devid(struct intel_guc *guc) { - struct drm_i915_private *i915 = guc_to_gt(guc)->i915; + struct drm_i915_private *i915 = guc_to_i915(guc); return (INTEL_DEVID(i915) << 16) | INTEL_REVID(i915); } diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c index a4da0208c883..a1cd40d80517 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c @@ -355,7 +355,7 @@ guc_capture_alloc_steered_lists(struct intel_guc *guc, static const struct __guc_mmio_reg_descr_group * guc_capture_get_device_reglist(struct intel_guc *guc) { - struct drm_i915_private *i915 = guc_to_gt(guc)->i915; + struct drm_i915_private *i915 = guc_to_i915(guc); const struct __guc_mmio_reg_descr_group *lists; if (GRAPHICS_VER(i915) >= 12) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c index 89e314b3756b..4e147de5118f 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c @@ -265,7 +265,7 @@ int intel_guc_ct_init(struct intel_guc_ct *ct) u32 *cmds; int err; - err = i915_inject_probe_error(guc_to_gt(guc)->i915, -ENXIO); + err = i915_inject_probe_error(guc_to_i915(guc), -ENXIO); if (err) return err; diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c index 55bc8b55fbc0..bf16351c9349 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c @@ -520,7 +520,7 @@ void intel_guc_log_init_early(struct intel_guc_log *log) static int guc_log_relay_create(struct intel_guc_log *log) { struct intel_guc *guc = log_to_guc(log); - struct drm_i915_private *i915 = guc_to_gt(guc)->i915; + struct drm_i915_private *i915 = guc_to_i915(guc); struct rchan *guc_log_relay_chan; size_t n_subbufs, subbuf_size; int ret; @@ -573,7 +573,7 @@ static void guc_log_relay_destroy(struct intel_guc_log *log) static void guc_log_copy_debuglogs_for_relay(struct intel_guc_log *log) { struct intel_guc *guc = log_to_guc(log); - struct drm_i915_private *i915 = guc_to_gt(guc)->i915; + struct drm_i915_private *i915 = guc_to_i915(guc); intel_wakeref_t wakeref; _guc_log_copy_debuglogs_for_relay(log); @@ -589,7 +589,7 @@ static void guc_log_copy_debuglogs_for_relay(struct intel_guc_log *log) static u32 __get_default_log_level(struct intel_guc_log *log) { struct intel_guc *guc = log_to_guc(log); - struct drm_i915_private *i915 = guc_to_gt(guc)->i915; + struct drm_i915_private *i915 = guc_to_i915(guc); /* A negative value means "use platform/config default" */ if (i915->params.guc_log_level < 0) { @@ -664,7 +664,7 @@ void intel_guc_log_destroy(struct intel_guc_log *log) int intel_guc_log_set_level(struct intel_guc_log *log, u32 level)
[PATCH] drm/i915/gt: Use proper priority enum instead of 0
I915_PRIORITY_NORMAL is 0 so use that instead for better readability. Cc: John Harrison Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c index 9a527e1f5be6..1a8e2b7db013 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c @@ -188,7 +188,7 @@ static void heartbeat(struct work_struct *wrk) * low latency and no jitter] the chance to naturally * complete before being preempted. */ - attr.priority = 0; + attr.priority = I915_PRIORITY_NORMAL; if (rq->sched.attr.priority >= attr.priority) attr.priority = I915_PRIORITY_HEARTBEAT; if (rq->sched.attr.priority >= attr.priority) -- 2.41.0
Re: [PATCH v3] drm/i915: Flush WC GGTT only on required platforms
On 10/18/2023 3:00 PM, Andi Shyti wrote: Hi Nirmoy, gen8_ggtt_invalidate() is only needed for limited set of platforms where GGTT is mapped as WC. This was added as way to fix WC based GGTT in commit 0f9b91c754b7 ("drm/i915: flush system agent TLBs on SNB") and there are no reference in HW docs that forces us to use this on non-WC backed GGTT. This can also cause unwanted side-effects on XE_HP platforms where GFX_FLSH_CNTL_GEN6 is not valid anymore. v2: Add a func to detect wc ggtt detection (Ville) v3: Improve commit log and add reference commit (Daniel) Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement") I'm wondering if this is the right Fixes, though. Should this rather be: Fixes: 6266992cf105 ("drm/i915/gt: remove GRAPHICS_VER == 10") Hard to find a real Fixes for this. I just want to backport this to dg2 where we can have unwanted side-effects. yes, this piece of code has moved around enough so to make it diffuclt to track its origin. I think the one I found should be the correct one, That just removes a graphics ver, not related to WC GGTT map or XE_HP. but the dg2 force probe removeal can also become a placeholder for DG2 fixes. Yes, I have no better ideas too. Regards, Nirmoy I won't complain. Andi
Re: [PATCH v3] drm/i915: Flush WC GGTT only on required platforms
Hi Andi, On 10/18/2023 1:49 PM, Andi Shyti wrote: Hi Nirmoy, On Wed, Oct 18, 2023 at 11:38:15AM +0200, Nirmoy Das wrote: gen8_ggtt_invalidate() is only needed for limited set of platforms where GGTT is mapped as WC. This was added as way to fix WC based GGTT in commit 0f9b91c754b7 ("drm/i915: flush system agent TLBs on SNB") and there are no reference in HW docs that forces us to use this on non-WC backed GGTT. This can also cause unwanted side-effects on XE_HP platforms where GFX_FLSH_CNTL_GEN6 is not valid anymore. v2: Add a func to detect wc ggtt detection (Ville) v3: Improve commit log and add reference commit (Daniel) Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement") I'm wondering if this is the right Fixes, though. Should this rather be: Fixes: 6266992cf105 ("drm/i915/gt: remove GRAPHICS_VER == 10") Hard to find a real Fixes for this. I just want to backport this to dg2 where we can have unwanted side-effects. Regards, Nirmoy ? Andi
Re: [PATCH v2] drm/i915: Prevent potential null-ptr-deref in engine_init_common
This now merged. CI errors are unrelated. On 10/11/2023 2:25 PM, Nirmoy Das wrote: If measure_breadcrumb_dw() returns an error and bce isn't created, this commit ensures that intel_engine_destroy_pinned_context() is not called with a NULL bce. v2: Fix the subject s/UAF/null-ptr-deref(Jani) Fixes: b35274993680 ("drm/i915: Create a kernel context for GGTT updates") Cc: Oak Zeng Cc: Andi Shyti Cc: Jani Nikula Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_engine_cs.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c index 179d9546865b..4a11219e560e 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c @@ -1491,7 +1491,8 @@ static int engine_init_common(struct intel_engine_cs *engine) return 0; err_bce_context: - intel_engine_destroy_pinned_context(bce); + if (bce) + intel_engine_destroy_pinned_context(bce); err_ce_context: intel_engine_destroy_pinned_context(ce); return ret;
[PATCH v3] drm/i915: Flush WC GGTT only on required platforms
gen8_ggtt_invalidate() is only needed for limited set of platforms where GGTT is mapped as WC. This was added as way to fix WC based GGTT in commit 0f9b91c754b7 ("drm/i915: flush system agent TLBs on SNB") and there are no reference in HW docs that forces us to use this on non-WC backed GGTT. This can also cause unwanted side-effects on XE_HP platforms where GFX_FLSH_CNTL_GEN6 is not valid anymore. v2: Add a func to detect wc ggtt detection (Ville) v3: Improve commit log and add reference commit (Daniel) Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement") Cc: Rodrigo Vivi Cc: Tvrtko Ursulin Cc: Joonas Lahtinen Cc: Jani Nikula Cc: Jonathan Cavitt Cc: John Harrison Cc: Andi Shyti Cc: Ville Syrjälä Cc: Daniel Vetter Cc: # v6.2+ Suggested-by: Matt Roper Signed-off-by: Nirmoy Das Reviewed-by: Matt Roper Reviewed-by: Andi Shyti --- drivers/gpu/drm/i915/gt/intel_ggtt.c | 35 +++- 1 file changed, 24 insertions(+), 11 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c b/drivers/gpu/drm/i915/gt/intel_ggtt.c index 1c93e84278a0..15fc8e4703f4 100644 --- a/drivers/gpu/drm/i915/gt/intel_ggtt.c +++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c @@ -195,6 +195,21 @@ void gen6_ggtt_invalidate(struct i915_ggtt *ggtt) spin_unlock_irq(>lock); } +static bool needs_wc_ggtt_mapping(struct drm_i915_private *i915) +{ + /* +* On BXT+/ICL+ writes larger than 64 bit to the GTT pagetable range +* will be dropped. For WC mappings in general we have 64 byte burst +* writes when the WC buffer is flushed, so we can't use it, but have to +* resort to an uncached mapping. The WC issue is easily caught by the +* readback check when writing GTT PTE entries. +*/ + if (!IS_GEN9_LP(i915) && GRAPHICS_VER(i915) < 11) + return true; + + return false; +} + static void gen8_ggtt_invalidate(struct i915_ggtt *ggtt) { struct intel_uncore *uncore = ggtt->vm.gt->uncore; @@ -202,8 +217,12 @@ static void gen8_ggtt_invalidate(struct i915_ggtt *ggtt) /* * Note that as an uncached mmio write, this will flush the * WCB of the writes into the GGTT before it triggers the invalidate. +* +* Only perform this when GGTT is mapped as WC, see ggtt_probe_common(). */ - intel_uncore_write_fw(uncore, GFX_FLSH_CNTL_GEN6, GFX_FLSH_CNTL_EN); + if (needs_wc_ggtt_mapping(ggtt->vm.i915)) + intel_uncore_write_fw(uncore, GFX_FLSH_CNTL_GEN6, + GFX_FLSH_CNTL_EN); } static void guc_ggtt_ct_invalidate(struct intel_gt *gt) @@ -1140,17 +1159,11 @@ static int ggtt_probe_common(struct i915_ggtt *ggtt, u64 size) GEM_WARN_ON(pci_resource_len(pdev, GEN4_GTTMMADR_BAR) != gen6_gttmmadr_size(i915)); phys_addr = pci_resource_start(pdev, GEN4_GTTMMADR_BAR) + gen6_gttadr_offset(i915); - /* -* On BXT+/ICL+ writes larger than 64 bit to the GTT pagetable range -* will be dropped. For WC mappings in general we have 64 byte burst -* writes when the WC buffer is flushed, so we can't use it, but have to -* resort to an uncached mapping. The WC issue is easily caught by the -* readback check when writing GTT PTE entries. -*/ - if (IS_GEN9_LP(i915) || GRAPHICS_VER(i915) >= 11) - ggtt->gsm = ioremap(phys_addr, size); - else + if (needs_wc_ggtt_mapping(i915)) ggtt->gsm = ioremap_wc(phys_addr, size); + else + ggtt->gsm = ioremap(phys_addr, size); + if (!ggtt->gsm) { drm_err(>drm, "Failed to map the ggtt page table\n"); return -ENOMEM; -- 2.41.0
Re: [Intel-gfx] [PATCH] drm/i915/mtl: Don't set PIPE_CONTROL_FLUSH_L3
On 10/17/2023 4:42 PM, Andi Shyti wrote: Hi Vinay, This bit does not cause an explicit L3 flush. We already use At all? Or only on newer hardware? And as a genuine spec change or as a bug / workaround? If the hardware has re-purposed the bit then it is probably worth at least adding a comment to the bit definition to say that it is only valid up to IP version 12.70. At this point, this is a bug on MTL since this bit is not related to L3 flushes as per spec. Regarding older platforms, still checking the reason why this was added (i.e if it fixed something and will regress if removed). If not, we can extend the change for others as well in a separate patch. On older platforms, this bit seems to cause an implicit flush at best. PIPE_CONTROL_DC_FLUSH_ENABLE for that purpose. Cc: Nirmoy Das Cc: Mikka Kuoppala Signed-off-by: Vinay Belgaumkar --- drivers/gpu/drm/i915/gt/gen8_engine_cs.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/gen8_engine_cs.c b/drivers/gpu/drm/i915/gt/gen8_engine_cs.c index ba4c2422b340..abbc02f3e66e 100644 --- a/drivers/gpu/drm/i915/gt/gen8_engine_cs.c +++ b/drivers/gpu/drm/i915/gt/gen8_engine_cs.c @@ -247,6 +247,7 @@ static int mtl_dummy_pipe_control(struct i915_request *rq) int gen12_emit_flush_rcs(struct i915_request *rq, u32 mode) { struct intel_engine_cs *engine = rq->engine; + struct intel_gt *gt = rq->engine->gt; /* * On Aux CCS platforms the invalidation of the Aux @@ -278,7 +279,8 @@ int gen12_emit_flush_rcs(struct i915_request *rq, u32 mode) * deals with Protected Memory which is not needed for * AUX CCS invalidation and lead to unwanted side effects. */ - if (mode & EMIT_FLUSH) + if ((mode & EMIT_FLUSH) && + !(IS_GFX_GT_IP_RANGE(gt, IP_VER(12, 70), IP_VER(12, 71 Why stop at 12.71? Is the meaning only changed for 12.70 and the old/correct version will be restored in later hardware? Was trying to keep this limited to MTL for now until the above statements are verified. I'm not fully conviced here... this is not what the hardware spec says. Am I reading the specs wrong? The main issue is we are using side-effects of that bit as to flush L3 but that is not it's primary task. Unless there is a WA specially mentioned for MTL to use that bit to flush L3, I see no reason to use on MTL or further. Regards, Nirmoy Is there any ongoing discussion with the hardware developers? Andi
Re: [Intel-gfx] [PATCH] drm/i915/mtl: Don't set PIPE_CONTROL_FLUSH_L3
On 10/17/2023 2:23 AM, Belgaumkar, Vinay wrote: On 10/16/2023 4:24 PM, John Harrison wrote: On 10/16/2023 15:55, Vinay Belgaumkar wrote: This bit does not cause an explicit L3 flush. We already use At all? Or only on newer hardware? And as a genuine spec change or as a bug / workaround? If the hardware has re-purposed the bit then it is probably worth at least adding a comment to the bit definition to say that it is only valid up to IP version 12.70. At this point, this is a bug on MTL since this bit is not related to L3 flushes as per spec. Regarding older platforms, still checking the reason why this was added (i.e if it fixed something and will regress if removed). If not, we can extend the change for others as well in a separate patch. On older platforms, this bit seems to cause an implicit flush at best. PIPE_CONTROL_DC_FLUSH_ENABLE for that purpose. Cc: Nirmoy Das Cc: Mikka Kuoppala Signed-off-by: Vinay Belgaumkar --- drivers/gpu/drm/i915/gt/gen8_engine_cs.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/gen8_engine_cs.c b/drivers/gpu/drm/i915/gt/gen8_engine_cs.c index ba4c2422b340..abbc02f3e66e 100644 --- a/drivers/gpu/drm/i915/gt/gen8_engine_cs.c +++ b/drivers/gpu/drm/i915/gt/gen8_engine_cs.c @@ -247,6 +247,7 @@ static int mtl_dummy_pipe_control(struct i915_request *rq) int gen12_emit_flush_rcs(struct i915_request *rq, u32 mode) { struct intel_engine_cs *engine = rq->engine; + struct intel_gt *gt = rq->engine->gt; /* * On Aux CCS platforms the invalidation of the Aux @@ -278,7 +279,8 @@ int gen12_emit_flush_rcs(struct i915_request *rq, u32 mode) * deals with Protected Memory which is not needed for * AUX CCS invalidation and lead to unwanted side effects. */ - if (mode & EMIT_FLUSH) + if ((mode & EMIT_FLUSH) && + !(IS_GFX_GT_IP_RANGE(gt, IP_VER(12, 70), IP_VER(12, 71 Why stop at 12.71? Is the meaning only changed for 12.70 and the old/correct version will be restored in later hardware? Was trying to keep this limited to MTL for now until the above statements are verified. If we don't need it for MTL, we are most likely not going to need it after MTL too. It should be fine if you make it >= MTL until we get more clarity about lower platforms. With that, this is Reviewed-by: Nirmoy Das Thanks, Vinay. John. bit_group_1 |= PIPE_CONTROL_FLUSH_L3; bit_group_1 |= PIPE_CONTROL_TILE_CACHE_FLUSH; @@ -812,12 +814,14 @@ u32 *gen12_emit_fini_breadcrumb_rcs(struct i915_request *rq, u32 *cs) u32 flags = (PIPE_CONTROL_CS_STALL | PIPE_CONTROL_TLB_INVALIDATE | PIPE_CONTROL_TILE_CACHE_FLUSH | - PIPE_CONTROL_FLUSH_L3 | PIPE_CONTROL_RENDER_TARGET_CACHE_FLUSH | PIPE_CONTROL_DEPTH_CACHE_FLUSH | PIPE_CONTROL_DC_FLUSH_ENABLE | PIPE_CONTROL_FLUSH_ENABLE); + if (!(IS_GFX_GT_IP_RANGE(gt, IP_VER(12, 70), IP_VER(12, 71 + flags |= PIPE_CONTROL_FLUSH_L3; + /* Wa_14016712196 */ if (IS_GFX_GT_IP_RANGE(gt, IP_VER(12, 70), IP_VER(12, 71)) || IS_DG2(i915)) /* dummy PIPE_CONTROL + depth flush */
Re: [PATCH v2] drm/i915: Flush WC GGTT only on required platforms
Hi Andi, On 10/14/2023 10:51 AM, Andi Shyti wrote: Hi Nirmoy, On Fri, Oct 13, 2023 at 03:44:39PM +0200, Nirmoy Das wrote: gen8_ggtt_invalidate() is only needed for limited set of platforms where GGTT is mapped as WC otherwise this can cause unwanted side-effects on XE_HP platforms where GFX_FLSH_CNTL_GEN6 is not valid. v2: Add a func to detect wc ggtt detection (Ville) Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement") Cc: Rodrigo Vivi Cc: Tvrtko Ursulin Cc: Joonas Lahtinen Cc: Jani Nikula Cc: Jonathan Cavitt Cc: John Harrison Cc: Andi Shyti Cc: Ville Syrjälä Cc: # v6.2+ Suggested-by: Matt Roper Signed-off-by: Nirmoy Das Acked-by: Andi Shyti I took some time to look at this and you can swap the a-b with an r-b: Reviewed-by: Andi Shyti Thanks! Going to resend one more rev with commit that is started using this register for WC mapping. Regards, Nirmoy Thanks, Andi
Re: [PATCH] drm/i915: Flush WC GGTT only on required platforms
On 10/13/2023 6:15 PM, Daniel Vetter wrote: On Fri, Oct 13, 2023 at 02:28:21PM +0200, Nirmoy Das wrote: Hi Ville, On 10/13/2023 12:50 PM, Ville Syrjälä wrote: On Fri, Oct 13, 2023 at 12:31:40PM +0200, Nirmoy Das wrote: gen8_ggtt_invalidate() is only needed for limitted set of platforms where GGTT is mapped as WC I know there is supposed to be some kind hw snooping of the ggtt pte writes to invalidate the tlb, but are we sure GFX_FLSH_CNTL has no other side effects we depend on? I spent some time searching through the gfxspec. This GFX_FLSH_CNTL register only seems to be for invalidating TLB for GUnit and (from git log ) we started to do that to enable WC based GGTT updates. Might be good to cite the relevant git commits in the commit message to make this clear. Yes, I should. It took me a while to find it. Going to add that and resend the patch. Thanks, Nirmoy -Sima So if I am not missing anything obvious then this should be safe. Regards, Nirmoy otherwise this can cause unwanted side-effects on XE_HP platforms where GFX_FLSH_CNTL_GEN6 is not valid. Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement") Cc: Rodrigo Vivi Cc: Tvrtko Ursulin Cc: Joonas Lahtinen Cc: Jani Nikula Cc: Jonathan Cavitt Cc: John Harrison Cc: Andi Shyti Cc: # v6.2+ Suggested-by: Matt Roper Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_ggtt.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c b/drivers/gpu/drm/i915/gt/intel_ggtt.c index 4d7d88b92632..c2858d434bce 100644 --- a/drivers/gpu/drm/i915/gt/intel_ggtt.c +++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c @@ -197,13 +197,17 @@ void gen6_ggtt_invalidate(struct i915_ggtt *ggtt) static void gen8_ggtt_invalidate(struct i915_ggtt *ggtt) { + struct drm_i915_private *i915 = ggtt->vm.i915; struct intel_uncore *uncore = ggtt->vm.gt->uncore; /* * Note that as an uncached mmio write, this will flush the * WCB of the writes into the GGTT before it triggers the invalidate. +* +* Only perform this when GGTT is mapped as WC, see ggtt_probe_common(). */ - intel_uncore_write_fw(uncore, GFX_FLSH_CNTL_GEN6, GFX_FLSH_CNTL_EN); + if (!IS_GEN9_LP(i915) && GRAPHICS_VER(i915) < 11) + intel_uncore_write_fw(uncore, GFX_FLSH_CNTL_GEN6, GFX_FLSH_CNTL_EN); } static void guc_ggtt_invalidate(struct i915_ggtt *ggtt) -- 2.41.0
[PATCH v2] drm/i915: Flush WC GGTT only on required platforms
gen8_ggtt_invalidate() is only needed for limited set of platforms where GGTT is mapped as WC otherwise this can cause unwanted side-effects on XE_HP platforms where GFX_FLSH_CNTL_GEN6 is not valid. v2: Add a func to detect wc ggtt detection (Ville) Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement") Cc: Rodrigo Vivi Cc: Tvrtko Ursulin Cc: Joonas Lahtinen Cc: Jani Nikula Cc: Jonathan Cavitt Cc: John Harrison Cc: Andi Shyti Cc: Ville Syrjälä Cc: # v6.2+ Suggested-by: Matt Roper Signed-off-by: Nirmoy Das Acked-by: Andi Shyti --- drivers/gpu/drm/i915/gt/intel_ggtt.c | 35 +++- 1 file changed, 24 insertions(+), 11 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c b/drivers/gpu/drm/i915/gt/intel_ggtt.c index 4d7d88b92632..401667f83f96 100644 --- a/drivers/gpu/drm/i915/gt/intel_ggtt.c +++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c @@ -195,6 +195,21 @@ void gen6_ggtt_invalidate(struct i915_ggtt *ggtt) spin_unlock_irq(>lock); } +static bool needs_wc_ggtt_mapping(struct drm_i915_private *i915) +{ + /* +* On BXT+/ICL+ writes larger than 64 bit to the GTT pagetable range +* will be dropped. For WC mappings in general we have 64 byte burst +* writes when the WC buffer is flushed, so we can't use it, but have to +* resort to an uncached mapping. The WC issue is easily caught by the +* readback check when writing GTT PTE entries. +*/ + if (!IS_GEN9_LP(i915) && GRAPHICS_VER(i915) < 11) + return true; + + return false; +} + static void gen8_ggtt_invalidate(struct i915_ggtt *ggtt) { struct intel_uncore *uncore = ggtt->vm.gt->uncore; @@ -202,8 +217,12 @@ static void gen8_ggtt_invalidate(struct i915_ggtt *ggtt) /* * Note that as an uncached mmio write, this will flush the * WCB of the writes into the GGTT before it triggers the invalidate. +* +* Only perform this when GGTT is mapped as WC, see ggtt_probe_common(). */ - intel_uncore_write_fw(uncore, GFX_FLSH_CNTL_GEN6, GFX_FLSH_CNTL_EN); + if (needs_wc_ggtt_mapping(ggtt->vm.i915)) + intel_uncore_write_fw(uncore, GFX_FLSH_CNTL_GEN6, + GFX_FLSH_CNTL_EN); } static void guc_ggtt_invalidate(struct i915_ggtt *ggtt) @@ -1126,17 +1145,11 @@ static int ggtt_probe_common(struct i915_ggtt *ggtt, u64 size) GEM_WARN_ON(pci_resource_len(pdev, GEN4_GTTMMADR_BAR) != gen6_gttmmadr_size(i915)); phys_addr = pci_resource_start(pdev, GEN4_GTTMMADR_BAR) + gen6_gttadr_offset(i915); - /* -* On BXT+/ICL+ writes larger than 64 bit to the GTT pagetable range -* will be dropped. For WC mappings in general we have 64 byte burst -* writes when the WC buffer is flushed, so we can't use it, but have to -* resort to an uncached mapping. The WC issue is easily caught by the -* readback check when writing GTT PTE entries. -*/ - if (IS_GEN9_LP(i915) || GRAPHICS_VER(i915) >= 11) - ggtt->gsm = ioremap(phys_addr, size); - else + if (needs_wc_ggtt_mapping(i915)) ggtt->gsm = ioremap_wc(phys_addr, size); + else + ggtt->gsm = ioremap(phys_addr, size); + if (!ggtt->gsm) { drm_err(>drm, "Failed to map the ggtt page table\n"); return -ENOMEM; -- 2.41.0
Re: [PATCH] drm/i915: Flush WC GGTT only on required platforms
On 10/13/2023 2:55 PM, Ville Syrjälä wrote: On Fri, Oct 13, 2023 at 02:28:21PM +0200, Nirmoy Das wrote: Hi Ville, On 10/13/2023 12:50 PM, Ville Syrjälä wrote: On Fri, Oct 13, 2023 at 12:31:40PM +0200, Nirmoy Das wrote: gen8_ggtt_invalidate() is only needed for limitted set of platforms where GGTT is mapped as WC I know there is supposed to be some kind hw snooping of the ggtt pte writes to invalidate the tlb, but are we sure GFX_FLSH_CNTL has no other side effects we depend on? I spent some time searching through the gfxspec. This GFX_FLSH_CNTL register only seems to be for invalidating TLB for GUnit and (from git log ) we started to do that to enable WC based GGTT updates. So if I am not missing anything obvious then this should be safe. OK. The only code related complaint I have is that you are now duplicating that same platform check in two different places. It's always better to have a single point of truth instead of two or more, so that there is no risk of introducing bugs due to mismatches. I agree. I will resend with a static helper function to detect that. Thanks, Nirmoy Regards, Nirmoy otherwise this can cause unwanted side-effects on XE_HP platforms where GFX_FLSH_CNTL_GEN6 is not valid. Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement") Cc: Rodrigo Vivi Cc: Tvrtko Ursulin Cc: Joonas Lahtinen Cc: Jani Nikula Cc: Jonathan Cavitt Cc: John Harrison Cc: Andi Shyti Cc: # v6.2+ Suggested-by: Matt Roper Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_ggtt.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c b/drivers/gpu/drm/i915/gt/intel_ggtt.c index 4d7d88b92632..c2858d434bce 100644 --- a/drivers/gpu/drm/i915/gt/intel_ggtt.c +++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c @@ -197,13 +197,17 @@ void gen6_ggtt_invalidate(struct i915_ggtt *ggtt) static void gen8_ggtt_invalidate(struct i915_ggtt *ggtt) { + struct drm_i915_private *i915 = ggtt->vm.i915; struct intel_uncore *uncore = ggtt->vm.gt->uncore; /* * Note that as an uncached mmio write, this will flush the * WCB of the writes into the GGTT before it triggers the invalidate. +* +* Only perform this when GGTT is mapped as WC, see ggtt_probe_common(). */ - intel_uncore_write_fw(uncore, GFX_FLSH_CNTL_GEN6, GFX_FLSH_CNTL_EN); + if (!IS_GEN9_LP(i915) && GRAPHICS_VER(i915) < 11) + intel_uncore_write_fw(uncore, GFX_FLSH_CNTL_GEN6, GFX_FLSH_CNTL_EN); } static void guc_ggtt_invalidate(struct i915_ggtt *ggtt) -- 2.41.0
Re: [PATCH] drm/i915: Flush WC GGTT only on required platforms
Hi Ville, On 10/13/2023 12:50 PM, Ville Syrjälä wrote: On Fri, Oct 13, 2023 at 12:31:40PM +0200, Nirmoy Das wrote: gen8_ggtt_invalidate() is only needed for limitted set of platforms where GGTT is mapped as WC I know there is supposed to be some kind hw snooping of the ggtt pte writes to invalidate the tlb, but are we sure GFX_FLSH_CNTL has no other side effects we depend on? I spent some time searching through the gfxspec. This GFX_FLSH_CNTL register only seems to be for invalidating TLB for GUnit and (from git log ) we started to do that to enable WC based GGTT updates. So if I am not missing anything obvious then this should be safe. Regards, Nirmoy otherwise this can cause unwanted side-effects on XE_HP platforms where GFX_FLSH_CNTL_GEN6 is not valid. Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement") Cc: Rodrigo Vivi Cc: Tvrtko Ursulin Cc: Joonas Lahtinen Cc: Jani Nikula Cc: Jonathan Cavitt Cc: John Harrison Cc: Andi Shyti Cc: # v6.2+ Suggested-by: Matt Roper Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_ggtt.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c b/drivers/gpu/drm/i915/gt/intel_ggtt.c index 4d7d88b92632..c2858d434bce 100644 --- a/drivers/gpu/drm/i915/gt/intel_ggtt.c +++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c @@ -197,13 +197,17 @@ void gen6_ggtt_invalidate(struct i915_ggtt *ggtt) static void gen8_ggtt_invalidate(struct i915_ggtt *ggtt) { + struct drm_i915_private *i915 = ggtt->vm.i915; struct intel_uncore *uncore = ggtt->vm.gt->uncore; /* * Note that as an uncached mmio write, this will flush the * WCB of the writes into the GGTT before it triggers the invalidate. +* +* Only perform this when GGTT is mapped as WC, see ggtt_probe_common(). */ - intel_uncore_write_fw(uncore, GFX_FLSH_CNTL_GEN6, GFX_FLSH_CNTL_EN); + if (!IS_GEN9_LP(i915) && GRAPHICS_VER(i915) < 11) + intel_uncore_write_fw(uncore, GFX_FLSH_CNTL_GEN6, GFX_FLSH_CNTL_EN); } static void guc_ggtt_invalidate(struct i915_ggtt *ggtt) -- 2.41.0
Re: [PATCH] drm/i915: Flush WC GGTT only on required platforms
On 10/13/2023 12:35 PM, Andi Shyti wrote: Hi Nirmoy, On Fri, Oct 13, 2023 at 12:31:40PM +0200, Nirmoy Das wrote: gen8_ggtt_invalidate() is only needed for limitted set of platforms /limitted/limited/ Added " autocmd FileType gitcommit setlocal spell" to my vim config. Wish I knew about it before. where GGTT is mapped as WC otherwise this can cause unwanted side-effects on XE_HP platforms where GFX_FLSH_CNTL_GEN6 is not valid. Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement") Cc: Rodrigo Vivi Cc: Tvrtko Ursulin Cc: Joonas Lahtinen Cc: Jani Nikula Cc: Jonathan Cavitt Cc: John Harrison Cc: Andi Shyti Cc: # v6.2+ Suggested-by: Matt Roper Signed-off-by: Nirmoy Das Acked-by: Andi Shyti Thanks, Nirmoy Andi
[PATCH] drm/i915: Flush WC GGTT only on required platforms
gen8_ggtt_invalidate() is only needed for limitted set of platforms where GGTT is mapped as WC otherwise this can cause unwanted side-effects on XE_HP platforms where GFX_FLSH_CNTL_GEN6 is not valid. Fixes: d2eae8e98d59 ("drm/i915/dg2: Drop force_probe requirement") Cc: Rodrigo Vivi Cc: Tvrtko Ursulin Cc: Joonas Lahtinen Cc: Jani Nikula Cc: Jonathan Cavitt Cc: John Harrison Cc: Andi Shyti Cc: # v6.2+ Suggested-by: Matt Roper Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_ggtt.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/gt/intel_ggtt.c b/drivers/gpu/drm/i915/gt/intel_ggtt.c index 4d7d88b92632..c2858d434bce 100644 --- a/drivers/gpu/drm/i915/gt/intel_ggtt.c +++ b/drivers/gpu/drm/i915/gt/intel_ggtt.c @@ -197,13 +197,17 @@ void gen6_ggtt_invalidate(struct i915_ggtt *ggtt) static void gen8_ggtt_invalidate(struct i915_ggtt *ggtt) { + struct drm_i915_private *i915 = ggtt->vm.i915; struct intel_uncore *uncore = ggtt->vm.gt->uncore; /* * Note that as an uncached mmio write, this will flush the * WCB of the writes into the GGTT before it triggers the invalidate. +* +* Only perform this when GGTT is mapped as WC, see ggtt_probe_common(). */ - intel_uncore_write_fw(uncore, GFX_FLSH_CNTL_GEN6, GFX_FLSH_CNTL_EN); + if (!IS_GEN9_LP(i915) && GRAPHICS_VER(i915) < 11) + intel_uncore_write_fw(uncore, GFX_FLSH_CNTL_GEN6, GFX_FLSH_CNTL_EN); } static void guc_ggtt_invalidate(struct i915_ggtt *ggtt) -- 2.41.0
[PATCH v2] drm/i915: Prevent potential null-ptr-deref in engine_init_common
If measure_breadcrumb_dw() returns an error and bce isn't created, this commit ensures that intel_engine_destroy_pinned_context() is not called with a NULL bce. v2: Fix the subject s/UAF/null-ptr-deref(Jani) Fixes: b35274993680 ("drm/i915: Create a kernel context for GGTT updates") Cc: Oak Zeng Cc: Andi Shyti Cc: Jani Nikula Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_engine_cs.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c index 179d9546865b..4a11219e560e 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c @@ -1491,7 +1491,8 @@ static int engine_init_common(struct intel_engine_cs *engine) return 0; err_bce_context: - intel_engine_destroy_pinned_context(bce); + if (bce) + intel_engine_destroy_pinned_context(bce); err_ce_context: intel_engine_destroy_pinned_context(ce); return ret; -- 2.41.0
Re: [PATCH] drm/i915: Prevent potential UAF in engine_init_common
Hi Andi, On 10/11/2023 2:22 PM, Andi Shyti wrote: Hi Nirmoy, On Wed, Oct 11, 2023 at 01:54:51PM +0200, Nirmoy Das wrote: If measure_breadcrumb_dw() returns an error and bce isn't created, this commit ensures that intel_engine_destroy_pinned_context() is not called with a NULL bce. Fixes: b35274993680 ("drm/i915: Create a kernel context for GGTT updates") Cc: Oak Zeng Cc: Andi Shyti Signed-off-by: Nirmoy Das Reviewed-by: Andi Shyti Resent with a fixed subject. Please check again. Thanks, Nirmoy Andi
Re: [PATCH] drm/i915: Prevent potential UAF in engine_init_common
On 10/11/2023 2:20 PM, Jani Nikula wrote: On Wed, 11 Oct 2023, Nirmoy Das wrote: If measure_breadcrumb_dw() returns an error and bce isn't created, this commit ensures that intel_engine_destroy_pinned_context() is not called with a NULL bce. So it's a potential NULL pointer dereference, not use after free like the subject says. Please fix the subject. ah right. I will resend. Thanks, Nirmoy BR, Jani. Fixes: b35274993680 ("drm/i915: Create a kernel context for GGTT updates") Cc: Oak Zeng Cc: Andi Shyti Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_engine_cs.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c index 179d9546865b..4a11219e560e 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c @@ -1491,7 +1491,8 @@ static int engine_init_common(struct intel_engine_cs *engine) return 0; err_bce_context: - intel_engine_destroy_pinned_context(bce); + if (bce) + intel_engine_destroy_pinned_context(bce); err_ce_context: intel_engine_destroy_pinned_context(ce); return ret;
[PATCH] drm/i915: Prevent potential UAF in engine_init_common
If measure_breadcrumb_dw() returns an error and bce isn't created, this commit ensures that intel_engine_destroy_pinned_context() is not called with a NULL bce. Fixes: b35274993680 ("drm/i915: Create a kernel context for GGTT updates") Cc: Oak Zeng Cc: Andi Shyti Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_engine_cs.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c index 179d9546865b..4a11219e560e 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c @@ -1491,7 +1491,8 @@ static int engine_init_common(struct intel_engine_cs *engine) return 0; err_bce_context: - intel_engine_destroy_pinned_context(bce); + if (bce) + intel_engine_destroy_pinned_context(bce); err_ce_context: intel_engine_destroy_pinned_context(ce); return ret; -- 2.41.0
Re: [Intel-gfx] [PATCH v2] drm/i915: Reduce MCR lock surface
Hi Rodrigo, On 10/4/2023 4:37 PM, Rodrigo Vivi wrote: On Wed, Oct 04, 2023 at 03:54:59PM +0200, Nirmoy Das wrote: Hi Rodrigo, On 10/4/2023 2:44 PM, Rodrigo Vivi wrote: On Wed, Oct 04, 2023 at 02:04:07PM +0200, Nirmoy Das wrote: Take the mcr lock only when driver needs to write into a mcr based tlb based registers. To prevent GT reset interference, employ gt->reset.mutex instead, since intel_gt_mcr_multicast_write relies on gt->uncore->lock not being held. This looks a lot like protecting code and not protecting data [1] But to be really honest I'm afraid we were already doing this before this patch but with 2 other locks instead. I haven't thought about that but yes, the issue was there already. [1] - https://blog.ffwll.ch/2022/07/locking-engineering.html v2: remove unused var, flags. Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_tlb.c | 13 + 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_tlb.c b/drivers/gpu/drm/i915/gt/intel_tlb.c index 139608c30d97..0ad905df4a98 100644 --- a/drivers/gpu/drm/i915/gt/intel_tlb.c +++ b/drivers/gpu/drm/i915/gt/intel_tlb.c @@ -52,15 +52,13 @@ static void mmio_invalidate_full(struct intel_gt *gt) struct intel_engine_cs *engine; intel_engine_mask_t awake, tmp; enum intel_engine_id id; - unsigned long flags; if (GRAPHICS_VER(i915) < 8) return; intel_uncore_forcewake_get(uncore, FORCEWAKE_ALL); - intel_gt_mcr_lock(gt, ); - spin_lock(>lock); /* serialise invalidate with GT reset */ + mutex_lock(>reset.mutex);/* serialise invalidate with GT reset */ I'm still looking at this and the commit message above and trying to understand why we are doing this and changing the previous 2 by this other one. why? We need the MCR lock only for intel_gt_mcr_multicast_*() so I am not replacing the two locks here but moving the mcr lock down where we were doing intel_gt_mcr_multicast_write_fw() why s/spin_lock(>lock)/mutex_lock(>reset.mutex): intel_gt_mcr_multicast_*() expects gt->uncore->lock to be not held is there any lockdep assert or primitive that we could/should do that to avoid this same issue in the future? We have locdep asserts for those mcr functions. anyway, this is also another thing that it is important for the commit message. and why is that? what I have in mind goes along with the comment above intel_de_read_fw(): """ Access to registers should * therefore generally be serialised, by either the dev_priv->uncore.lock or """ Yes, the commit message should've been more clear. Anyways, please ignore this patch. I need to find a better way and it also didn't fix the issue completely that I was working on. Thanks, Nirmoy and to achieve this, I could do something like: if (engine->tlb_inv.mcr) { spin_unlock(>lock); intel_gt_mcr_lock(gt, ); intel_gt_mcr_multicast_write_fw intel_gt_mcr_unlock(gt, flags); spin_lock(>lock); } Or take gt->reset.mutex instead which should block any concurrent gt reset. If this is not acceptable then I can pick the above 1st option but I am not sure how safe is it do release uncore->lock and then take it back again. hmm... probably the gt_reset one is better than releasing and grabbing it again. awake = 0; for_each_engine(engine, gt, id) { @@ -68,9 +66,9 @@ static void mmio_invalidate_full(struct intel_gt *gt) continue; if (engine->tlb_inv.mcr) - intel_gt_mcr_multicast_write_fw(gt, - engine->tlb_inv.reg.mcr_reg, - engine->tlb_inv.request); + intel_gt_mcr_multicast_write(gt, + engine->tlb_inv.reg.mcr_reg, +engine->tlb_inv.request); you are already taking the forcewake_all domain above, so you wouldn't need to convert this to the variant that grabs the forcewake underneath. Also this is not mentioned in the commit message above. intel_gt_mcr_multicast_write() takes the mcr lock for us, helps replacing multiple lines into one. Will there be any side-effects for that ? hmm... I can't forsee side-effects here... but I'm asking myself why on the non MCR ones we are using the global forcewake_all and the _fw to start with. Maybe there was a reason for that? Because in general we should prefer the non _fw variants to start with. Maybe we should dig into the history there to understand why the line below started with the intel_uncore_write_fw below? I should've added that the commit message. I'm even wondering if this should be 2 separated patches?! Regards, Nirmoy else intel_
Re: [PATCH v2] drm/i915: Reduce MCR lock surface
Hi Rodrigo, On 10/4/2023 2:44 PM, Rodrigo Vivi wrote: On Wed, Oct 04, 2023 at 02:04:07PM +0200, Nirmoy Das wrote: Take the mcr lock only when driver needs to write into a mcr based tlb based registers. To prevent GT reset interference, employ gt->reset.mutex instead, since intel_gt_mcr_multicast_write relies on gt->uncore->lock not being held. This looks a lot like protecting code and not protecting data [1] But to be really honest I'm afraid we were already doing this before this patch but with 2 other locks instead. I haven't thought about that but yes, the issue was there already. [1] - https://blog.ffwll.ch/2022/07/locking-engineering.html v2: remove unused var, flags. Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_tlb.c | 13 + 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_tlb.c b/drivers/gpu/drm/i915/gt/intel_tlb.c index 139608c30d97..0ad905df4a98 100644 --- a/drivers/gpu/drm/i915/gt/intel_tlb.c +++ b/drivers/gpu/drm/i915/gt/intel_tlb.c @@ -52,15 +52,13 @@ static void mmio_invalidate_full(struct intel_gt *gt) struct intel_engine_cs *engine; intel_engine_mask_t awake, tmp; enum intel_engine_id id; - unsigned long flags; if (GRAPHICS_VER(i915) < 8) return; intel_uncore_forcewake_get(uncore, FORCEWAKE_ALL); - intel_gt_mcr_lock(gt, ); - spin_lock(>lock); /* serialise invalidate with GT reset */ + mutex_lock(>reset.mutex);/* serialise invalidate with GT reset */ I'm still looking at this and the commit message above and trying to understand why we are doing this and changing the previous 2 by this other one. why? We need the MCR lock only for intel_gt_mcr_multicast_*() so I am not replacing the two locks here but moving the mcr lock down where we were doing intel_gt_mcr_multicast_write_fw() why s/spin_lock(>lock)/mutex_lock(>reset.mutex): intel_gt_mcr_multicast_*() expects gt->uncore->lock to be not held and to achieve this, I could do something like: if (engine->tlb_inv.mcr) { spin_unlock(>lock); intel_gt_mcr_lock(gt, ); intel_gt_mcr_multicast_write_fw intel_gt_mcr_unlock(gt, flags); spin_lock(>lock); } Or take gt->reset.mutex instead which should block any concurrent gt reset. If this is not acceptable then I can pick the above 1st option but I am not sure how safe is it do release uncore->lock and then take it back again. awake = 0; for_each_engine(engine, gt, id) { @@ -68,9 +66,9 @@ static void mmio_invalidate_full(struct intel_gt *gt) continue; if (engine->tlb_inv.mcr) - intel_gt_mcr_multicast_write_fw(gt, - engine->tlb_inv.reg.mcr_reg, - engine->tlb_inv.request); + intel_gt_mcr_multicast_write(gt, + engine->tlb_inv.reg.mcr_reg, +engine->tlb_inv.request); you are already taking the forcewake_all domain above, so you wouldn't need to convert this to the variant that grabs the forcewake underneath. Also this is not mentioned in the commit message above. intel_gt_mcr_multicast_write() takes the mcr lock for us, helps replacing multiple lines into one. Will there be any side-effects for that ? I should've added that the commit message. Regards, Nirmoy else intel_uncore_write_fw(uncore, engine->tlb_inv.reg.reg, @@ -90,8 +88,7 @@ static void mmio_invalidate_full(struct intel_gt *gt) IS_ALDERLAKE_P(i915))) intel_uncore_write_fw(uncore, GEN12_OA_TLB_INV_CR, 1); - spin_unlock(>lock); - intel_gt_mcr_unlock(gt, flags); + mutex_unlock(>reset.mutex); for_each_engine_masked(engine, gt, awake, tmp) { if (wait_for_invalidate(engine)) -- 2.41.0
[PATCH v2] drm/i915: Reduce MCR lock surface
Take the mcr lock only when driver needs to write into a mcr based tlb based registers. To prevent GT reset interference, employ gt->reset.mutex instead, since intel_gt_mcr_multicast_write relies on gt->uncore->lock not being held. v2: remove unused var, flags. Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_tlb.c | 13 + 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_tlb.c b/drivers/gpu/drm/i915/gt/intel_tlb.c index 139608c30d97..0ad905df4a98 100644 --- a/drivers/gpu/drm/i915/gt/intel_tlb.c +++ b/drivers/gpu/drm/i915/gt/intel_tlb.c @@ -52,15 +52,13 @@ static void mmio_invalidate_full(struct intel_gt *gt) struct intel_engine_cs *engine; intel_engine_mask_t awake, tmp; enum intel_engine_id id; - unsigned long flags; if (GRAPHICS_VER(i915) < 8) return; intel_uncore_forcewake_get(uncore, FORCEWAKE_ALL); - intel_gt_mcr_lock(gt, ); - spin_lock(>lock); /* serialise invalidate with GT reset */ + mutex_lock(>reset.mutex);/* serialise invalidate with GT reset */ awake = 0; for_each_engine(engine, gt, id) { @@ -68,9 +66,9 @@ static void mmio_invalidate_full(struct intel_gt *gt) continue; if (engine->tlb_inv.mcr) - intel_gt_mcr_multicast_write_fw(gt, - engine->tlb_inv.reg.mcr_reg, - engine->tlb_inv.request); + intel_gt_mcr_multicast_write(gt, + engine->tlb_inv.reg.mcr_reg, +engine->tlb_inv.request); else intel_uncore_write_fw(uncore, engine->tlb_inv.reg.reg, @@ -90,8 +88,7 @@ static void mmio_invalidate_full(struct intel_gt *gt) IS_ALDERLAKE_P(i915))) intel_uncore_write_fw(uncore, GEN12_OA_TLB_INV_CR, 1); - spin_unlock(>lock); - intel_gt_mcr_unlock(gt, flags); + mutex_unlock(>reset.mutex); for_each_engine_masked(engine, gt, awake, tmp) { if (wait_for_invalidate(engine)) -- 2.41.0
[PATCH] drm/i915: Reduce MCR lock surface
Take the mcr lock only when driver needs to write into a mcr based tlb based registers. To prevent GT reset interference, employ gt->reset.mutex instead, since intel_gt_mcr_multicast_write relies on gt->uncore->lock not being held. Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_tlb.c | 12 +--- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_tlb.c b/drivers/gpu/drm/i915/gt/intel_tlb.c index 139608c30d97..7b2d9549e595 100644 --- a/drivers/gpu/drm/i915/gt/intel_tlb.c +++ b/drivers/gpu/drm/i915/gt/intel_tlb.c @@ -59,8 +59,7 @@ static void mmio_invalidate_full(struct intel_gt *gt) intel_uncore_forcewake_get(uncore, FORCEWAKE_ALL); - intel_gt_mcr_lock(gt, ); - spin_lock(>lock); /* serialise invalidate with GT reset */ + mutex_lock(>reset.mutex);/* serialise invalidate with GT reset */ awake = 0; for_each_engine(engine, gt, id) { @@ -68,9 +67,9 @@ static void mmio_invalidate_full(struct intel_gt *gt) continue; if (engine->tlb_inv.mcr) - intel_gt_mcr_multicast_write_fw(gt, - engine->tlb_inv.reg.mcr_reg, - engine->tlb_inv.request); + intel_gt_mcr_multicast_write(gt, + engine->tlb_inv.reg.mcr_reg, +engine->tlb_inv.request); else intel_uncore_write_fw(uncore, engine->tlb_inv.reg.reg, @@ -90,8 +89,7 @@ static void mmio_invalidate_full(struct intel_gt *gt) IS_ALDERLAKE_P(i915))) intel_uncore_write_fw(uncore, GEN12_OA_TLB_INV_CR, 1); - spin_unlock(>lock); - intel_gt_mcr_unlock(gt, flags); + mutex_unlock(>reset.mutex); for_each_engine_masked(engine, gt, awake, tmp) { if (wait_for_invalidate(engine)) -- 2.41.0
Re: [PATCH] drm/i915: Invalidate all GTs in flush_tlb_invalidate()
Hi Andi, On 10/2/2023 3:45 PM, Andi Shyti wrote: Hi Nirmoy, On Mon, Oct 02, 2023 at 02:20:32PM +0200, Nirmoy Das wrote: Don't return early if one of the GT doesn't require any flushing. Fixes: d6c531ab4820 ("drm/i915: Invalidate the TLBs on each GT") Cc: Chris Wilson Cc: Fei Yang Cc: Mauro Carvalho Chehab Cc: Andi Shyti Cc: Jani Nikula Cc: Joonas Lahtinen Cc: Rodrigo Vivi Cc: Tvrtko Ursulin Cc: Nirmoy Das Cc: Matthew Auld Cc: "Ville Syrjälä" Cc: Andrzej Hajda Cc: "Christian König" Cc: Jonathan Cavitt Cc: Matt Roper Cc: intel-...@lists.freedesktop.org Signed-off-by: Nirmoy Das This patch has been sent already here: https://patchwork.freedesktop.org/patch/560309/?series=124472=1 Can we take it from there? Missed it. Sure, works for me. Thanks, Nirmoy Andi
Re: [Intel-gfx] [PATCH] drm/i915: Invalidate the TLBs on each GT
+dri-devel On 10/2/2023 4:07 PM, Jonathan Cavitt wrote: From: Chris Wilson With multi-GT devices, the object may have been bound on each GT and so we need to invalidate the TLBs across all GT before releasing the pages back to the system. Fixes: d6c531ab4820 ("drm/i915: Invalidate the TLBs on each GT") Signed-off-by: Chris Wilson Signed-off-by: Jonathan Cavitt CC: Matt Roper CC: Andi Shyti Reviewed-by: Andi Shyti Reviewed-by: Nirmoy Das Can't find it in dri-devel but if you didn't then Cc to dri-devel@lists.freedesktop.org for change that touches gt/gem files. Regards, Nirmoy --- drivers/gpu/drm/i915/gem/i915_gem_pages.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pages.c b/drivers/gpu/drm/i915/gem/i915_gem_pages.c index 6b6d22c194110..0ba955611dfb5 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_pages.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_pages.c @@ -198,7 +198,7 @@ static void flush_tlb_invalidate(struct drm_i915_gem_object *obj) for_each_gt(gt, i915, id) { if (!obj->mm.tlb[id]) - return; + continue; intel_gt_invalidate_tlb_full(gt, obj->mm.tlb[id]); obj->mm.tlb[id] = 0;
[PATCH] drm/i915: Invalidate all GTs in flush_tlb_invalidate()
Don't return early if one of the GT doesn't require any flushing. Fixes: d6c531ab4820 ("drm/i915: Invalidate the TLBs on each GT") Cc: Chris Wilson Cc: Fei Yang Cc: Mauro Carvalho Chehab Cc: Andi Shyti Cc: Jani Nikula Cc: Joonas Lahtinen Cc: Rodrigo Vivi Cc: Tvrtko Ursulin Cc: Nirmoy Das Cc: Matthew Auld Cc: "Ville Syrjälä" Cc: Andrzej Hajda Cc: "Christian König" Cc: Jonathan Cavitt Cc: Matt Roper Cc: intel-...@lists.freedesktop.org Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gem/i915_gem_pages.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pages.c b/drivers/gpu/drm/i915/gem/i915_gem_pages.c index 6b6d22c19411..0ba955611dfb 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_pages.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_pages.c @@ -198,7 +198,7 @@ static void flush_tlb_invalidate(struct drm_i915_gem_object *obj) for_each_gt(gt, i915, id) { if (!obj->mm.tlb[id]) - return; + continue; intel_gt_invalidate_tlb_full(gt, obj->mm.tlb[id]); obj->mm.tlb[id] = 0; -- 2.41.0
Re: [Intel-gfx] [PATCH v7 1/4] drm/i915: Introduce intel_gt_mcr_lock_sanitize()
On 9/28/2023 11:42 PM, Andrzej Hajda wrote: On 28.09.2023 15:00, Nirmoy Das wrote: Implement intel_gt_mcr_lock_sanitize() to provide a mechanism for cleaning the steer semaphore when absolutely necessary. v2: remove unnecessary lock(Andi, Matt) improve the kernel doc(Matt) s/intel_gt_mcr_lock_clear/intel_gt_mcr_lock_sanitize Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 22 ++ drivers/gpu/drm/i915/gt/intel_gt_mcr.h | 1 + 2 files changed, 23 insertions(+) diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c index bf4a933de03a..326c2ed1d99b 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c +++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c @@ -419,6 +419,28 @@ void intel_gt_mcr_unlock(struct intel_gt *gt, unsigned long flags) intel_uncore_write_fw(gt->uncore, MTL_STEER_SEMAPHORE, 0x1); } +/** + * intel_gt_mcr_lock_sanitize - Sanitize MCR steering lock + * @gt: GT structure + * + * This will be used to sanitize the initial status of the hardware lock + * during driver load and resume since there won't be any concurrent access + * from other agents at those times, but it's possible that boot firmware + * may have left the lock in a bad state. + * + */ +void intel_gt_mcr_lock_sanitize(struct intel_gt *gt) +{ + /* + * This gets called at load/resume time, so we shouldn't be + * racing with other driver threads grabbing the mcr lock. + */ + lockdep_assert_not_held(>mcr_lock); + + if (GRAPHICS_VER_FULL(gt->i915) >= IP_VER(12, 70)) + intel_uncore_write_fw(gt->uncore, MTL_STEER_SEMAPHORE, 0x1); I wonder if it wouldn't be useful to check and report if it is locked before unconditional release, no strong feelings. Not so useful for user but may be as debug log if we need. Reviewed-by: Andrzej Hajda Thanks, Nirmoy Regards Andrzej +} + /** * intel_gt_mcr_read - read a specific instance of an MCR register * @gt: GT structure diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.h b/drivers/gpu/drm/i915/gt/intel_gt_mcr.h index 41684495b7da..01ac565a56a4 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.h +++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.h @@ -11,6 +11,7 @@ void intel_gt_mcr_init(struct intel_gt *gt); void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long *flags); void intel_gt_mcr_unlock(struct intel_gt *gt, unsigned long flags); +void intel_gt_mcr_lock_sanitize(struct intel_gt *gt); u32 intel_gt_mcr_read(struct intel_gt *gt, i915_mcr_reg_t reg,
Re: [PATCH v7 1/4] drm/i915: Introduce intel_gt_mcr_lock_sanitize()
Thanks reviewing this series. Merged it in gt-next so hopefully we have bit greener CI for MTL now. Regards, Nirmoy On 9/28/2023 3:00 PM, Nirmoy Das wrote: Implement intel_gt_mcr_lock_sanitize() to provide a mechanism for cleaning the steer semaphore when absolutely necessary. v2: remove unnecessary lock(Andi, Matt) improve the kernel doc(Matt) s/intel_gt_mcr_lock_clear/intel_gt_mcr_lock_sanitize Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 22 ++ drivers/gpu/drm/i915/gt/intel_gt_mcr.h | 1 + 2 files changed, 23 insertions(+) diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c index bf4a933de03a..326c2ed1d99b 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c +++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c @@ -419,6 +419,28 @@ void intel_gt_mcr_unlock(struct intel_gt *gt, unsigned long flags) intel_uncore_write_fw(gt->uncore, MTL_STEER_SEMAPHORE, 0x1); } +/** + * intel_gt_mcr_lock_sanitize - Sanitize MCR steering lock + * @gt: GT structure + * + * This will be used to sanitize the initial status of the hardware lock + * during driver load and resume since there won't be any concurrent access + * from other agents at those times, but it's possible that boot firmware + * may have left the lock in a bad state. + * + */ +void intel_gt_mcr_lock_sanitize(struct intel_gt *gt) +{ + /* +* This gets called at load/resume time, so we shouldn't be +* racing with other driver threads grabbing the mcr lock. +*/ + lockdep_assert_not_held(>mcr_lock); + + if (GRAPHICS_VER_FULL(gt->i915) >= IP_VER(12, 70)) + intel_uncore_write_fw(gt->uncore, MTL_STEER_SEMAPHORE, 0x1); +} + /** * intel_gt_mcr_read - read a specific instance of an MCR register * @gt: GT structure diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.h b/drivers/gpu/drm/i915/gt/intel_gt_mcr.h index 41684495b7da..01ac565a56a4 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.h +++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.h @@ -11,6 +11,7 @@ void intel_gt_mcr_init(struct intel_gt *gt); void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long *flags); void intel_gt_mcr_unlock(struct intel_gt *gt, unsigned long flags); +void intel_gt_mcr_lock_sanitize(struct intel_gt *gt); u32 intel_gt_mcr_read(struct intel_gt *gt, i915_mcr_reg_t reg,
[PATCH v7 4/4] drm/i915/mtl: Skip MCR ops for ring fault register
On MTL GEN12_RING_FAULT_REG is not replicated so don't do mcr based operation for this register. v2: use MEDIA_VER() instead of GRAPHICS_VER()(Matt). v3: s/"MEDIA_VER(i915) == 13"/"MEDIA_VER(i915) >= 13"(Matt) improve comment. v4: improve the comment further(Andi) Signed-off-by: Nirmoy Das Reviewed-by: Matt Roper Reviewed-by: Andi Shyti Reviewed-by: Andrzej Hajda --- drivers/gpu/drm/i915/gt/intel_gt.c | 13 - drivers/gpu/drm/i915/gt/intel_gt_regs.h | 1 + drivers/gpu/drm/i915/i915_gpu_error.c | 11 ++- 3 files changed, 23 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c index 93062c35e072..dff8bba1f5d4 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.c +++ b/drivers/gpu/drm/i915/gt/intel_gt.c @@ -262,10 +262,21 @@ intel_gt_clear_error_registers(struct intel_gt *gt, I915_MASTER_ERROR_INTERRUPT); } - if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50)) { + /* +* For the media GT, this ring fault register is not replicated, +* so don't do multicast/replicated register read/write operation on it. +*/ + if (MEDIA_VER(i915) >= 13 && gt->type == GT_MEDIA) { + intel_uncore_rmw(uncore, XELPMP_RING_FAULT_REG, +RING_FAULT_VALID, 0); + intel_uncore_posting_read(uncore, + XELPMP_RING_FAULT_REG); + + } else if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50)) { intel_gt_mcr_multicast_rmw(gt, XEHP_RING_FAULT_REG, RING_FAULT_VALID, 0); intel_gt_mcr_read_any(gt, XEHP_RING_FAULT_REG); + } else if (GRAPHICS_VER(i915) >= 12) { intel_uncore_rmw(uncore, GEN12_RING_FAULT_REG, RING_FAULT_VALID, 0); intel_uncore_posting_read(uncore, GEN12_RING_FAULT_REG); diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h b/drivers/gpu/drm/i915/gt/intel_gt_regs.h index cca4bac8f8b0..eecd0a87a647 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h +++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h @@ -1084,6 +1084,7 @@ #define GEN12_RING_FAULT_REG _MMIO(0xcec4) #define XEHP_RING_FAULT_REGMCR_REG(0xcec4) +#define XELPMP_RING_FAULT_REG _MMIO(0xcec4) #define GEN8_RING_FAULT_ENGINE_ID(x) (((x) >> 12) & 0x7) #define RING_FAULT_GTTSEL_MASK (1 << 11) #define RING_FAULT_SRCID(x) (((x) >> 3) & 0xff) diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index f4ebcfb70289..b4e31e59c799 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -1234,7 +1234,16 @@ static void engine_record_registers(struct intel_engine_coredump *ee) if (GRAPHICS_VER(i915) >= 6) { ee->rc_psmi = ENGINE_READ(engine, RING_PSMI_CTL); - if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50)) + /* +* For the media GT, this ring fault register is not replicated, +* so don't do multicast/replicated register read/write +* operation on it. +*/ + if (MEDIA_VER(i915) >= 13 && engine->gt->type == GT_MEDIA) + ee->fault_reg = intel_uncore_read(engine->uncore, + XELPMP_RING_FAULT_REG); + + else if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50)) ee->fault_reg = intel_gt_mcr_read_any(engine->gt, XEHP_RING_FAULT_REG); else if (GRAPHICS_VER(i915) >= 12) -- 2.41.0
[PATCH v7 3/4] drm/i915: Clean steer semaphore on resume
During resume, the steer semaphore on GT1 was observed to be held. The hardware team has confirmed the safety of clearing steer semaphores for all GTs during driver load/resume, as no lock acquisitions can occur in this process by other agents. v2: reset on resume not in intel_gt_init(). v3: do the reset on intel_gt_resume_early() v4: do general sanitization for all GTs(Matt) Signed-off-by: Nirmoy Das Reviewed-by: Andi Shyti --- drivers/gpu/drm/i915/gt/intel_gt_pm.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c index dab73980c9f1..3461f3e74277 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c @@ -13,6 +13,7 @@ #include "intel_engine_pm.h" #include "intel_gt.h" #include "intel_gt_clock_utils.h" +#include "intel_gt_mcr.h" #include "intel_gt_pm.h" #include "intel_gt_print.h" #include "intel_gt_requests.h" @@ -218,6 +219,15 @@ void intel_gt_pm_fini(struct intel_gt *gt) void intel_gt_resume_early(struct intel_gt *gt) { + /* +* Sanitize steer semaphores during driver resume. This is necessary +* to address observed cases of steer semaphores being +* held after a suspend operation. Confirmation from the hardware team +* assures the safety of this operation, as no lock acquisitions +* by other agents occur during driver load/resume process. +*/ + intel_gt_mcr_lock_sanitize(gt); + intel_uncore_resume_early(gt->uncore); intel_gt_check_and_clear_faults(gt); } -- 2.41.0
[PATCH v7 2/4] drm/i915: Introduce the intel_gt_resume_early()
Move early resume functions of gt to a proper file. Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_gt_pm.c | 6 ++ drivers/gpu/drm/i915/gt/intel_gt_pm.h | 1 + drivers/gpu/drm/i915/i915_driver.c| 6 ++ 3 files changed, 9 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c index 5a942af0a14e..dab73980c9f1 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c @@ -216,6 +216,12 @@ void intel_gt_pm_fini(struct intel_gt *gt) intel_rc6_fini(>rc6); } +void intel_gt_resume_early(struct intel_gt *gt) +{ + intel_uncore_resume_early(gt->uncore); + intel_gt_check_and_clear_faults(gt); +} + int intel_gt_resume(struct intel_gt *gt) { struct intel_engine_cs *engine; diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h index 6c9a46452364..b1eeb5b33918 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h @@ -78,6 +78,7 @@ void intel_gt_pm_fini(struct intel_gt *gt); void intel_gt_suspend_prepare(struct intel_gt *gt); void intel_gt_suspend_late(struct intel_gt *gt); int intel_gt_resume(struct intel_gt *gt); +void intel_gt_resume_early(struct intel_gt *gt); void intel_gt_runtime_suspend(struct intel_gt *gt); int intel_gt_runtime_resume(struct intel_gt *gt); diff --git a/drivers/gpu/drm/i915/i915_driver.c b/drivers/gpu/drm/i915/i915_driver.c index d50347e5773a..78501a83ba10 100644 --- a/drivers/gpu/drm/i915/i915_driver.c +++ b/drivers/gpu/drm/i915/i915_driver.c @@ -1327,10 +1327,8 @@ static int i915_drm_resume_early(struct drm_device *dev) drm_err(_priv->drm, "Resume prepare failed: %d, continuing anyway\n", ret); - for_each_gt(gt, dev_priv, i) { - intel_uncore_resume_early(gt->uncore); - intel_gt_check_and_clear_faults(gt); - } + for_each_gt(gt, dev_priv, i) + intel_gt_resume_early(gt); intel_display_power_resume_early(dev_priv); -- 2.41.0
[PATCH v7 1/4] drm/i915: Introduce intel_gt_mcr_lock_sanitize()
Implement intel_gt_mcr_lock_sanitize() to provide a mechanism for cleaning the steer semaphore when absolutely necessary. v2: remove unnecessary lock(Andi, Matt) improve the kernel doc(Matt) s/intel_gt_mcr_lock_clear/intel_gt_mcr_lock_sanitize Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 22 ++ drivers/gpu/drm/i915/gt/intel_gt_mcr.h | 1 + 2 files changed, 23 insertions(+) diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c index bf4a933de03a..326c2ed1d99b 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c +++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c @@ -419,6 +419,28 @@ void intel_gt_mcr_unlock(struct intel_gt *gt, unsigned long flags) intel_uncore_write_fw(gt->uncore, MTL_STEER_SEMAPHORE, 0x1); } +/** + * intel_gt_mcr_lock_sanitize - Sanitize MCR steering lock + * @gt: GT structure + * + * This will be used to sanitize the initial status of the hardware lock + * during driver load and resume since there won't be any concurrent access + * from other agents at those times, but it's possible that boot firmware + * may have left the lock in a bad state. + * + */ +void intel_gt_mcr_lock_sanitize(struct intel_gt *gt) +{ + /* +* This gets called at load/resume time, so we shouldn't be +* racing with other driver threads grabbing the mcr lock. +*/ + lockdep_assert_not_held(>mcr_lock); + + if (GRAPHICS_VER_FULL(gt->i915) >= IP_VER(12, 70)) + intel_uncore_write_fw(gt->uncore, MTL_STEER_SEMAPHORE, 0x1); +} + /** * intel_gt_mcr_read - read a specific instance of an MCR register * @gt: GT structure diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.h b/drivers/gpu/drm/i915/gt/intel_gt_mcr.h index 41684495b7da..01ac565a56a4 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.h +++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.h @@ -11,6 +11,7 @@ void intel_gt_mcr_init(struct intel_gt *gt); void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long *flags); void intel_gt_mcr_unlock(struct intel_gt *gt, unsigned long flags); +void intel_gt_mcr_lock_sanitize(struct intel_gt *gt); u32 intel_gt_mcr_read(struct intel_gt *gt, i915_mcr_reg_t reg, -- 2.41.0
Re: [PATCH] drm/i915: Don't set PIPE_CONTROL_FLUSH_L3 for aux inval
Hi Tapani, On 9/27/2023 6:13 AM, Tapani Pälli wrote: Fixes all regressions we saw, I also run some extra vulkan and GL workloads, no regressions observed. Tested-by: Tapani Pälli Thanks to testing it. The patch is now merged with" # v5.8+" tag so it should trickle down to v6.4.10. as normal stable release process. Thanks, Nirmoy On 26.9.2023 17.24, Nirmoy Das wrote: PIPE_CONTROL_FLUSH_L3 is not needed for aux invalidation so don't set that. Fixes: 78a6ccd65fa3 ("drm/i915/gt: Ensure memory quiesced before invalidation") Cc: Jonathan Cavitt Cc: Andi Shyti Cc: # v5.8+ Cc: Andrzej Hajda Cc: Tvrtko Ursulin Cc: Matt Roper Cc: Tejas Upadhyay Cc: Lucas De Marchi Cc: Prathap Kumar Valsan Cc: Tapani Pälli Cc: Mark Janes Cc: Rodrigo Vivi Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/gen8_engine_cs.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/gt/gen8_engine_cs.c b/drivers/gpu/drm/i915/gt/gen8_engine_cs.c index 0143445dba83..ba4c2422b340 100644 --- a/drivers/gpu/drm/i915/gt/gen8_engine_cs.c +++ b/drivers/gpu/drm/i915/gt/gen8_engine_cs.c @@ -271,8 +271,17 @@ int gen12_emit_flush_rcs(struct i915_request *rq, u32 mode) if (GRAPHICS_VER_FULL(rq->i915) >= IP_VER(12, 70)) bit_group_0 |= PIPE_CONTROL_CCS_FLUSH; + /* + * L3 fabric flush is needed for AUX CCS invalidation + * which happens as part of pipe-control so we can + * ignore PIPE_CONTROL_FLUSH_L3. Also PIPE_CONTROL_FLUSH_L3 + * deals with Protected Memory which is not needed for + * AUX CCS invalidation and lead to unwanted side effects. + */ + if (mode & EMIT_FLUSH) + bit_group_1 |= PIPE_CONTROL_FLUSH_L3; + bit_group_1 |= PIPE_CONTROL_TILE_CACHE_FLUSH; - bit_group_1 |= PIPE_CONTROL_FLUSH_L3; bit_group_1 |= PIPE_CONTROL_RENDER_TARGET_CACHE_FLUSH; bit_group_1 |= PIPE_CONTROL_DEPTH_CACHE_FLUSH; /* Wa_1409600907:tgl,adl-p */
Re: [PATCH v6 2/4] drm/i915: Introduce the intel_gt_resume_early()
Hi Andi, On 9/28/2023 9:24 AM, Andi Shyti wrote: Hi Nirmoy, On Wed, Sep 27, 2023 at 11:03:55PM +0200, Nirmoy Das wrote: Move early resume functions of gt to a proper file. Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_gt_pm.c | 6 ++ drivers/gpu/drm/i915/gt/intel_gt_pm.h | 1 + drivers/gpu/drm/i915/i915_driver.c| 6 ++ 3 files changed, 9 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c index 5a942af0a14e..dab73980c9f1 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c @@ -216,6 +216,12 @@ void intel_gt_pm_fini(struct intel_gt *gt) intel_rc6_fini(>rc6); } +void intel_gt_resume_early(struct intel_gt *gt) +{ + intel_uncore_resume_early(gt->uncore); + intel_gt_check_and_clear_faults(gt); +} + should this go into the gt/ directory? Don't wanted to add anything new to i915_driver which can be done at lower levels as suggested by Jani and this seems like it should go to gt dir. Besides, if we don't need spinlocks in the whole reset function, we could directly have the intel_uncore_write_fw(gt->uncore, MTL_STEER_SEMAPHORE, 0x1); here, so that we avoid having one line functions. I think steer specific functions should stay in one file/header. Thanks, Nirmoy Andi
Re: [PATCH v6 1/4] drm/i915: Introduce intel_gt_mcr_lock_reset()
On 9/28/2023 9:18 AM, Andi Shyti wrote: Hi Nirmoy, your client is still missing my e-mails? :) I did reply with a question! +void intel_gt_mcr_lock_reset(struct intel_gt *gt) +{ + unsigned long __flags; + + lockdep_assert_not_held(>uncore->lock); + + spin_lock_irqsave(>mcr_lock, __flags); + + if (GRAPHICS_VER_FULL(gt->i915) >= IP_VER(12, 70)) + intel_uncore_write_fw(gt->uncore, MTL_STEER_SEMAPHORE, 0x1); + + spin_unlock_irqrestore(>mcr_lock, __flags); I commented here that probably we don't need the locks. And I asked you whether you had any reason for locking here. I asked the question back to you and I send this before you could reply. Will remove the lock. Thanks, Nirmoy Andi +}
Re: [PATCH v6 1/4] drm/i915: Introduce intel_gt_mcr_lock_reset()
On 9/28/2023 12:23 AM, Matt Roper wrote: On Wed, Sep 27, 2023 at 11:03:54PM +0200, Nirmoy Das wrote: Implement intel_gt_mcr_lock_reset() to provide a mechanism for resetting the steer semaphore when absolutely necessary. Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 29 ++ drivers/gpu/drm/i915/gt/intel_gt_mcr.h | 1 + 2 files changed, 30 insertions(+) diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c index bf4a933de03a..d98e0d2fc2ee 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c +++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c @@ -419,6 +419,35 @@ void intel_gt_mcr_unlock(struct intel_gt *gt, unsigned long flags) intel_uncore_write_fw(gt->uncore, MTL_STEER_SEMAPHORE, 0x1); } +/** + * intel_gt_mcr_lock_reset - Reset MCR steering lock + * @gt: GT structure + * + * Performs a steer semaphore reset operation. On MTL and beyond, a hardware + * lock will also be taken to serialize access not only for the driver, + * but also for external hardware and firmware agents. The text here makes it sound like this reset function is going to take the lock. Since we have the same language in the lock() function's kerneldoc, I think you can just delete this whole sentence. + * However, there may be situations where the driver must reset the semaphore + * but only when it is absolutely certain that no other agent should own the + * lock at that given time. This part leads to questions about what such situations would be and how we'd know it's safe to use. Maybe it's best to just say something like "This will be used to sanitize the initial status of the hardware lock during driver load and resume since there won't be any concurrent access from other agents at those times, but it's possible that boot firmware may have left the lock in a bad state." sounds better than mine. I will just keep that as description and remove rest. + * + * Context: Takes gt->mcr_lock. uncore->lock should *not* be held when this + * function is called, although it may be acquired after this + * function call. + */ +void intel_gt_mcr_lock_reset(struct intel_gt *gt) +{ + unsigned long __flags; + + lockdep_assert_not_held(>uncore->lock); + + spin_lock_irqsave(>mcr_lock, __flags); If we're doing this to sanitize at load/resume, then presumably we shouldn't ever be racing with other driver threads either, right? Driver load and suspend/resume is single threaded afaik but I think I just needed double conformation. Will remove the lock. Thanks, Nirmoy If it was possible for some other thread to already be grabbing the MCR lock, then that would mean it also isn't safe for us to reset it here either. Matt + + if (GRAPHICS_VER_FULL(gt->i915) >= IP_VER(12, 70)) + intel_uncore_write_fw(gt->uncore, MTL_STEER_SEMAPHORE, 0x1); + + spin_unlock_irqrestore(>mcr_lock, __flags); +} + /** * intel_gt_mcr_read - read a specific instance of an MCR register * @gt: GT structure diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.h b/drivers/gpu/drm/i915/gt/intel_gt_mcr.h index 41684495b7da..485c7711f2e8 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.h +++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.h @@ -11,6 +11,7 @@ void intel_gt_mcr_init(struct intel_gt *gt); void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long *flags); void intel_gt_mcr_unlock(struct intel_gt *gt, unsigned long flags); +void intel_gt_mcr_lock_reset(struct intel_gt *gt); u32 intel_gt_mcr_read(struct intel_gt *gt, i915_mcr_reg_t reg, -- 2.41.0
Re: [PATCH v6 3/4] drm/i915: Reset steer semaphore for media GT on resume
On 9/28/2023 9:19 AM, Andi Shyti wrote: Hi Nirmoy, On Wed, Sep 27, 2023 at 11:03:56PM +0200, Nirmoy Das wrote: During resume, the steer semaphore on GT1 was observed to be held. The hardware team has confirmed the safety of clearing the steer semaphore during driver load/resume, as no lock acquisitions can occur in this process by other agents. v2: reset on resume not in intel_gt_init(). v3: do the reset on intel_gt_resume_early() Signed-off-by: Nirmoy Das In the previous version I added my r-b here. I moved code to different function so wanted to be sure :) Please consider it for the next version: Reviewed-by: Andi Shyti Even though there are still some quesions coming from Matt. I will make it generic for all GT in the next version. Thanks, Nirmoy Thanks, Andi --- drivers/gpu/drm/i915/gt/intel_gt_pm.c | 12 1 file changed, 12 insertions(+) diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c index dab73980c9f1..59cebf205b72 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c @@ -13,6 +13,7 @@ #include "intel_engine_pm.h" #include "intel_gt.h" #include "intel_gt_clock_utils.h" +#include "intel_gt_mcr.h" #include "intel_gt_pm.h" #include "intel_gt_print.h" #include "intel_gt_requests.h" @@ -218,6 +219,17 @@ void intel_gt_pm_fini(struct intel_gt *gt) void intel_gt_resume_early(struct intel_gt *gt) { + /* +* Reset the steer semaphore on GT1, as we have observed it +* remaining held after a suspend operation. Confirmation +* from the hardware team ensures the safety of resetting +* the steer semaphore during driver load/resume, as there +* are no lock acquisitions during this process by other +* agents. +*/ + if (MEDIA_VER(gt->i915) >= 13 && gt->type == GT_MEDIA) + intel_gt_mcr_lock_reset(gt); + intel_uncore_resume_early(gt->uncore); intel_gt_check_and_clear_faults(gt); } -- 2.41.0
Re: [PATCH v6 3/4] drm/i915: Reset steer semaphore for media GT on resume
On 9/28/2023 12:35 AM, Matt Roper wrote: On Wed, Sep 27, 2023 at 11:03:56PM +0200, Nirmoy Das wrote: During resume, the steer semaphore on GT1 was observed to be held. The hardware team has confirmed the safety of clearing the steer semaphore during driver load/resume, as no lock acquisitions can occur in this process by other agents. I guess the question is whether we just want to handle the one case where we've already seen a BIOS snapshot screw up (i.e., specifically on GT1 during resume), or do we want to make this a general sanitization that we do on both GTs at both load and resume, just to be safe? Given that the hardware team has indicated no external agents would be expected to be using steering at the point the driver is loading/resuming, maybe it's best to always do the sanitization on platforms that have a hardware semaphore? Agree, will make it got all GT. Thanks, Nirmoy Matt v2: reset on resume not in intel_gt_init(). v3: do the reset on intel_gt_resume_early() Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_gt_pm.c | 12 1 file changed, 12 insertions(+) diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c index dab73980c9f1..59cebf205b72 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c @@ -13,6 +13,7 @@ #include "intel_engine_pm.h" #include "intel_gt.h" #include "intel_gt_clock_utils.h" +#include "intel_gt_mcr.h" #include "intel_gt_pm.h" #include "intel_gt_print.h" #include "intel_gt_requests.h" @@ -218,6 +219,17 @@ void intel_gt_pm_fini(struct intel_gt *gt) void intel_gt_resume_early(struct intel_gt *gt) { + /* +* Reset the steer semaphore on GT1, as we have observed it +* remaining held after a suspend operation. Confirmation +* from the hardware team ensures the safety of resetting +* the steer semaphore during driver load/resume, as there +* are no lock acquisitions during this process by other +* agents. +*/ + if (MEDIA_VER(gt->i915) >= 13 && gt->type == GT_MEDIA) + intel_gt_mcr_lock_reset(gt); + intel_uncore_resume_early(gt->uncore); intel_gt_check_and_clear_faults(gt); } -- 2.41.0
[PATCH v6 4/4] drm/i915/mtl: Skip MCR ops for ring fault register
On MTL GEN12_RING_FAULT_REG is not replicated so don't do mcr based operation for this register. v2: use MEDIA_VER() instead of GRAPHICS_VER()(Matt). v3: s/"MEDIA_VER(i915) == 13"/"MEDIA_VER(i915) >= 13"(Matt) improve comment. v4: improve the comment further(Andi) Signed-off-by: Nirmoy Das Reviewed-by: Matt Roper Reviewed-by: Andi Shyti Reviewed-by: Andrzej Hajda --- drivers/gpu/drm/i915/gt/intel_gt.c | 13 - drivers/gpu/drm/i915/gt/intel_gt_regs.h | 1 + drivers/gpu/drm/i915/i915_gpu_error.c | 11 ++- 3 files changed, 23 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c index 93062c35e072..dff8bba1f5d4 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.c +++ b/drivers/gpu/drm/i915/gt/intel_gt.c @@ -262,10 +262,21 @@ intel_gt_clear_error_registers(struct intel_gt *gt, I915_MASTER_ERROR_INTERRUPT); } - if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50)) { + /* +* For the media GT, this ring fault register is not replicated, +* so don't do multicast/replicated register read/write operation on it. +*/ + if (MEDIA_VER(i915) >= 13 && gt->type == GT_MEDIA) { + intel_uncore_rmw(uncore, XELPMP_RING_FAULT_REG, +RING_FAULT_VALID, 0); + intel_uncore_posting_read(uncore, + XELPMP_RING_FAULT_REG); + + } else if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50)) { intel_gt_mcr_multicast_rmw(gt, XEHP_RING_FAULT_REG, RING_FAULT_VALID, 0); intel_gt_mcr_read_any(gt, XEHP_RING_FAULT_REG); + } else if (GRAPHICS_VER(i915) >= 12) { intel_uncore_rmw(uncore, GEN12_RING_FAULT_REG, RING_FAULT_VALID, 0); intel_uncore_posting_read(uncore, GEN12_RING_FAULT_REG); diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h b/drivers/gpu/drm/i915/gt/intel_gt_regs.h index cca4bac8f8b0..eecd0a87a647 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h +++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h @@ -1084,6 +1084,7 @@ #define GEN12_RING_FAULT_REG _MMIO(0xcec4) #define XEHP_RING_FAULT_REGMCR_REG(0xcec4) +#define XELPMP_RING_FAULT_REG _MMIO(0xcec4) #define GEN8_RING_FAULT_ENGINE_ID(x) (((x) >> 12) & 0x7) #define RING_FAULT_GTTSEL_MASK (1 << 11) #define RING_FAULT_SRCID(x) (((x) >> 3) & 0xff) diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index f4ebcfb70289..b4e31e59c799 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -1234,7 +1234,16 @@ static void engine_record_registers(struct intel_engine_coredump *ee) if (GRAPHICS_VER(i915) >= 6) { ee->rc_psmi = ENGINE_READ(engine, RING_PSMI_CTL); - if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50)) + /* +* For the media GT, this ring fault register is not replicated, +* so don't do multicast/replicated register read/write +* operation on it. +*/ + if (MEDIA_VER(i915) >= 13 && engine->gt->type == GT_MEDIA) + ee->fault_reg = intel_uncore_read(engine->uncore, + XELPMP_RING_FAULT_REG); + + else if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50)) ee->fault_reg = intel_gt_mcr_read_any(engine->gt, XEHP_RING_FAULT_REG); else if (GRAPHICS_VER(i915) >= 12) -- 2.41.0
[PATCH v6 3/4] drm/i915: Reset steer semaphore for media GT on resume
During resume, the steer semaphore on GT1 was observed to be held. The hardware team has confirmed the safety of clearing the steer semaphore during driver load/resume, as no lock acquisitions can occur in this process by other agents. v2: reset on resume not in intel_gt_init(). v3: do the reset on intel_gt_resume_early() Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_gt_pm.c | 12 1 file changed, 12 insertions(+) diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c index dab73980c9f1..59cebf205b72 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c @@ -13,6 +13,7 @@ #include "intel_engine_pm.h" #include "intel_gt.h" #include "intel_gt_clock_utils.h" +#include "intel_gt_mcr.h" #include "intel_gt_pm.h" #include "intel_gt_print.h" #include "intel_gt_requests.h" @@ -218,6 +219,17 @@ void intel_gt_pm_fini(struct intel_gt *gt) void intel_gt_resume_early(struct intel_gt *gt) { + /* +* Reset the steer semaphore on GT1, as we have observed it +* remaining held after a suspend operation. Confirmation +* from the hardware team ensures the safety of resetting +* the steer semaphore during driver load/resume, as there +* are no lock acquisitions during this process by other +* agents. +*/ + if (MEDIA_VER(gt->i915) >= 13 && gt->type == GT_MEDIA) + intel_gt_mcr_lock_reset(gt); + intel_uncore_resume_early(gt->uncore); intel_gt_check_and_clear_faults(gt); } -- 2.41.0
[PATCH v6 2/4] drm/i915: Introduce the intel_gt_resume_early()
Move early resume functions of gt to a proper file. Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_gt_pm.c | 6 ++ drivers/gpu/drm/i915/gt/intel_gt_pm.h | 1 + drivers/gpu/drm/i915/i915_driver.c| 6 ++ 3 files changed, 9 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c index 5a942af0a14e..dab73980c9f1 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c @@ -216,6 +216,12 @@ void intel_gt_pm_fini(struct intel_gt *gt) intel_rc6_fini(>rc6); } +void intel_gt_resume_early(struct intel_gt *gt) +{ + intel_uncore_resume_early(gt->uncore); + intel_gt_check_and_clear_faults(gt); +} + int intel_gt_resume(struct intel_gt *gt) { struct intel_engine_cs *engine; diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.h b/drivers/gpu/drm/i915/gt/intel_gt_pm.h index 6c9a46452364..b1eeb5b33918 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.h +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.h @@ -78,6 +78,7 @@ void intel_gt_pm_fini(struct intel_gt *gt); void intel_gt_suspend_prepare(struct intel_gt *gt); void intel_gt_suspend_late(struct intel_gt *gt); int intel_gt_resume(struct intel_gt *gt); +void intel_gt_resume_early(struct intel_gt *gt); void intel_gt_runtime_suspend(struct intel_gt *gt); int intel_gt_runtime_resume(struct intel_gt *gt); diff --git a/drivers/gpu/drm/i915/i915_driver.c b/drivers/gpu/drm/i915/i915_driver.c index d50347e5773a..78501a83ba10 100644 --- a/drivers/gpu/drm/i915/i915_driver.c +++ b/drivers/gpu/drm/i915/i915_driver.c @@ -1327,10 +1327,8 @@ static int i915_drm_resume_early(struct drm_device *dev) drm_err(_priv->drm, "Resume prepare failed: %d, continuing anyway\n", ret); - for_each_gt(gt, dev_priv, i) { - intel_uncore_resume_early(gt->uncore); - intel_gt_check_and_clear_faults(gt); - } + for_each_gt(gt, dev_priv, i) + intel_gt_resume_early(gt); intel_display_power_resume_early(dev_priv); -- 2.41.0
[PATCH v6 1/4] drm/i915: Introduce intel_gt_mcr_lock_reset()
Implement intel_gt_mcr_lock_reset() to provide a mechanism for resetting the steer semaphore when absolutely necessary. Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 29 ++ drivers/gpu/drm/i915/gt/intel_gt_mcr.h | 1 + 2 files changed, 30 insertions(+) diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c index bf4a933de03a..d98e0d2fc2ee 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c +++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c @@ -419,6 +419,35 @@ void intel_gt_mcr_unlock(struct intel_gt *gt, unsigned long flags) intel_uncore_write_fw(gt->uncore, MTL_STEER_SEMAPHORE, 0x1); } +/** + * intel_gt_mcr_lock_reset - Reset MCR steering lock + * @gt: GT structure + * + * Performs a steer semaphore reset operation. On MTL and beyond, a hardware + * lock will also be taken to serialize access not only for the driver, + * but also for external hardware and firmware agents. + * However, there may be situations where the driver must reset the semaphore + * but only when it is absolutely certain that no other agent should own the + * lock at that given time. + * + * Context: Takes gt->mcr_lock. uncore->lock should *not* be held when this + * function is called, although it may be acquired after this + * function call. + */ +void intel_gt_mcr_lock_reset(struct intel_gt *gt) +{ + unsigned long __flags; + + lockdep_assert_not_held(>uncore->lock); + + spin_lock_irqsave(>mcr_lock, __flags); + + if (GRAPHICS_VER_FULL(gt->i915) >= IP_VER(12, 70)) + intel_uncore_write_fw(gt->uncore, MTL_STEER_SEMAPHORE, 0x1); + + spin_unlock_irqrestore(>mcr_lock, __flags); +} + /** * intel_gt_mcr_read - read a specific instance of an MCR register * @gt: GT structure diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.h b/drivers/gpu/drm/i915/gt/intel_gt_mcr.h index 41684495b7da..485c7711f2e8 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.h +++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.h @@ -11,6 +11,7 @@ void intel_gt_mcr_init(struct intel_gt *gt); void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long *flags); void intel_gt_mcr_unlock(struct intel_gt *gt, unsigned long flags); +void intel_gt_mcr_lock_reset(struct intel_gt *gt); u32 intel_gt_mcr_read(struct intel_gt *gt, i915_mcr_reg_t reg, -- 2.41.0
[PATCH v5 3/3] drm/i915/mtl: Skip MCR ops for ring fault register
On MTL GEN12_RING_FAULT_REG is not replicated so don't do mcr based operation for this register. v2: use MEDIA_VER() instead of GRAPHICS_VER()(Matt). v3: s/"MEDIA_VER(i915) == 13"/"MEDIA_VER(i915) >= 13"(Matt) improve comment. v4: improve the comment further(Andi) Signed-off-by: Nirmoy Das Reviewed-by: Matt Roper Reviewed-by: Andi Shyti Reviewed-by: Andrzej Hajda --- drivers/gpu/drm/i915/gt/intel_gt.c | 13 - drivers/gpu/drm/i915/gt/intel_gt_regs.h | 1 + drivers/gpu/drm/i915/i915_gpu_error.c | 11 ++- 3 files changed, 23 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c index 93062c35e072..dff8bba1f5d4 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.c +++ b/drivers/gpu/drm/i915/gt/intel_gt.c @@ -262,10 +262,21 @@ intel_gt_clear_error_registers(struct intel_gt *gt, I915_MASTER_ERROR_INTERRUPT); } - if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50)) { + /* +* For the media GT, this ring fault register is not replicated, +* so don't do multicast/replicated register read/write operation on it. +*/ + if (MEDIA_VER(i915) >= 13 && gt->type == GT_MEDIA) { + intel_uncore_rmw(uncore, XELPMP_RING_FAULT_REG, +RING_FAULT_VALID, 0); + intel_uncore_posting_read(uncore, + XELPMP_RING_FAULT_REG); + + } else if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50)) { intel_gt_mcr_multicast_rmw(gt, XEHP_RING_FAULT_REG, RING_FAULT_VALID, 0); intel_gt_mcr_read_any(gt, XEHP_RING_FAULT_REG); + } else if (GRAPHICS_VER(i915) >= 12) { intel_uncore_rmw(uncore, GEN12_RING_FAULT_REG, RING_FAULT_VALID, 0); intel_uncore_posting_read(uncore, GEN12_RING_FAULT_REG); diff --git a/drivers/gpu/drm/i915/gt/intel_gt_regs.h b/drivers/gpu/drm/i915/gt/intel_gt_regs.h index cca4bac8f8b0..eecd0a87a647 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_regs.h +++ b/drivers/gpu/drm/i915/gt/intel_gt_regs.h @@ -1084,6 +1084,7 @@ #define GEN12_RING_FAULT_REG _MMIO(0xcec4) #define XEHP_RING_FAULT_REGMCR_REG(0xcec4) +#define XELPMP_RING_FAULT_REG _MMIO(0xcec4) #define GEN8_RING_FAULT_ENGINE_ID(x) (((x) >> 12) & 0x7) #define RING_FAULT_GTTSEL_MASK (1 << 11) #define RING_FAULT_SRCID(x) (((x) >> 3) & 0xff) diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index f4ebcfb70289..b4e31e59c799 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -1234,7 +1234,16 @@ static void engine_record_registers(struct intel_engine_coredump *ee) if (GRAPHICS_VER(i915) >= 6) { ee->rc_psmi = ENGINE_READ(engine, RING_PSMI_CTL); - if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50)) + /* +* For the media GT, this ring fault register is not replicated, +* so don't do multicast/replicated register read/write +* operation on it. +*/ + if (MEDIA_VER(i915) >= 13 && engine->gt->type == GT_MEDIA) + ee->fault_reg = intel_uncore_read(engine->uncore, + XELPMP_RING_FAULT_REG); + + else if (GRAPHICS_VER_FULL(i915) >= IP_VER(12, 50)) ee->fault_reg = intel_gt_mcr_read_any(engine->gt, XEHP_RING_FAULT_REG); else if (GRAPHICS_VER(i915) >= 12) -- 2.41.0
[PATCH v5 2/3] drm/i915: Reset steer semaphore for media GT on resume
During resume, the steer semaphore on GT1 was observed to be held. The hardware team has confirmed the safety of clearing the steer semaphore during driver load/resume, as no lock acquisitions can occur in this process by other agents. v2: reset on resume not in intel_gt_init(). Signed-off-by: Nirmoy Das --- drivers/gpu/drm/i915/gt/intel_gt_pm.c | 12 1 file changed, 12 insertions(+) diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c index 5a942af0a14e..b19062e30b9b 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c @@ -13,6 +13,7 @@ #include "intel_engine_pm.h" #include "intel_gt.h" #include "intel_gt_clock_utils.h" +#include "intel_gt_mcr.h" #include "intel_gt_pm.h" #include "intel_gt_print.h" #include "intel_gt_requests.h" @@ -228,6 +229,17 @@ int intel_gt_resume(struct intel_gt *gt) GT_TRACE(gt, "\n"); + /* +* Reset the steer semaphore on GT1, as we have observed it +* remaining held after a suspend operation. Confirmation +* from the hardware team ensures the safety of resetting +* the steer semaphore during driver load/resume, as there +* are no lock acquisitions during this process by other +* agents. +*/ + if (MEDIA_VER(gt->i915) >= 13 && gt->type == GT_MEDIA) + intel_gt_mcr_lock_reset(gt); + /* * After resume, we may need to poke into the pinned kernel * contexts to paper over any damage caused by the sudden suspend. -- 2.41.0