Re: [PATCH v2] drm/xe/guc: Add GuC based register capture for error capture
Please ignore this patch. The patch is for Xe upstream and was sent to the wrong mailing list.

Regards,
Zhanjun

On 2024-01-16 12:12 p.m., Zhanjun Dong wrote:

Port GuC based register capture for error capture from i915 to Xe. There are 3 parts in this commit:

. Prepare for capture registers. There is a bo created at GuC ADS init time; that is very early, before the engine map is ready, which makes it hard to calculate the capture buffer size, so a new function was created for the worst-case size calculation. Other than that, this part basically follows the i915 design.
. Process the capture notification message. Basically follows the i915 design.
. Sysfs command processing. Xe switched to devcoredump and adopted command-line processing with the captured list.

Signed-off-by: Zhanjun Dong

Zhanjun Dong (9):
  drm/xe/guc: Add register defines for GuC based register capture
  drm/xe/guc: Expose dss per group for GuC error capture
  drm/xe/guc: Update GuC ADS size for error capture
  drm/xe/guc: Add XE_LP steered register lists
  drm/xe/guc: Add capture size check in GuC log buffer
  drm/xe/guc: Check sizing of guc_capture output
  drm/xe/guc: Extract GuC error capture lists on G2H notification
  drm/xe/guc: Pre-allocate output nodes for extraction
  drm/xe/guc: Plumb GuC-capture into dev coredump

 drivers/gpu/drm/xe/Kconfig               |   11 +
 drivers/gpu/drm/xe/Makefile              |    1 +
 drivers/gpu/drm/xe/abi/guc_actions_abi.h |    7 +
 drivers/gpu/drm/xe/regs/xe_engine_regs.h |   12 +
 drivers/gpu/drm/xe/regs/xe_gt_regs.h     |   20 +
 drivers/gpu/drm/xe/xe_gt_mcr.c           |    4 +-
 drivers/gpu/drm/xe/xe_gt_mcr.h           |    1 +
 drivers/gpu/drm/xe/xe_gt_printk.h        |    3 +
 drivers/gpu/drm/xe/xe_gt_topology.c      |    3 -
 drivers/gpu/drm/xe/xe_guc.c              |    7 +
 drivers/gpu/drm/xe/xe_guc_ads.c          |  255 +++-
 drivers/gpu/drm/xe/xe_guc_ads_types.h    |    2 +
 drivers/gpu/drm/xe/xe_guc_capture.c      | 1404 ++
 drivers/gpu/drm/xe/xe_guc_capture.h      |   35 +
 drivers/gpu/drm/xe/xe_guc_capture_fwif.h |  225 
 drivers/gpu/drm/xe/xe_guc_ct.c           |    2 +
 drivers/gpu/drm/xe/xe_guc_fwif.h         |   68 ++
 drivers/gpu/drm/xe/xe_guc_log.c          |  179 +++
 drivers/gpu/drm/xe/xe_guc_log.h          |   15 +
 drivers/gpu/drm/xe/xe_guc_log_types.h    |   26 +
 drivers/gpu/drm/xe/xe_guc_submit.c       |   22 +-
 drivers/gpu/drm/xe/xe_guc_submit.h       |    3 +
 drivers/gpu/drm/xe/xe_guc_types.h        |    2 +
 drivers/gpu/drm/xe/xe_hw_engine.c        |   73 +-
 drivers/gpu/drm/xe/xe_hw_engine_types.h  |  106 +-
 25 files changed, 2355 insertions(+), 131 deletions(-)
 create mode 100644 drivers/gpu/drm/xe/xe_guc_capture.c
 create mode 100644 drivers/gpu/drm/xe/xe_guc_capture.h
 create mode 100644 drivers/gpu/drm/xe/xe_guc_capture_fwif.h
Re: [Intel-gfx] [PATCH] drm/i915: Skip pxp init if gt is wedged
On 2023-10-31 5:38 p.m., Teres Alexis, Alan Previn wrote:

On Fri, 2023-10-27 at 10:13 +0300, Jani Nikula wrote:

On Thu, 26 Oct 2023, Zhanjun Dong wrote:

alan:snip

I'll note that nobody checks the intel_pxp_init() return status, so this silently skips PXP.

BR, Jani.

alan:snip

+	if (intel_gt_is_wedged(gt))
+		return -ENODEV;
+

alan: wondering if we can add a drm_dbg in the caller of intel_pxp_init and use a unique return value for the case of gt_is_wedged (for example: -ENXIO). As we know, gt being wedged at startup basically means all gt usage is dead, and therefore we can't enable PXP (along with everything else that needs submission/guc/etc.). With a drm-debug in the caller that prints that return value, it helps us differentiate between gt-is-wedged vs. a platform config that doesn't support PXP. However, this would mean new drm-debug 'noise' for platforms that i915 just doesn't support PXP on at all, which would be okay if we don't use drm_warn or drm_err and use a 'softer' message like "PXP skipped with %d". Please treat the above comment as a "nit" - i.e. the existing patch is good enough for me... (after addressing Jani's request for more commit message info). ...alan

Thanks Alan. I agree, adding more drm-debug looks like adding noise in the gt_is_wedged case; the existing code already outputs useful info. If the logs already let us know gt_wedged happened, and we do not expect pxp init to run while the gt is wedged, then silently skipping pxp_init matches the expectation. I will re-post with an updated commit message later.

Regards,
Zhanjun
Re: [Intel-gfx] [PATCH 3/3] drm/i915/mtl: Add counters for engine busyness ticks
See comments inline below.

Zhanjun

On 2023-09-22 6:25 p.m., john.c.harri...@intel.com wrote:

From: Umesh Nerlige Ramappa

In the new version of GuC engine busyness, GuC provides engine busyness ticks as a 64 bit counter. Add a new counter to relay this value to the user as is.

Signed-off-by: Umesh Nerlige Ramappa
Signed-off-by: John Harrison
---
 drivers/gpu/drm/i915/gt/intel_engine.h        |  1 +
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 16 +
 drivers/gpu/drm/i915/gt/intel_engine_types.h  | 12 
 drivers/gpu/drm/i915/gt/intel_engine_user.c   |  1 +
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 67 ++-
 drivers/gpu/drm/i915/i915_pmu.c               | 25 ++-
 drivers/gpu/drm/i915/i915_pmu.h               |  2 +-
 include/uapi/drm/i915_drm.h                   | 13 +++-
 8 files changed, 116 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
index b58c30ac8ef02..57af7ec8ecd82 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine.h
@@ -249,6 +249,7 @@ void intel_engine_dump_active_requests(struct list_head *requests,
 ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now);
+u64 intel_engine_get_busy_ticks(struct intel_engine_cs *engine);
 void intel_engine_get_hung_entity(struct intel_engine_cs *engine,
 				  struct intel_context **ce, struct i915_request **rq);
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 84a75c95f3f7d..1c9ffb1ae9889 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -2426,6 +2426,22 @@ ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now)
 	return engine->busyness(engine, now);
 }
 
+/**
+ * intel_engine_get_busy_ticks() - Return current accumulated engine busyness
+ * ticks
+ * @engine: engine to report on
+ *
+ * Returns accumulated ticks @engine was busy since engine stats were enabled.
+ */
+u64 intel_engine_get_busy_ticks(struct intel_engine_cs *engine)
+{
+	if (!engine->busyness_ticks ||
+	    !(engine->flags & I915_ENGINE_SUPPORTS_TICKS_STATS))
+		return 0;
+
+	return engine->busyness_ticks(engine);
+}
+
 struct intel_context *
 intel_engine_create_virtual(struct intel_engine_cs **siblings,
 			    unsigned int count, unsigned long flags)
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index 40fd8f984d64b..a88d40c74d604 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -548,6 +548,11 @@ struct intel_engine_cs {
 	ktime_t (*busyness)(struct intel_engine_cs *engine,
 			    ktime_t *now);
 
+	/*
+	 * Get engine busyness ticks
+	 */
+	u64 (*busyness_ticks)(struct intel_engine_cs *engine);
+
 	struct intel_engine_execlists execlists;
 
 	/*
@@ -574,6 +579,7 @@ struct intel_engine_cs {
 #define I915_ENGINE_HAS_EU_PRIORITY		BIT(10)
 #define I915_ENGINE_FIRST_RENDER_COMPUTE	BIT(11)
 #define I915_ENGINE_USES_WA_HOLD_CCS_SWITCHOUT	BIT(12)
+#define I915_ENGINE_SUPPORTS_TICKS_STATS	BIT(13)
 	unsigned int flags;
 
 	/*
@@ -649,6 +655,12 @@ intel_engine_supports_stats(const struct intel_engine_cs *engine)
 	return engine->flags & I915_ENGINE_SUPPORTS_STATS;
 }
 
+static inline bool
+intel_engine_supports_tick_stats(const struct intel_engine_cs *engine)
+{
+	return engine->flags & I915_ENGINE_SUPPORTS_TICKS_STATS;
+}
+
 static inline bool
 intel_engine_has_preemption(const struct intel_engine_cs *engine)
 {
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_user.c b/drivers/gpu/drm/i915/gt/intel_engine_user.c
index dcedff41a825f..69eb610b5ab0a 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_user.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_user.c
@@ -100,6 +100,7 @@ static void set_scheduler_caps(struct drm_i915_private *i915)
 		MAP(HAS_PREEMPTION, PREEMPTION),
 		MAP(HAS_SEMAPHORES, SEMAPHORES),
 		MAP(SUPPORTS_STATS, ENGINE_BUSY_STATS),
+		MAP(SUPPORTS_TICKS_STATS, ENGINE_BUSY_TICKS_STATS),
 #undef MAP
 	};
 
 	struct intel_engine_cs *engine;
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 0c1fee5360777..71749fb9ad35b 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1289,12 +1289,7 @@ static void busy_v1_guc_update_pm_timestamp(struct intel_guc *guc, ktime_t *now)
 	guc->busy.v1.gt_stamp = ((u64)gt_stamp_hi << 32) | gt_stamp_lo;
 }
 
-/*
- * Unlike the execlist mode of
Re: [Intel-gfx] [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset
> -----Original Message-----
> From: Daniel Vetter
> Sent: August 22, 2023 9:51 AM
> To: Dong, Zhanjun
> Cc: intel-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org; Harrison, John C; Andi Shyti; Daniel Vetter
> Subject: Re: [PATCH v5] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset
>
> On Fri, Aug 11, 2023 at 11:20:11AM -0700, Zhanjun Dong wrote:
> > This attempts to avoid circular locking dependency between flush delayed
> > work and intel_gt_reset.
> > When intel_gt_reset is called, the task will hold a lock.
> > To cancel the delayed work here, the _sync version will also acquire a lock,
> > which might trigger the possible circular locking dependency warning.
> > When intel_gt_reset is called, the reset_in_progress flag will be set; add code
> > to check the flag and call the async version if a reset is in progress.
> >
> > Signed-off-by: Zhanjun Dong
> > Cc: John Harrison
> > Cc: Andi Shyti
> > Cc: Daniel Vetter
> > ---
> >  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 ++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index a0e3ef1c65d2..600388c849f7 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1359,7 +1359,16 @@ static void guc_enable_busyness_worker(struct intel_guc *guc)
> >
> >  static void guc_cancel_busyness_worker(struct intel_guc *guc)
> >  {
> > -	cancel_delayed_work_sync(&guc->timestamp.work);
> > +	/*
> > +	 * When intel_gt_reset is called, the task will hold a lock.
> > +	 * To cancel the delayed work here, the _sync version will also
> > +	 * acquire a lock, which might trigger the possible circular
> > +	 * locking dependency warning.
>
> This is not even close to a locking bugfix. Consider this a formal nack,
> because the issue here is not even close to "needs more comments to
> explain what's going on".
>
> -Daniel

The purpose of the comment here is to explain the locking issue condition.

> > +	 * Check the reset_in_progress flag and call the async version
> > +	 * if a reset is in progress.

The comment here explains checking the flag to avoid the locking condition. The reset process is not expected to complete in a short time; other than that, did we miss anything?

> > +	 */
> > +	if (guc_to_gt(guc)->uc.reset_in_progress)
> > +		cancel_delayed_work(&guc->timestamp.work);
> > +	else
> > +		cancel_delayed_work_sync(&guc->timestamp.work);
> >  }
> >
> >  static void __reset_guc_busyness_stats(struct intel_guc *guc)
> > --
> > 2.34.1
> >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Re: [Intel-gfx] [PATCH v4] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset
Hi Daniel,

On 2023-08-03 9:03 a.m., Daniel Vetter wrote:

On Thu, 27 Jul 2023 at 22:13, Zhanjun Dong wrote:

This attempts to avoid a circular locking dependency between flushing delayed work and intel_gt_reset. Switched from cancel_delayed_work_sync to cancel_delayed_work, the non-sync version, for the reset path; it is safe as the worker has trylock code to handle the lock. Meanwhile, keep the sync version for park/fini to ensure the worker is not still running during suspend or shutdown.

WARNING: possible circular locking dependency detected
6.4.0-rc1-drmtip_1340-g31e3463b0edb+ #1 Not tainted
--
kms_pipe_crc_ba/6415 is trying to acquire lock:
88813e6cc640 ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}, at: __flush_work+0x42/0x530

but task is already holding lock:
88813e6cce90 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0x19e/0x470 [i915]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (&gt->reset.mutex){+.+.}-{3:3}:
       lock_acquire+0xd8/0x2d0
       i915_gem_shrinker_taints_mutex+0x31/0x50 [i915]
       intel_gt_init_reset+0x65/0x80 [i915]
       intel_gt_common_init_early+0xe1/0x170 [i915]
       intel_root_gt_init_early+0x48/0x60 [i915]
       i915_driver_probe+0x671/0xcb0 [i915]
       i915_pci_probe+0xdc/0x210 [i915]
       pci_device_probe+0x95/0x120
       really_probe+0x164/0x3c0
       __driver_probe_device+0x73/0x160
       driver_probe_device+0x19/0xa0
       __driver_attach+0xb6/0x180
       bus_for_each_dev+0x77/0xd0
       bus_add_driver+0x114/0x210
       driver_register+0x5b/0x110
       __pfx_vgem_open+0x3/0x10 [vgem]
       do_one_initcall+0x57/0x270
       do_init_module+0x5f/0x220
       load_module+0x1ca4/0x1f00
       __do_sys_finit_module+0xb4/0x130
       do_syscall_64+0x3c/0x90
       entry_SYSCALL_64_after_hwframe+0x72/0xdc

-> #2 (fs_reclaim){+.+.}-{0:0}:
       lock_acquire+0xd8/0x2d0
       fs_reclaim_acquire+0xac/0xe0
       kmem_cache_alloc+0x32/0x260
       i915_vma_instance+0xb2/0xc60 [i915]
       i915_gem_object_ggtt_pin_ww+0x175/0x370 [i915]
       vm_fault_gtt+0x22d/0xf60 [i915]
       __do_fault+0x2f/0x1d0
       do_pte_missing+0x4a/0xd20
       __handle_mm_fault+0x5b0/0x790
       handle_mm_fault+0xa2/0x230
       do_user_addr_fault+0x3ea/0xa10
       exc_page_fault+0x68/0x1a0
       asm_exc_page_fault+0x26/0x30

-> #1 (&gt->reset.backoff_srcu){}-{0:0}:
       lock_acquire+0xd8/0x2d0
       _intel_gt_reset_lock+0x57/0x330 [i915]
       guc_timestamp_ping+0x35/0x130 [i915]
       process_one_work+0x250/0x510
       worker_thread+0x4f/0x3a0
       kthread+0xff/0x130
       ret_from_fork+0x29/0x50

-> #0 ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}:
       check_prev_add+0x90/0xc60
       __lock_acquire+0x1998/0x2590
       lock_acquire+0xd8/0x2d0
       __flush_work+0x74/0x530
       __cancel_work_timer+0x14f/0x1f0
       intel_guc_submission_reset_prepare+0x81/0x4b0 [i915]
       intel_uc_reset_prepare+0x9c/0x120 [i915]
       reset_prepare+0x21/0x60 [i915]
       intel_gt_reset+0x1dd/0x470 [i915]
       intel_gt_reset_global+0xfb/0x170 [i915]
       intel_gt_handle_error+0x368/0x420 [i915]
       intel_gt_debugfs_reset_store+0x5c/0xc0 [i915]
       i915_wedged_set+0x29/0x40 [i915]
       simple_attr_write_xsigned.constprop.0+0xb4/0x110
       full_proxy_write+0x52/0x80
       vfs_write+0xc5/0x4f0
       ksys_write+0x64/0xe0
       do_syscall_64+0x3c/0x90
       entry_SYSCALL_64_after_hwframe+0x72/0xdc

other info that might help us debug this:

Chain exists of:
  (work_completion)(&(&guc->timestamp.work)->work) --> fs_reclaim --> &gt->reset.mutex

Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&gt->reset.mutex);
                               lock(fs_reclaim);
                               lock(&gt->reset.mutex);
  lock((work_completion)(&(&guc->timestamp.work)->work));

 *** DEADLOCK ***

3 locks held by kms_pipe_crc_ba/6415:
 #0: 888101541430 (sb_writers#15){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
 #1: 888136c7eab8 (&attr->mutex){+.+.}-{3:3}, at: simple_attr_write_xsigned.constprop.0+0x47/0x110
 #2: 88813e6cce90 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0x19e/0x470 [i915]

v2: Add sync flag to guc_cancel_busyness_worker to ensure reset path calls asynchronous cancel.
v3: Add sync flag to intel_guc_submission_disable to ensure reset path calls asynchronous cancel.
v4: Set to always sync from __uc_fini_hw path.
Signed-off-by: Zhanjun Dong
Cc: John Harrison
Cc: Andi Shyti
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 17 ++---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.h | 2 +-
 drivers/gpu/drm/i915/gt/uc/intel_uc.c         | 4 ++--
 3 files changed, 13 insertions(+), 10
Re: [Intel-gfx] [PATCH v4] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset
Hi Andi,

On 2023-08-03 8:36 a.m., Andi Shyti wrote:

Hi Zhanjun,

On Thu, Jul 27, 2023 at 01:13:23PM -0700, Zhanjun Dong wrote:

This attempts to avoid a circular locking dependency between flushing delayed work and intel_gt_reset. Switched from cancel_delayed_work_sync to cancel_delayed_work, the non-sync version, for the reset path; it is safe as the worker has trylock code to handle the lock. Meanwhile, keep the sync version for park/fini to ensure the worker is not still running during suspend or shutdown.

Next time, please wrap the sentences to 65 characters (according to the e-mail netiquette, RFC1855[1]) or 70-75 characters (according to the kernel guidelines[2]).

[1] https://www.ietf.org/rfc/rfc1855.txt chapter "2.1.1 For mail", page 3
[2] https://docs.kernel.org/process/submitting-patches.html chapter "The canonical patch format"

Thanks, will be fixed in the next revision.

[lockdep splat snipped; it is identical to the trace quoted earlier in this digest]

v2: Add sync flag to guc_cancel_busyness_worker to ensure reset path calls asynchronous cancel.
v3: Add sync flag to
Re: [Intel-gfx] [PATCH] drm/i915/gt: Remove incorrect hard coded cache coherency setting
Hi Fei,

Thanks for the review. I put my answers inline below.

Regards,
Zhanjun

On 2023-06-22 6:20 p.m., Yang, Fei wrote:

> The previous i915_gem_object_create_internal already set it with the proper
> value before the function returns. This hard coded setting is incorrect for
> platforms like MTL, thus it needs to be removed.
>
> Signed-off-by: Zhanjun Dong
> ---
>  drivers/gpu/drm/i915/gt/intel_timeline.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_timeline.c b/drivers/gpu/drm/i915/gt/intel_timeline.c
> index b9640212d659..693d18e14b00 100644
> --- a/drivers/gpu/drm/i915/gt/intel_timeline.c
> +++ b/drivers/gpu/drm/i915/gt/intel_timeline.c
> @@ -26,8 +26,6 @@ static struct i915_vma *hwsp_alloc(struct intel_gt *gt)
>  	if (IS_ERR(obj))
>  		return ERR_CAST(obj);
>
> -	i915_gem_object_set_cache_coherency(obj, I915_CACHE_LLC);
> -

Does this change really fix the coherency issue?

Testing in progress. The issue was reported by the E2E team, and it is now their public holiday. Meanwhile, I have trouble running the test case on my setup; I need to sync with them later.

I consulted with Chris and he said that the hwsp is purposely set to be cacheable. The mapping on the CPU side also indicates it's cacheable.

For single-ended access that setting works well. Here the problem is that the head/tail memory area requires a different cache setting. As the previous i915_gem_object_create_internal already set the cache setting properly for the current platform, why do we overwrite it here?

intel_timeline_pin_map(struct intel_timeline *timeline)
{
	struct drm_i915_gem_object *obj = timeline->hwsp_ggtt->obj;
	u32 ofs = offset_in_page(timeline->hwsp_offset);
	void *vaddr;

	vaddr = i915_gem_object_pin_map(obj, I915_MAP_WB);
	...
}

Maybe we should also set it to match the platform as well?

>  	vma = i915_vma_instance(obj, &gt->ggtt->vm, NULL);
>  	if (IS_ERR(vma))
>  		i915_gem_object_put(obj);
> --
> 2.34.1
Re: [Intel-gfx] [PATCH] drm/i915/gt: Remove incorrect hard coded cache coherency setting
Resend to restart the CI; https://patchwork.freedesktop.org/series/119485/ was stuck.

Regards,
Zhanjun

On 2023-06-22 11:26 a.m., Zhanjun Dong wrote:

The previous i915_gem_object_create_internal already set it with the proper value before the function returns. This hard coded setting is incorrect for platforms like MTL, thus it needs to be removed.

Signed-off-by: Zhanjun Dong
---
 drivers/gpu/drm/i915/gt/intel_timeline.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_timeline.c b/drivers/gpu/drm/i915/gt/intel_timeline.c
index b9640212d659..693d18e14b00 100644
--- a/drivers/gpu/drm/i915/gt/intel_timeline.c
+++ b/drivers/gpu/drm/i915/gt/intel_timeline.c
@@ -26,8 +26,6 @@ static struct i915_vma *hwsp_alloc(struct intel_gt *gt)
 	if (IS_ERR(obj))
 		return ERR_CAST(obj);
 
-	i915_gem_object_set_cache_coherency(obj, I915_CACHE_LLC);
-
 	vma = i915_vma_instance(obj, &gt->ggtt->vm, NULL);
 	if (IS_ERR(vma))
 		i915_gem_object_put(obj);
Re: [Intel-gfx] [PATCH v3] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset
V3 follows John's suggestion, option 1. The better option is under discussion and might have broader impact; meanwhile we can start with option 1, check the CI system reports, and see if the issue gets better.

Regards,
Zhanjun Dong

On 2023-06-15 5:15 p.m., Zhanjun Dong wrote:

This attempts to avoid a circular locking dependency between flushing delayed work and intel_gt_reset. Switched from cancel_delayed_work_sync to cancel_delayed_work, the non-sync version, for the reset path; it is safe as the worker has trylock code to handle the lock. Meanwhile, keep the sync version for park/fini to ensure the worker is not still running during suspend or shutdown.

[lockdep splat snipped; it is identical to the trace quoted earlier in this digest]

v2: Add sync flag to guc_cancel_busyness_worker to ensure reset path calls asynchronous cancel.
v3: Add sync flag to intel_guc_submission_disable to ensure reset path calls asynchronous cancel.

Signed-off-by: Zhanjun Dong
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 17 ++---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.h | 2 +-
Re: [Intel-gfx] [PATCH] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset
John mentioned 2 options:

1. "add the sync/async flag to _submission_disable()" - considered to be a small change.
2. add an 'are busyness stats enabled' boolean to the guc structure - the affected area in the flow seems much larger than option 1.

I would like to discuss a bit more before moving forward.

Regards,
Zhanjun Dong

From: Harrison, John C
Sent: June 7, 2023 4:17 PM
To: Dong, Zhanjun; intel-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org; Nerlige Ramappa, Umesh; Ceraolo Spurio, Daniele
Subject: Re: [Intel-gfx] [PATCH] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset

On 6/7/2023 12:03, Zhanjun Dong wrote:

This attempts to avoid a circular locking dependency between flushing delayed work and intel_gt_reset. Switched from cancel_delayed_work_sync to cancel_delayed_work, the non-sync version, for the reset path; it is safe as the worker has trylock code to handle the lock. Meanwhile, keep the sync version for park/fini to ensure the worker is not still running during suspend or shutdown.

[lockdep splat snipped; it is identical to the trace quoted earlier in this digest]
Re: [Intel-gfx] [PATCH] drm/i915: Avoid circular locking dependency when flush delayed work on gt reset
Hi Andi, Thanks for comments. Info would be updated on next revision, which is on the way. Regards, Zhanjun Dong On 2023-06-07 8:19 p.m., Andi Shyti wrote: Hi Dong, On Wed, Jun 07, 2023 at 12:03:50PM -0700, Zhanjun Dong wrote: This attempts to avoid circular locking dependency between flush delayed work and intel_gt_reset. Switched from cancel_delayed_work_sync to cancel_delayed_work, the non-sync version for reset path, it is safe as the worker has the trylock code to handle the lock; Meanwhile keep the sync version for park/fini to ensure the worker is not still running during suspend or shutdown. WARNING: possible circular locking dependency detected 6.4.0-rc1-drmtip_1340-g31e3463b0edb+ #1 Not tainted -- kms_pipe_crc_ba/6415 is trying to acquire lock: 88813e6cc640 ((work_completion)(&(>timestamp.work)->work)){+.+.}-{0:0}, at: __flush_work+0x42/0x530 but task is already holding lock: 88813e6cce90 (>reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0x19e/0x470 [i915] which lock already depends on the new lock. 
the existing dependency chain (in reverse order) is:

-> #3 (&gt->reset.mutex){+.+.}-{3:3}:
       lock_acquire+0xd8/0x2d0
       i915_gem_shrinker_taints_mutex+0x31/0x50 [i915]
       intel_gt_init_reset+0x65/0x80 [i915]
       intel_gt_common_init_early+0xe1/0x170 [i915]
       intel_root_gt_init_early+0x48/0x60 [i915]
       i915_driver_probe+0x671/0xcb0 [i915]
       i915_pci_probe+0xdc/0x210 [i915]
       pci_device_probe+0x95/0x120
       really_probe+0x164/0x3c0
       __driver_probe_device+0x73/0x160
       driver_probe_device+0x19/0xa0
       __driver_attach+0xb6/0x180
       bus_for_each_dev+0x77/0xd0
       bus_add_driver+0x114/0x210
       driver_register+0x5b/0x110
       __pfx_vgem_open+0x3/0x10 [vgem]
       do_one_initcall+0x57/0x270
       do_init_module+0x5f/0x220
       load_module+0x1ca4/0x1f00
       __do_sys_finit_module+0xb4/0x130
       do_syscall_64+0x3c/0x90
       entry_SYSCALL_64_after_hwframe+0x72/0xdc

-> #2 (fs_reclaim){+.+.}-{0:0}:
       lock_acquire+0xd8/0x2d0
       fs_reclaim_acquire+0xac/0xe0
       kmem_cache_alloc+0x32/0x260
       i915_vma_instance+0xb2/0xc60 [i915]
       i915_gem_object_ggtt_pin_ww+0x175/0x370 [i915]
       vm_fault_gtt+0x22d/0xf60 [i915]
       __do_fault+0x2f/0x1d0
       do_pte_missing+0x4a/0xd20
       __handle_mm_fault+0x5b0/0x790
       handle_mm_fault+0xa2/0x230
       do_user_addr_fault+0x3ea/0xa10
       exc_page_fault+0x68/0x1a0
       asm_exc_page_fault+0x26/0x30

-> #1 (&gt->reset.backoff_srcu){++++}-{0:0}:
       lock_acquire+0xd8/0x2d0
       _intel_gt_reset_lock+0x57/0x330 [i915]
       guc_timestamp_ping+0x35/0x130 [i915]
       process_one_work+0x250/0x510
       worker_thread+0x4f/0x3a0
       kthread+0xff/0x130
       ret_from_fork+0x29/0x50

-> #0 ((work_completion)(&(&guc->timestamp.work)->work)){+.+.}-{0:0}:
       check_prev_add+0x90/0xc60
       __lock_acquire+0x1998/0x2590
       lock_acquire+0xd8/0x2d0
       __flush_work+0x74/0x530
       __cancel_work_timer+0x14f/0x1f0
       intel_guc_submission_reset_prepare+0x81/0x4b0 [i915]
       intel_uc_reset_prepare+0x9c/0x120 [i915]
       reset_prepare+0x21/0x60 [i915]
       intel_gt_reset+0x1dd/0x470 [i915]
       intel_gt_reset_global+0xfb/0x170 [i915]
       intel_gt_handle_error+0x368/0x420 [i915]
       intel_gt_debugfs_reset_store+0x5c/0xc0 [i915]
       i915_wedged_set+0x29/0x40 [i915]
       simple_attr_write_xsigned.constprop.0+0xb4/0x110
       full_proxy_write+0x52/0x80
       vfs_write+0xc5/0x4f0
       ksys_write+0x64/0xe0
       do_syscall_64+0x3c/0x90
       entry_SYSCALL_64_after_hwframe+0x72/0xdc

other info that might help us debug this:

Chain exists of:
  (work_completion)(&(&guc->timestamp.work)->work) --> fs_reclaim --> &gt->reset.mutex

Possible unsafe locking scenario:

      CPU0                    CPU1
      ----                    ----
 lock(&gt->reset.mutex);
                              lock(fs_reclaim);
                              lock(&gt->reset.mutex);
 lock((work_completion)(&(&guc->timestamp.work)->work));

*** DEADLOCK ***

3 locks held by kms_pipe_crc_ba/6415:
 #0: 888101541430 (sb_writers#15){.+.+}-{0:0}, at: ksys_write+0x64/0xe0
 #1: 888136c7eab8 (&attr->mutex){+.+.}-{3:3}, at: simple_attr_write_xsigned.constprop.0+0x47/0x110
 #2: 88813e6cce90 (&gt->reset.mutex){+.+.}-{3:3}, at: intel_gt_reset+0x19e/0x470 [i915]

Signed-off-by: Zhanjun Dong

Andrzej's r-b is missing here.

---

Please add a version to your patch and a changelog.

Thanks,
Andi
Re: [Intel-gfx] [v2] drm/i915/selftest/gsc: Ensure GSC Proxy init completes before selftests
> -----Original Message-----
> From: Teres Alexis, Alan Previn
> Sent: June 8, 2023 2:31 PM
> To: Dong, Zhanjun ; intel-g...@lists.freedesktop.org
> Cc: dri-de...@lists.freedesktop.org
> Subject: Re: [v2] drm/i915/selftest/gsc: Ensure GSC Proxy init completes before selftests
>
> On Thu, 2023-06-08 at 18:14 +, Dong, Zhanjun wrote:
> > See my comments below.
> >
> > > -----Original Message-----
> > > From: Alan Previn
alan:snip
> > > +static int
> > > +__wait_gsc_proxy_completed(struct drm_i915_private *i915,
> > > +                           unsigned long timeout_ms)
> > > +{
> > > +	bool need_to_wait = (IS_ENABLED(CONFIG_INTEL_MEI_GSC_PROXY) &&
> > > +			     i915->media_gt &&
> > > +			     HAS_ENGINE(i915->media_gt, GSC0) &&
> > > +			     intel_uc_fw_is_loadable(&i915->media_gt->uc.gsc.fw));
> > > +
> > > +	/*
> > > +	 * For gsc proxy component loading + init, we need a much longer timeout
> > > +	 * than what CI selftest infrastructure currently uses. This longer wait
> > > +	 * period depends on the kernel config and component driver load ordering
> > > +	 */
> > > +	if (timeout_ms < 8000)
> > > +		timeout_ms = 8000;
> >
> > LGTM, just a concern about the fixed number here: shall we set the minimum here, or let i915_selftest.timeout_ms take control? Then no code change would be needed here in the future.
> >
> > Reviewed-by: Zhanjun Dong
>
> Thanks Zhanjun. Unfortunately, based on internal testing, the i915_selftest.timeout_ms default is so low that it does occasionally time out in CI. From experience, with a lean Ubuntu config, it typically takes about ~1 second for the mei-gsc-sw-proxy component driver to load after i915 loads. Since CI regularly unloads and reloads i915, the observed timeout ends up being reported as an issue.
>
> 8 seconds was based on internal testing of the worst-case scenario, which hardly ever happens.
> We've only seen the 8-second case happen when the kernel config has very many SOC IP
> drivers and component drivers enabled (seen in at least one customer config), or if the
> MTL board IFWI was only just reflashed (this would be a one-off 8 seconds, we suspect
> due to the firmware doing additional steps).

Thanks for the detailed info. The i915_selftest.timeout_ms default is too low for this test, so it needs a special minimum of its own; that is a valid reason. The concern I raised is minor. Looks good to me.

Regards,
Zhanjun
Re: [Intel-gfx] [v2] drm/i915/selftest/gsc: Ensure GSC Proxy init completes before selftests
See my comments below.

> -----Original Message-----
> From: Alan Previn
> Sent: May 30, 2023 1:01 PM
> To: intel-gfx@lists.freedesktop.org
> Cc: dri-de...@lists.freedesktop.org; Alan Previn
> Subject: [v2] drm/i915/selftest/gsc: Ensure GSC Proxy init completes before selftests
>
> On MTL, if the GSC Proxy init flows haven't completed, submissions to the
> GSC engine will fail. Those init flows are dependent on the mei's
> gsc_proxy component, which is loaded in parallel with i915, and a
> worker that could potentially start after i915 driver init is done.
>
> That said, all subsystems that access the GSC engine today do check
> for such init flow completion before using the GSC engine. However,
> selftests currently don't wait on anything before starting.
>
> To fix this, add a waiter function at the start of __run_selftests
> that waits for gsc-proxy init flows to complete. While implementing this,
> use a table of function pointers so it is scalable to add additional
> waiter functions for future such "wait on dependency" cases.
>
> Difference from prior versions:
>    v1: Based on internal testing, increase the timeout for gsc-proxy
>        specific case to 8 seconds.
>
> Signed-off-by: Alan Previn
> ---
>  .../gpu/drm/i915/selftests/i915_selftest.c | 61 +++
>  1 file changed, 61 insertions(+)
>
> base-commit: 0ae4ee2c735979030a0219218081eee661606921
>
> diff --git a/drivers/gpu/drm/i915/selftests/i915_selftest.c b/drivers/gpu/drm/i915/selftests/i915_selftest.c
> index 39da0fb0d6d2..77168a7a7e3f 100644
> --- a/drivers/gpu/drm/i915/selftests/i915_selftest.c
> +++ b/drivers/gpu/drm/i915/selftests/i915_selftest.c
> @@ -24,6 +24,8 @@
>  #include
>
>  #include "gt/intel_gt_pm.h"
> +#include "gt/uc/intel_gsc_fw.h"
> +
>  #include "i915_driver.h"
>  #include "i915_drv.h"
>  #include "i915_selftest.h"
> @@ -127,6 +129,63 @@ static void set_default_test_all(struct selftest *st, unsigned int count)
>  		st[i].enabled = true;
>  }
>
> +static int
> +__wait_gsc_proxy_completed(struct drm_i915_private *i915,
> +			   unsigned long timeout_ms)
> +{
> +	bool need_to_wait = (IS_ENABLED(CONFIG_INTEL_MEI_GSC_PROXY) &&
> +			     i915->media_gt &&
> +			     HAS_ENGINE(i915->media_gt, GSC0) &&
> +			     intel_uc_fw_is_loadable(&i915->media_gt->uc.gsc.fw));
> +
> +	/*
> +	 * For gsc proxy component loading + init, we need a much longer timeout
> +	 * than what CI selftest infrastructure currently uses. This longer wait
> +	 * period depends on the kernel config and component driver load ordering
> +	 */
> +	if (timeout_ms < 8000)
> +		timeout_ms = 8000;

LGTM, just a concern about the fixed number here: shall we set the minimum here, or let i915_selftest.timeout_ms take control? Then no code change would be needed here in the future.
Reviewed-by: Zhanjun Dong

> +
> +	if (need_to_wait &&
> +	    (wait_for(intel_gsc_uc_fw_proxy_init_done(&i915->media_gt->uc.gsc),
> +		      timeout_ms)))
> +		return -ETIME;
> +
> +	return 0;
> +}
> +
> +struct __startup_waiter {
> +	const char *name;
> +	int (*wait_to_completed)(struct drm_i915_private *i915, unsigned long timeout_ms);
> +};
> +
> +static struct __startup_waiter all_startup_waiters[] = { \
> +	{"gsc_proxy", __wait_gsc_proxy_completed} \
> +	};
> +
> +static int __wait_on_all_system_dependencies(struct drm_i915_private *i915)
> +{
> +	struct __startup_waiter *waiter = all_startup_waiters;
> +	int count = ARRAY_SIZE(all_startup_waiters);
> +	int ret;
> +
> +	if (!waiter || !count || !i915)
> +		return 0;
> +
> +	for (; count--; waiter++) {
> +		if (!waiter->wait_to_completed)
> +			continue;
> +		ret = waiter->wait_to_completed(i915, i915_selftest.timeout_ms);
> +		if (ret) {
> +			pr_info(DRIVER_NAME ": Pre-selftest waiter %s failed with %d\n",
> +				waiter->name, ret);
> +			return ret;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
>  static int __run_selftests(const char *name,
>  			   struct selftest *st,
>  			   unsigned int count,
> @@ -134,6 +193,8 @@ static int __run_selftests(const char *name,
>  {
>  	int err = 0;
>
> +	__wait_on_all_system_dependencies(data);
> +
>  	while (!i915_selftest.random_seed)
>  		i915_selftest.random_seed = get_random_u32();
>
Re: [Intel-gfx] [PATCH] drm/i915: Set wedged if enable guc communication failed
Thanks Jani. Updated patch sent; let me know if you have any comments.

Regards,
Zhanjun

> -----Original Message-----
> From: Jani Nikula
> Sent: February 27, 2023 6:30 AM
> To: Dong, Zhanjun ; intel-g...@lists.freedesktop.org; dri-de...@lists.freedesktop.org
> Cc: Dong, Zhanjun
> Subject: Re: [PATCH] drm/i915: Set wedged if enable guc communication failed
>
> On Fri, 24 Feb 2023, Zhanjun Dong wrote:
> > Add err code check for enable_communication on resume path, set wedged if failed.
>
> I can see that this is *what* the code does, but the commit message should
> answer the question *why*.
>
> >
> > Signed-off-by: Zhanjun Dong
> > ---
> >  drivers/gpu/drm/i915/gt/intel_gt_pm.c | 5 -
> >  drivers/gpu/drm/i915/gt/uc/intel_uc.c | 9 +++--
> >  2 files changed, 11 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > index cef3d6f5c34e..f3bb7cbbd293 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > @@ -401,8 +401,11 @@ int intel_gt_runtime_resume(struct intel_gt *gt)
> >  	intel_ggtt_restore_fences(gt->ggtt);
> >
> >  	ret = intel_uc_runtime_resume(&gt->uc);
> > -	if (ret)
> > +	if (ret) {
> > +		/* Set wedge if uc resume failed */
>
> This comment is just a reiteration of the C code in English, but doesn't
> provide any useful additional information.
>
> BR,
> Jani.
> > +		intel_gt_set_wedged(gt);
> >  		return ret;
> > +	}
> >
> >  	return 0;
> >  }
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> > index 6648691bd645..d4f428acf20a 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> > @@ -698,8 +698,13 @@ static int __uc_resume(struct intel_uc *uc, bool enable_communication)
> >  	/* Make sure we enable communication if and only if it's disabled */
> >  	GEM_BUG_ON(enable_communication == intel_guc_ct_enabled(&guc->ct));
> >
> > -	if (enable_communication)
> > -		guc_enable_communication(guc);
> > +	if (enable_communication) {
> > +		err = guc_enable_communication(guc);
> > +		if (err) {
> > +			guc_dbg(guc, "Failed to resume, %pe", ERR_PTR(err));
> > +			return err;
> > +		}
> > +	}
> >
> >  	/* If we are only resuming GuC communication but not reloading
> >  	 * GuC, we need to ensure the ARAT timer interrupt is enabled
>
> --
> Jani Nikula, Intel Open Source Graphics Center
Re: [Intel-gfx] [PATCH] drm/i915/guc: Check for ct enabled while waiting for response
> -----Original Message-----
> From: Dixit, Ashutosh
> Sent: July 12, 2022 3:48 PM
> To: Dong, Zhanjun
> Cc: intel-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org
> Subject: Re: [Intel-gfx] [PATCH] drm/i915/guc: Check for ct enabled while waiting for response
>
> On Thu, 16 Jun 2022 21:50:55 -0700, Dixit, Ashutosh wrote:
> >
> > On Thu, 16 Jun 2022 15:01:59 -0700, Zhanjun Dong wrote:
> > >
> > > We are seeing the error message "No response for request". Some cases
> > > happened while waiting for a response when a reset/suspend action was triggered.
> > > In this case, no response is not an error; active requests will be cancelled.
> > >
> > > This patch handles this condition and changes the error message
> > > into a debug message.
> > >
> > > Signed-off-by: Zhanjun Dong
> > > ---
> > >  drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c | 24 ---
> > >  1 file changed, 17 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > > index f01325cd1b62..f07a7666b1ad 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c
> > > @@ -455,6 +455,7 @@ static int ct_write(struct intel_guc_ct *ct,
> > >
> > >  /**
> > >   * wait_for_ct_request_update - Wait for CT request state update.
> > > + * @ct: pointer to CT
> > >   * @req: pointer to pending request
> > >   * @status: placeholder for status
> > >   *
> > > @@ -467,9 +468,10 @@ static int ct_write(struct intel_guc_ct *ct,
> > >   * * 0 response received (status is valid)
> > >   * * -ETIMEDOUT no response within hardcoded timeout
> > >   */
> > > -static int wait_for_ct_request_update(struct ct_request *req, u32 *status)
> > > +static int wait_for_ct_request_update(struct intel_guc_ct *ct,
> > > +				      struct ct_request *req, u32 *status)
> > >  {
> > >  	int err;
> > > +	bool ct_enabled;
> > >
> > >  	/*
> > >  	 * Fast commands should complete in less than 10us, so sample quickly
> > > @@ -481,12 +483,15 @@ static int wait_for_ct_request_update(struct ct_request *req, u32 *status)
> > >  #define GUC_CTB_RESPONSE_TIMEOUT_SHORT_MS 10
> > >  #define GUC_CTB_RESPONSE_TIMEOUT_LONG_MS 1000
> > >  #define done \
> > > -	(FIELD_GET(GUC_HXG_MSG_0_ORIGIN, READ_ONCE(req->status)) == \
> > > +	(!(ct_enabled = intel_guc_ct_enabled(ct)) || \
> > > +	 FIELD_GET(GUC_HXG_MSG_0_ORIGIN, READ_ONCE(req->status)) == \
> > >  	 GUC_HXG_ORIGIN_GUC)
> > >  	err = wait_for_us(done, GUC_CTB_RESPONSE_TIMEOUT_SHORT_MS);
> > >  	if (err)
> > >  		err = wait_for(done, GUC_CTB_RESPONSE_TIMEOUT_LONG_MS);
> > >  #undef done
> > > +	if (!ct_enabled)
> > > +		err = -ECANCELED;
> >
> > Actually here's an even simpler suggestion. We could just do:
> >
> >	if (!ct_enabled)
> >		CT_DEBUG(ct, "Request %#x (fence %u) cancelled as CTB is disabled\n", ...);
> >
> > And return 0 as before. This way we won't have to make any changes in
> > either ct_send() or intel_guc_ct_send(). So intel_guc_ct_enabled()
> > just serves to get us out of the wait early and prevent the -ETIMEDOUT
> > return (and 0 return avoids all the error messages we are trying to eliminate).
> > Actually we will need to unlink the request too, so it will be something like:
> >
> >	if (!ct_enabled) {
> >		CT_DEBUG(ct, "Request %#x (fence %u) cancelled as CTB is disabled\n", ...);
> >
> >		spin_lock_irqsave(&ct->requests.lock, flags);
> >		list_del();
> >		spin_unlock_irqrestore(&ct->requests.lock, flags);
> >	}

I agree; the caller needs a non-zero err to know the request did not succeed, and to unlink the request. The caller, ct_send, will do the unlink. As for the error code: while ECANCELED is used here, intel_guc_ct_send returns ENODEV if CT is disabled, so this patch will be changed to ENODEV to match it.
Re: [Intel-gfx] [PATCH] drm/i915/guc: Check ctx while waiting for response
Thanks for all the comments. I will update the code and prepare the next version.

Regards,
Zhanjun

-----Original Message-----
From: Dixit, Ashutosh
Sent: June 14, 2022 12:28 PM
To: Dong, Zhanjun
Cc: intel-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org; Teres Alexis, Alan Previn ; Wajdeczko, Michal
Subject: Re: [PATCH] drm/i915/guc: Check ctx while waiting for response

On Thu, 02 Jun 2022 10:21:19 -0700, Zhanjun Dong wrote:

Hi Zhanjun,

> We are seeing the error message "No response for request". Some cases
> happened while waiting for a response when a reset/suspend action was
> triggered. In this case, no response is not an error; active requests will be
> cancelled.
>
> This patch handles this condition and changes the error message
> into a debug message.

IMO the patch title should be changed: which ctx are we checking while waiting for a response? Something like "check for ct enabled while waiting for response"?

> @@ -481,12 +481,14 @@ static int wait_for_ct_request_update(struct ct_request *req, u32 *status)
>  #define GUC_CTB_RESPONSE_TIMEOUT_SHORT_MS 10
>  #define GUC_CTB_RESPONSE_TIMEOUT_LONG_MS 1000
>  #define done \
> -	(FIELD_GET(GUC_HXG_MSG_0_ORIGIN, READ_ONCE(req->status)) == \
> +	(!intel_guc_ct_enabled(ct) || FIELD_GET(GUC_HXG_MSG_0_ORIGIN, READ_ONCE(req->status)) == \
>  	 GUC_HXG_ORIGIN_GUC)
>  	err = wait_for_us(done, GUC_CTB_RESPONSE_TIMEOUT_SHORT_MS);
>  	if (err)
>  		err = wait_for(done, GUC_CTB_RESPONSE_TIMEOUT_LONG_MS);
>  #undef done
> +	if (!intel_guc_ct_enabled(ct))
> +		err = -ECANCELED;

Also, I really don't like intel_guc_ct_enabled() being called in two places. Is there a possibility that intel_guc_ct_enabled() can return false in the first place (causing the wait to exit) and then return true in the second place (so we don't return -ECANCELED)?

Is it possible to change the status of the request to something else from intel_guc_ct_disable() (or wherever ct->enabled is set to false) rather than introducing intel_guc_ct_enabled() checks here?
Changing the status of the request when CT goes down would cause the waits to exit here. And then we can check that special request status signifying CT went down?

Thanks.
--
Ashutosh