On 8/8/25 1:03 AM, Wang, Yang(Kevin) wrote:
> Reviewed-by: Yang Wang <[email protected]>
> ------------------------------------------------------------------------
> *From:* Lazar, Lijo <[email protected]>
> *Sent:* Friday, August 8, 2025 12:58:55 PM
> *To:* [email protected] <[email protected]>
> *Cc:* Zhang, Hawking <[email protected]>; Deucher, Alexander 
> <[email protected]>; Sun, Ce(Overlord) <[email protected]>; Wang, 
> Yang(Kevin) <[email protected]>
> *Subject:* [PATCH v4] drm/amdgpu: Save and restore switch state
>  
> During a DPC error, the kernel waits for the link to be active before
> notifying downstream devices. On certain platforms with a Broadcom switch
> in synthetic mode, the switch responds with values even though the link is
> not fully ready. The config space restoration done by the PCIe port driver
> for the SWUS/DS of the dGPU is thus not effective, as the switch is still
> doing internal enumeration.
> 
> As a workaround, save the state of the SWUS/DS devices in the driver. Add
> an additional check to see if the link is active, and restore the values
> during DPC error callbacks.
> 
> Signed-off-by: Lijo Lazar <[email protected]>
> ---
> 
> v2: Use usleep_range as sleep is short. Remove dev_info logs.
> v3: remove redundant increment of 'i' in loop (Ce Sun).
> v4: Add timeout for wait (Kevin Wang)

This patch regresses amdgpu init with Kaveri on amd-staging-drm-next.
 
A kernel oops leaves the system with frozen video and unable to
complete a reboot without using sysrq.  Reverting the commit restores
normal behavior.

Here's the dmesg snippet for e5e203e0cd53 after 'modprobe amdgpu dc=1':

[  492.145848] [drm] amdgpu kernel modesetting enabled.
[  492.146050] amdgpu: Virtual CRAT table created for CPU
[  492.146072] amdgpu: Topology: Add CPU node
[  492.146451] amdgpu 0000:00:01.0: amdgpu: initializing kernel modesetting 
(KAVERI 0x1002:0x130F 0x1043:0x85CB 0xD4).
[  492.146473] amdgpu 0000:00:01.0: amdgpu: register mmio base: 0xFEB00000
[  492.146477] amdgpu 0000:00:01.0: amdgpu: register mmio size: 262144
[  492.146559] amdgpu 0000:00:01.0: amdgpu: detected ip block number 0 
<cik_common>
[  492.146563] amdgpu 0000:00:01.0: amdgpu: detected ip block number 1 
<gmc_v7_0>
[  492.146566] amdgpu 0000:00:01.0: amdgpu: detected ip block number 2 <cik_ih>
[  492.146568] amdgpu 0000:00:01.0: amdgpu: detected ip block number 3 
<gfx_v7_0>
[  492.146571] amdgpu 0000:00:01.0: amdgpu: detected ip block number 4 
<cik_sdma>
[  492.146573] amdgpu 0000:00:01.0: amdgpu: detected ip block number 5 <kv_dpm>
[  492.146576] amdgpu 0000:00:01.0: amdgpu: detected ip block number 6 <dm>
[  492.146579] amdgpu 0000:00:01.0: amdgpu: detected ip block number 7 
<uvd_v4_2>
[  492.146581] amdgpu 0000:00:01.0: amdgpu: detected ip block number 8 
<vce_v2_0>
[  492.146611] amdgpu 0000:00:01.0: amdgpu: Fetched VBIOS from VFCT
[  492.146615] amdgpu: ATOM BIOS: 113-SPEC-102
[  492.151635] Console: switching to colour dummy device 80x25
[  492.151720] amdgpu 0000:00:01.0: vgaarb: deactivate vga console
[  492.151725] amdgpu 0000:00:01.0: amdgpu: Trusted Memory Zone (TMZ) feature 
not supported
[  492.151790] amdgpu 0000:00:01.0: amdgpu: vm size is 64 GB, 2 levels, block 
size is 10-bit, fragment size is 9-bit
[  492.151799] amdgpu 0000:00:01.0: amdgpu: VRAM: 256M 0x000000F400000000 - 
0x000000F40FFFFFFF (256M used)
[  492.151804] amdgpu 0000:00:01.0: amdgpu: GART: 1024M 0x000000FF00000000 - 
0x000000FF3FFFFFFF
[  492.151819] [drm] Detected VRAM RAM=256M, BAR=256M
[  492.151822] [drm] RAM width 128bits UNKNOWN
[  492.151982] amdgpu 0000:00:01.0: amdgpu: amdgpu: 256M of VRAM memory ready
[  492.151986] amdgpu 0000:00:01.0: amdgpu: amdgpu: 7849M of GTT memory ready.
[  492.152017] [drm] GART: num cpu pages 262144, num gpu pages 262144
[  492.152063] [drm] PCIE GART of 1024M enabled (table at 0x000000F400E00000).
[  492.157012] amdgpu 0000:00:01.0: amdgpu: [drm] Internal thermal controller 
without fan control
[  492.157021] amdgpu 0000:00:01.0: amdgpu: [drm] dpm initialized
[  492.160402] [drm] Found UVD firmware Version: 1.64 Family ID: 9
[  492.162619] [drm] Found VCE firmware Version: 50.10 Binary ID: 2
[  492.171961] amdgpu 0000:00:01.0: [drm] Unsupported Connector type:5!
[  492.171969] [drm:bios_parser_get_connector_id [amdgpu]] *ERROR* Can't find 
connector id 7 in connector table of size 7.
[  492.172925] [drm:bios_parser_get_connector_id [amdgpu]] *ERROR* Can't find 
connector id 8 in connector table of size 7.
[  492.173619] [drm:bios_parser_get_connector_id [amdgpu]] *ERROR* Can't find 
connector id 9 in connector table of size 7.
[  492.174311] [drm:bios_parser_get_connector_id [amdgpu]] *ERROR* Can't find 
connector id 10 in connector table of size 7.
[  492.175040] [drm:bios_parser_get_connector_id [amdgpu]] *ERROR* Can't find 
connector id 11 in connector table of size 7.
[  492.175827] [drm:bios_parser_get_connector_id [amdgpu]] *ERROR* Can't find 
connector id 12 in connector table of size 7.
[  492.176519] [drm:bios_parser_get_connector_id [amdgpu]] *ERROR* Can't find 
connector id 13 in connector table of size 7.
[  492.177351] amdgpu 0000:00:01.0: amdgpu: [drm] Display Core v3.2.344 
initialized on DCE 8.1
[  492.199366] snd_hda_intel 0000:00:01.1: bound 0000:00:01.0 (ops 
amdgpu_dm_audio_component_bind_ops [amdgpu])
[  492.250812] [drm] UVD initialized successfully.
[  492.371074] [drm] VCE initialized successfully.
[  492.375051] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[  492.375074] kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
[  492.375192] amdgpu: Virtual CRAT table created for GPU
[  492.375275] amdgpu: Topology: Add dGPU node [0x130f:0x1002]
[  492.375278] kfd kfd: amdgpu: added device 1002:130f
[  492.375290] amdgpu 0000:00:01.0: amdgpu: SE 1, SH per SE 1, CU per SH 8, 
active_cu_number 8
[  492.377072] BUG: kernel NULL pointer dereference, address: 000000000000003c
[  492.377079] #PF: supervisor read access in kernel mode
[  492.377083] #PF: error_code(0x0000) - not-present page
[  492.377088] PGD 0 P4D 0 
[  492.377093] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[  492.377100] CPU: 3 UID: 0 PID: 1739 Comm: modprobe Not tainted 
6.14.0-bisect-kaveri-oops #10
[  492.377108] Hardware name: System manufacturer System Product Name/A88X-PRO, 
BIOS 2001 04/30/2015
[  492.377116] RIP: 0010:amdgpu_device_cache_pci_state+0x7d/0x100 [amdgpu]
[  492.378216] Code: 84 9d 49 6a 00 48 8b 45 f8 f6 80 3c 08 00 00 01 74 07 48 
8b 80 f0 09 00 00 48 8b 40 10 48 8b 58 10 48 85 db 74 04 48 8b 58 38 <66> 81 7b 
3c 02 10 74 07 b8 01 00 00 00 eb 91 48 83 bd 18 5f 05 00
[  492.378226] RSP: 0018:ffffa74fc1a137b8 EFLAGS: 00010246
[  492.378233] RAX: ffff9b7881a0b000 RBX: 0000000000000000 RCX: 0000000000000074
[  492.378238] RDX: 000000000000005e RSI: ffff9b7881291e10 RDI: ffff9b7885402e36
[  492.378242] RBP: ffff9b7887300010 R08: 0000000e00000010 R09: 0000000000002930
[  492.378246] R10: 0000000029300000 R11: 0000000000000000 R12: 00000000000004a8
[  492.378251] R13: 0000000000000000 R14: ffff9b7880180000 R15: 0000000000000000
[  492.378255] FS:  00007f9d9e2acf00(0000) GS:ffff9b7b90180000(0000) 
knlGS:0000000000000000
[  492.378261] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  492.378265] CR2: 000000000000003c CR3: 0000000104950000 CR4: 00000000000506f0
[  492.378270] Call Trace:
[  492.378277]  <TASK>
[  492.378283]  ? __die_body.cold+0x19/0x27
[  492.378292]  ? page_fault_oops+0x15c/0x2e0
[  492.378298]  ? search_module_extables+0x19/0x60
[  492.378306]  ? search_bpf_extables+0x5f/0x80
[  492.378313]  ? exc_page_fault+0x7e/0x1a0
[  492.378319]  ? asm_exc_page_fault+0x26/0x30
[  492.378328]  ? amdgpu_device_cache_pci_state+0x7d/0x100 [amdgpu]
[  492.378951]  ? amdgpu_device_cache_pci_state+0x48/0x100 [amdgpu]
[  492.379442]  amdgpu_device_init.cold+0x1f59/0x2412 [amdgpu]
[  492.380088]  amdgpu_driver_load_kms+0x13/0x70 [amdgpu]
[  492.380582]  amdgpu_pci_probe+0x1e1/0x480 [amdgpu]
[  492.381068]  local_pci_probe+0x45/0x90
[  492.381075]  pci_device_probe+0xdd/0x270
[  492.381082]  really_probe+0xde/0x340
[  492.381087]  ? pm_runtime_barrier+0x54/0x90
[  492.381093]  ? __pfx___driver_attach+0x10/0x10
[  492.381097]  __driver_probe_device+0x78/0x110
[  492.381103]  driver_probe_device+0x1f/0xa0
[  492.381107]  __driver_attach+0xba/0x1c0
[  492.381112]  bus_for_each_dev+0x8f/0xe0
[  492.381119]  bus_add_driver+0x112/0x1f0
[  492.381126]  driver_register+0x72/0xd0
[  492.381131]  ? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
[  492.381616]  do_one_initcall+0x5b/0x310
[  492.381624]  do_init_module+0x60/0x230
[  492.381629]  init_module_from_file+0x89/0xe0
[  492.381637]  idempotent_init_module+0x115/0x310
[  492.381644]  __x64_sys_finit_module+0x65/0xc0
[  492.381650]  do_syscall_64+0x82/0x190
[  492.381656]  ? vfs_read+0x164/0x370
[  492.381661]  ? vfs_read+0x164/0x370
[  492.381665]  ? __rseq_handle_notify_resume+0xa2/0x4a0
[  492.381671]  ? restore_fpregs_from_fpstate+0x3c/0xa0
[  492.381678]  ? switch_fpu_return+0x4e/0xd0
[  492.381684]  ? syscall_exit_to_user_mode+0x172/0x210
[  492.381689]  ? do_syscall_64+0x8e/0x190
[  492.381694]  ? vfs_statx+0x81/0x120
[  492.381700]  ? vfs_fstatat+0x75/0xa0
[  492.381705]  ? __do_sys_newfstatat+0x3c/0x80
[  492.381712]  ? syscall_exit_to_user_mode+0x4d/0x210
[  492.381717]  ? do_syscall_64+0x8e/0x190
[  492.381721]  ? do_syscall_64+0x8e/0x190
[  492.381726]  ? do_syscall_64+0x8e/0x190
[  492.381731]  ? exc_page_fault+0x7e/0x1a0
[  492.381735]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  492.381742] RIP: 0033:0x7f9d9db12779
[  492.381760] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 
f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 
f0 ff ff 73 01 c3 48 8b 0d 4f 86 0d 00 f7 d8 64 89 01 48
[  492.381768] RSP: 002b:00007ffe4cd71358 EFLAGS: 00000246 ORIG_RAX: 
0000000000000139
[  492.381774] RAX: ffffffffffffffda RBX: 0000564c05216d70 RCX: 00007f9d9db12779
[  492.381779] RDX: 0000000000000004 RSI: 0000564c0521bbf0 RDI: 0000000000000007
[  492.381783] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
[  492.381787] R10: 0000000000000000 R11: 0000000000000246 R12: 0000564c0521bbf0
[  492.381791] R13: 0000000000040000 R14: 0000564c05216e90 R15: 0000000000000000
[  492.381798]  </TASK>
[  492.381801] Modules linked in: amdgpu(+) amdxcp gpu_sched 
drm_panel_backlight_quirks drm_buddy sch_cake ifb sch_htb cls_u32 cls_flow 
cls_fw act_mirred sch_ingress ip6t_REJECT nf_reject_ipv6 nft_chain_nat 
nft_limit xt_iprange xt_MASQUERADE xt_addrtype xt_nat nf_nat xt_LOG 
nf_log_syslog xt_limit xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 
nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 nft_compat nf_tables binfmt_misc 
nls_ascii nls_cp437 vfat fat edac_mce_amd snd_hda_codec_realtek kvm_amd 
snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_scodec_component ccp 
snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi kvm eeepc_wmi snd_hda_codec 
asus_wmi sparse_keymap snd_hda_core snd_hwdep platform_profile snd_pcm battery 
rfkill wmi_bmof snd_timer pcspkr k10temp fam15h_power at24 snd soundcore evdev 
sg nct6775 nct6775_core hwmon_vid drivetemp efi_pstore configfs nfnetlink 
efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 dm_crypt dm_mod 
hid_generic usbhid hid radeon drm_ttm_helper sd_mod ttm ghash_clmulni_intel
[  492.381896]  sha512_ssse3 drm_exec i2c_algo_bit sha256_ssse3 
drm_suballoc_helper sha1_ssse3 drm_display_helper cec rc_core aesni_intel ahci 
drm_client_lib ohci_pci drm_kms_helper crypto_simd libahci ehci_pci xhci_pci 
libata ehci_hcd cryptd ohci_hcd e1000e xhci_hcd drm sp5100_tco usbcore scsi_mod 
watchdog i2c_piix4 i2c_smbus usb_common scsi_common video wmi button
[  492.381970] CR2: 000000000000003c
[  492.382003] ---[ end trace 0000000000000000 ]---
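
Reading the trace, the faulting bytes "<66> 81 7b 3c 02 10" decode to
cmpw $0x1002,0x3c(%rbx) with RBX = 0 and CR2 = 0x3c, and 0x1002 is
PCI_VENDOR_ID_ATI. That looks consistent with the
"parent->vendor != PCI_VENDOR_ID_ATI" check in
amdgpu_device_cache_switch_state() running with parent == NULL, which
is what pci_upstream_bridge() returns for a device on the root bus
such as this APU at 0000:00:01.0. Below is a minimal userspace sketch
of the kind of guard that would avoid the dereference. It is not a
tested kernel fix: the mock_* names and struct are hypothetical
stand-ins for struct pci_dev and pci_upstream_bridge(), used only to
show the NULL checks.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_VENDOR_ID_ATI 0x1002 /* same value as PCI_VENDOR_ID_ATI */

/* Hypothetical stand-in for struct pci_dev; only the fields used here. */
struct mock_pci_dev {
	uint16_t vendor;
	struct mock_pci_dev *upstream; /* NULL for a device on the root bus */
};

/* Stand-in for pci_upstream_bridge(): NULL when there is no bridge above. */
struct mock_pci_dev *mock_upstream_bridge(struct mock_pci_dev *dev)
{
	return dev->upstream;
}

/*
 * Returns true and sets *swus only when the device sits below an ATI
 * downstream port (SWDS) that itself has an upstream port (SWUS).
 * Every pointer is checked before it is dereferenced.
 */
bool mock_find_switch_pair(struct mock_pci_dev *gpu,
			   struct mock_pci_dev **swus)
{
	struct mock_pci_dev *swds = mock_upstream_bridge(gpu);
	struct mock_pci_dev *up;

	*swus = NULL;

	/* Root-bus device (e.g. an APU): nothing to save, bail out early. */
	if (!swds || swds->vendor != MOCK_VENDOR_ID_ATI)
		return false;

	/* The second hop can be missing too, so guard it as well. */
	up = mock_upstream_bridge(swds);
	if (!up)
		return false;

	*swus = up;
	return true;
}
```

In the real function the equivalent change would presumably be an early
return when pci_upstream_bridge() yields NULL, for both the first and
the second lookup, before any pci_save_state() call.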

Thanks,
John

> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  3 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 85 ++++++++++++++++++++--
>  2 files changed, 83 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index e4ecce1c4196..c8fe3e34e784 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -918,6 +918,9 @@ struct amdgpu_pcie_reset_ctx {
>          bool in_link_reset;
>          bool occurs_dpc;
>          bool audio_suspended;
> +       struct pci_dev *swus;
> +       struct pci_saved_state *swus_pcistate;
> +       struct pci_saved_state *swds_pcistate;
>  };
>  
>  /*
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 26706fab0de9..0e8c17f328e5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -178,6 +178,8 @@ struct amdgpu_init_level amdgpu_init_minimal_xgmi = {
>                  BIT(AMD_IP_BLOCK_TYPE_PSP)
>  };
>  
> +static void amdgpu_device_load_switch_state(struct amdgpu_device *adev);
> +
>  static inline bool amdgpu_ip_member_of_hwini(struct amdgpu_device *adev,
>                                               enum amd_ip_block_type block)
>  {
> @@ -5027,7 +5029,8 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
>          adev->reset_domain = NULL;
>  
>          kfree(adev->pci_state);
> -
> +       kfree(adev->pcie_reset_ctx.swds_pcistate);
> +       kfree(adev->pcie_reset_ctx.swus_pcistate);
>  }
>  
>  /**
> @@ -6985,16 +6988,34 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev 
> *pdev)
>          struct amdgpu_device *tmp_adev;
>          struct amdgpu_hive_info *hive;
>          struct list_head device_list;
> -       int r = 0, i;
> +       struct pci_dev *link_dev;
> +       int r = 0, i, timeout;
>          u32 memsize;
> +       u16 status;
>  
>          dev_info(adev->dev, "PCI error: slot reset callback!!\n");
>  
>          memset(&reset_context, 0, sizeof(reset_context));
>  
> -       /* wait for asic to come out of reset */
> -       msleep(700);
> +       if (adev->pcie_reset_ctx.swus)
> +               link_dev = adev->pcie_reset_ctx.swus;
> +       else
> +               link_dev = adev->pdev;
> +       /* wait for asic to come out of reset, timeout = 10s */
> +       timeout = 10000;
> +       do {
> +               usleep_range(10000, 10500);
> +               r = pci_read_config_word(link_dev, PCI_VENDOR_ID, &status);
> +               timeout -= 10;
> +       } while (timeout > 0 && (status != PCI_VENDOR_ID_ATI) &&
> +                (status != PCI_VENDOR_ID_AMD));
>  
> +       if ((status != PCI_VENDOR_ID_ATI) && (status != PCI_VENDOR_ID_AMD)) {
> +               r = -ETIME;
> +               goto out;
> +       }
> +
> +       amdgpu_device_load_switch_state(adev);
>          /* Restore PCI confspace */
>          amdgpu_device_load_pci_state(pdev);
>  
> @@ -7096,6 +7117,58 @@ void amdgpu_pci_resume(struct pci_dev *pdev)
>          }
>  }
>  
> +static void amdgpu_device_cache_switch_state(struct amdgpu_device *adev)
> +{
> +       struct pci_dev *parent = pci_upstream_bridge(adev->pdev);
> +       int r;
> +
> +       if (parent->vendor != PCI_VENDOR_ID_ATI)
> +               return;
> +
> +       /* If already saved, return */
> +       if (adev->pcie_reset_ctx.swus)
> +               return;
> +       /* Upstream bridge is ATI, assume it's SWUS/DS architecture */
> +       r = pci_save_state(parent);
> +       if (r)
> +               return;
> +       adev->pcie_reset_ctx.swds_pcistate = pci_store_saved_state(parent);
> +
> +       parent = pci_upstream_bridge(parent);
> +       r = pci_save_state(parent);
> +       if (r)
> +               return;
> +       adev->pcie_reset_ctx.swus_pcistate = pci_store_saved_state(parent);
> +
> +       adev->pcie_reset_ctx.swus = parent;
> +}
> +
> +static void amdgpu_device_load_switch_state(struct amdgpu_device *adev)
> +{
> +       struct pci_dev *pdev;
> +       int r;
> +
> +       if (!adev->pcie_reset_ctx.swds_pcistate ||
> +           !adev->pcie_reset_ctx.swus_pcistate)
> +               return;
> +
> +       pdev = adev->pcie_reset_ctx.swus;
> +       r = pci_load_saved_state(pdev, adev->pcie_reset_ctx.swus_pcistate);
> +       if (!r) {
> +               pci_restore_state(pdev);
> +       } else {
> +               dev_warn(adev->dev, "Failed to load SWUS state, err:%d\n", r);
> +               return;
> +       }
> +
> +       pdev = pci_upstream_bridge(adev->pdev);
> +       r = pci_load_saved_state(pdev, adev->pcie_reset_ctx.swds_pcistate);
> +       if (!r)
> +               pci_restore_state(pdev);
> +       else
> +               dev_warn(adev->dev, "Failed to load SWDS state, err:%d\n", r);
> +}
> +
>  bool amdgpu_device_cache_pci_state(struct pci_dev *pdev)
>  {
>          struct drm_device *dev = pci_get_drvdata(pdev);
> @@ -7120,6 +7193,8 @@ bool amdgpu_device_cache_pci_state(struct pci_dev *pdev)
>                  return false;
>          }
>  
> +       amdgpu_device_cache_switch_state(adev);
> +
>          return true;
>  }
>  
> @@ -7555,4 +7630,4 @@ u64 amdgpu_device_get_uid(struct amdgpu_uid *uid_info,
>          }
>  
>          return uid_info->uid[type][inst];
> -}
> \ No newline at end of file
> +}
> -- 
> 2.49.0
> 
