Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm
Alex wrote:
> On Thu, Jan 6, 2022 at 10:38 AM Yann Dirson wrote:
> >
> > Alex wrote:
> > > > How is the stolen memory communicated to the driver? That host
> > > > physical memory probably has to be mapped at the same guest
> > > > physical address for the magic to work, right?
> > >
> > > Correct. The driver reads the physical location of that memory
> > > from hardware registers. Removing this chunk of code from
> > > gmc_v9_0.c will force the driver to use the BAR,
> >
> > That would only be a workaround for a missing mapping of stolen
> > memory to the guest, right?
>
> Correct. That will use the PCI BAR rather than the underlying
> physical memory for CPU access to the carve out region.
>
> > > but I'm not sure if there are any
> > > other places in the driver that make assumptions about using the
> > > physical host address or not on APUs off hand.
> >
> > gmc_v9_0_vram_gtt_location() updates vm_manager.vram_base_offset
> > from the same value. I'm not sure I understand why in this case
> > there is no reason to use the BAR while there are some in
> > gmc_v9_0_mc_init().
> >
> > vram_base_offset then gets used in several places:
> >
> > * amdgpu_gmc_init_pdb0, that seems likely enough to be problematic,
> >   right?
> >   As a side note, the XGMI offset added earlier gets subtracted
> >   here to deduce the vram base addr
> >   (a couple of new acronyms there: PDB, PDE -- page directory
> >   base/entry?)
> >
> > * amdgpu_ttm_map_buffer, amdgpu_vm_bo_update_mapping: those seem to
> >   be just as problematic
> >
> > * amdgpu_gmc_vram_mc2pa: until I got there I had assumed MC could
> >   stand for "memory controller", but then "MC address of buffer"
> >   makes me doubt
>
> MC = memory controller (as in graphics memory controller).
>
> These are GPU addresses not CPU addresses so they should be fine.
>
> > >
> > > if ((adev->flags & AMD_IS_APU) ||
> > >     (adev->gmc.xgmi.supported &&
> > >      adev->gmc.xgmi.connected_to_cpu)) {
> > >         adev->gmc.aper_base =
> > >                 adev->gfxhub.funcs->get_mc_fb_offset(adev) +
> > >                 adev->gmc.xgmi.physical_node_id *
> > >                 adev->gmc.xgmi.node_segment_size;
> > >         adev->gmc.aper_size = adev->gmc.real_vram_size;
> > > }
> >
> > Now for the test... it does indeed seem to go much further, I even
> > lose the dom0's efifb to that black screen, hopefully showing the
> > driver started to set up the hardware. Will probably still have to
> > hunt down whether it still tries to use efifb afterwards (can't see
> > why it would not, TBH, given the previous behaviour where it kept
> > using it after the guest failed to start).
> >
> > The log shows many details about TMR loading
> >
> > Then as expected:
> >
> > [2022-01-06 15:16:09] <6>[5.844589] amdgpu :00:05.0: amdgpu: RAP: optional rap ta ucode is not available
> > [2022-01-06 15:16:09] <6>[5.844619] amdgpu :00:05.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
> > [2022-01-06 15:16:09] <7>[5.844639] [drm:amdgpu_device_init.cold [amdgpu]] hw_init (phase2) of IP block ...
> > [2022-01-06 15:16:09] <6>[5.845515] amdgpu :00:05.0: amdgpu: SMU is initialized successfully!
> >
> > not sure about that unhandled interrupt (and a bit worried about
> > messed-up logs):
> >
> > [2022-01-06 15:16:09] <7>[6.010681] amdgpu :00:05.0: [drm:amdgpu_ring_test_hel[2022-01-06 15:16:10] per [amdgpu]] ring test on sdma0 succeeded
> > [2022-01-06 15:16:10] <7>[6.010831] [drm:amdgpu_ih_process [amdgpu]] amdgpu_ih_process: rptr 0, wptr 32
> > [2022-01-06 15:16:10] <7>[6.011002] [drm:amdgpu_irq_dispatch [amdgpu]] Unhandled interrupt src_id: 243
> >
> > then comes a first error:
> >
> > [2022-01-06 15:16:10] <6>[6.011785] [drm] Display Core initialized with v3.2.149!
> > [2022-01-06 15:16:10] <6>[6.012714] [drm] DMUB hardware initialized: version=0x0101001C
> > [2022-01-06 15:16:10] <3>[6.228263] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
> > [2022-01-06 15:16:10] <7>[6.229125] [drm:amdgpu_dm_init.isra.0.cold [amdgpu]] amdgpu: freesync_module init done 76c7b459.
> > [2022-01-06 15:16:10] <7>[6.229677] [drm:amdgpu_dm_init.isra.0.cold [amdgpu]] amdgpu: hdcp_workqueue init done 87e28b47.
> > [2022-01-06 15:16:10] <7>[6.229979] [drm:amdgpu_dm_init.isra.0.cold [amdgpu]] amdgpu_dm_connector_init()
> >
> > ... which we can see again several times later, though the driver
> > still seems able to finish init:
> >
> > [2022-01-06 15:16:10] <6>[6.615615] [drm] late_init of IP block ...
> > [2022-01-06 15:16:10] <6>[6.615772] [drm] late_init of IP block ...
> > [2022-01-06 15:16:10]
Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm
On Thu, Jan 6, 2022 at 10:38 AM Yann Dirson wrote:
>
> Alex wrote:
> > > How is the stolen memory communicated to the driver? That host
> > > physical memory probably has to be mapped at the same guest
> > > physical address for the magic to work, right?
> >
> > Correct. The driver reads the physical location of that memory from
> > hardware registers. Removing this chunk of code from gmc_v9_0.c will
> > force the driver to use the BAR,
>
> That would only be a workaround for a missing mapping of stolen
> memory to the guest, right?

Correct. That will use the PCI BAR rather than the underlying physical
memory for CPU access to the carve out region.

> > but I'm not sure if there are any
> > other places in the driver that make assumptions about using the
> > physical host address or not on APUs off hand.
>
> gmc_v9_0_vram_gtt_location() updates vm_manager.vram_base_offset from
> the same value. I'm not sure I understand why in this case there is
> no reason to use the BAR while there are some in gmc_v9_0_mc_init().
>
> vram_base_offset then gets used in several places:
>
> * amdgpu_gmc_init_pdb0, that seems likely enough to be problematic,
>   right?
>   As a side note, the XGMI offset added earlier gets subtracted
>   here to deduce the vram base addr
>   (a couple of new acronyms there: PDB, PDE -- page directory base/entry?)
>
> * amdgpu_ttm_map_buffer, amdgpu_vm_bo_update_mapping: those seem to be
>   just as problematic
>
> * amdgpu_gmc_vram_mc2pa: until I got there I had assumed MC could stand for
>   "memory controller", but then "MC address of buffer" makes me doubt

MC = memory controller (as in graphics memory controller).

These are GPU addresses not CPU addresses so they should be fine.

> >
> > if ((adev->flags & AMD_IS_APU) ||
> >     (adev->gmc.xgmi.supported &&
> >      adev->gmc.xgmi.connected_to_cpu)) {
> >         adev->gmc.aper_base =
> >                 adev->gfxhub.funcs->get_mc_fb_offset(adev) +
> >                 adev->gmc.xgmi.physical_node_id *
> >                 adev->gmc.xgmi.node_segment_size;
> >         adev->gmc.aper_size = adev->gmc.real_vram_size;
> > }
>
> Now for the test... it does indeed seem to go much further, I even
> lose the dom0's efifb to that black screen, hopefully showing the
> driver started to set up the hardware. Will probably still have to
> hunt down whether it still tries to use efifb afterwards (can't see
> why it would not, TBH, given the previous behaviour where it kept
> using it after the guest failed to start).
>
> The log shows many details about TMR loading
>
> Then as expected:
>
> [2022-01-06 15:16:09] <6>[5.844589] amdgpu :00:05.0: amdgpu: RAP: optional rap ta ucode is not available
> [2022-01-06 15:16:09] <6>[5.844619] amdgpu :00:05.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
> [2022-01-06 15:16:09] <7>[5.844639] [drm:amdgpu_device_init.cold [amdgpu]] hw_init (phase2) of IP block ...
> [2022-01-06 15:16:09] <6>[5.845515] amdgpu :00:05.0: amdgpu: SMU is initialized successfully!
>
> not sure about that unhandled interrupt (and a bit worried about messed-up
> logs):
>
> [2022-01-06 15:16:09] <7>[6.010681] amdgpu :00:05.0: [drm:amdgpu_ring_test_hel[2022-01-06 15:16:10] per [amdgpu]] ring test on sdma0 succeeded
> [2022-01-06 15:16:10] <7>[6.010831] [drm:amdgpu_ih_process [amdgpu]] amdgpu_ih_process: rptr 0, wptr 32
> [2022-01-06 15:16:10] <7>[6.011002] [drm:amdgpu_irq_dispatch [amdgpu]] Unhandled interrupt src_id: 243
>
> then comes a first error:
>
> [2022-01-06 15:16:10] <6>[6.011785] [drm] Display Core initialized with v3.2.149!
> [2022-01-06 15:16:10] <6>[6.012714] [drm] DMUB hardware initialized: version=0x0101001C
> [2022-01-06 15:16:10] <3>[6.228263] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
> [2022-01-06 15:16:10] <7>[6.229125] [drm:amdgpu_dm_init.isra.0.cold [amdgpu]] amdgpu: freesync_module init done 76c7b459.
> [2022-01-06 15:16:10] <7>[6.229677] [drm:amdgpu_dm_init.isra.0.cold [amdgpu]] amdgpu: hdcp_workqueue init done 87e28b47.
> [2022-01-06 15:16:10] <7>[6.229979] [drm:amdgpu_dm_init.isra.0.cold [amdgpu]] amdgpu_dm_connector_init()
>
> ... which we can see again several times later, though the driver
> still seems able to finish init:
>
> [2022-01-06 15:16:10] <6>[6.615615] [drm] late_init of IP block ...
> [2022-01-06 15:16:10] <6>[6.615772] [drm] late_init of IP block ...
> [2022-01-06 15:16:10] <6>[6.615801] [drm] late_init of IP block ...
> [2022-01-06 15:16:10] <6>[6.615827] [drm] late_init of IP block ...
> [2022-01-06 15:16:10] <3>[6.801790] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
> [2022-01-06 15:16:10] <7>[6.806079] [drm:drm_minor_register [drm]]
> [2022-01-06
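To make the MC-versus-CPU address distinction in this message concrete, here is a minimal sketch in C of the kind of arithmetic a helper named like amdgpu_gmc_vram_mc2pa would need: an MC (GPU-side) address inside VRAM is rebased onto vram_base_offset to give a physical address. It assumes the translation is a plain rebase. The struct, the function name mc_to_pa_sketch and every number used with it are invented for illustration; this is not the driver's code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, stripped-down stand-in for the two values discussed above. */
struct vram_layout_sketch {
	uint64_t vram_start;       /* MC address where VRAM begins (GPU view) */
	uint64_t vram_base_offset; /* physical base of the carve-out (CPU view) */
};

/*
 * Assumed translation: take the buffer's offset inside VRAM and add the
 * physical base. The caller must pass mc_addr >= vram_start.
 */
static uint64_t mc_to_pa_sketch(const struct vram_layout_sketch *v,
				uint64_t mc_addr)
{
	return mc_addr - v->vram_start + v->vram_base_offset;
}
```

With the made-up values vram_start = 0xF400000000 and vram_base_offset = 0x430000000, MC address 0xF400001000 comes out as physical address 0x430001000. The GPU itself only ever uses the MC address, which is why page-table code working on MC addresses should not care whether the CPU reaches the memory through the BAR or through the carve-out.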
Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm
Alex wrote:
> > How is the stolen memory communicated to the driver? That host
> > physical memory probably has to be mapped at the same guest
> > physical address for the magic to work, right?
>
> Correct. The driver reads the physical location of that memory from
> hardware registers. Removing this chunk of code from gmc_v9_0.c will
> force the driver to use the BAR,

That would only be a workaround for a missing mapping of stolen
memory to the guest, right?

> but I'm not sure if there are any
> other places in the driver that make assumptions about using the
> physical host address or not on APUs off hand.

gmc_v9_0_vram_gtt_location() updates vm_manager.vram_base_offset from
the same value. I'm not sure I understand why in this case there is
no reason to use the BAR while there are some in gmc_v9_0_mc_init().

vram_base_offset then gets used in several places:

* amdgpu_gmc_init_pdb0, that seems likely enough to be problematic,
  right?
  As a side note, the XGMI offset added earlier gets subtracted
  here to deduce the vram base addr
  (a couple of new acronyms there: PDB, PDE -- page directory base/entry?)

* amdgpu_ttm_map_buffer, amdgpu_vm_bo_update_mapping: those seem to be
  just as problematic

* amdgpu_gmc_vram_mc2pa: until I got there I had assumed MC could stand for
  "memory controller", but then "MC address of buffer" makes me doubt

>
> if ((adev->flags & AMD_IS_APU) ||
>     (adev->gmc.xgmi.supported &&
>      adev->gmc.xgmi.connected_to_cpu)) {
>         adev->gmc.aper_base =
>                 adev->gfxhub.funcs->get_mc_fb_offset(adev) +
>                 adev->gmc.xgmi.physical_node_id *
>                 adev->gmc.xgmi.node_segment_size;
>         adev->gmc.aper_size = adev->gmc.real_vram_size;
> }

Now for the test... it does indeed seem to go much further, I even
lose the dom0's efifb to that black screen, hopefully showing the
driver started to set up the hardware. Will probably still have to
hunt down whether it still tries to use efifb afterwards (can't see
why it would not, TBH, given the previous behaviour where it kept
using it after the guest failed to start).

The log shows many details about TMR loading

Then as expected:

[2022-01-06 15:16:09] <6>[5.844589] amdgpu :00:05.0: amdgpu: RAP: optional rap ta ucode is not available
[2022-01-06 15:16:09] <6>[5.844619] amdgpu :00:05.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[2022-01-06 15:16:09] <7>[5.844639] [drm:amdgpu_device_init.cold [amdgpu]] hw_init (phase2) of IP block ...
[2022-01-06 15:16:09] <6>[5.845515] amdgpu :00:05.0: amdgpu: SMU is initialized successfully!

not sure about that unhandled interrupt (and a bit worried about
messed-up logs):

[2022-01-06 15:16:09] <7>[6.010681] amdgpu :00:05.0: [drm:amdgpu_ring_test_hel[2022-01-06 15:16:10] per [amdgpu]] ring test on sdma0 succeeded
[2022-01-06 15:16:10] <7>[6.010831] [drm:amdgpu_ih_process [amdgpu]] amdgpu_ih_process: rptr 0, wptr 32
[2022-01-06 15:16:10] <7>[6.011002] [drm:amdgpu_irq_dispatch [amdgpu]] Unhandled interrupt src_id: 243

then comes a first error:

[2022-01-06 15:16:10] <6>[6.011785] [drm] Display Core initialized with v3.2.149!
[2022-01-06 15:16:10] <6>[6.012714] [drm] DMUB hardware initialized: version=0x0101001C
[2022-01-06 15:16:10] <3>[6.228263] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
[2022-01-06 15:16:10] <7>[6.229125] [drm:amdgpu_dm_init.isra.0.cold [amdgpu]] amdgpu: freesync_module init done 76c7b459.
[2022-01-06 15:16:10] <7>[6.229677] [drm:amdgpu_dm_init.isra.0.cold [amdgpu]] amdgpu: hdcp_workqueue init done 87e28b47.
[2022-01-06 15:16:10] <7>[6.229979] [drm:amdgpu_dm_init.isra.0.cold [amdgpu]] amdgpu_dm_connector_init()

... which we can see again several times later, though the driver
still seems able to finish init:

[2022-01-06 15:16:10] <6>[6.615615] [drm] late_init of IP block ...
[2022-01-06 15:16:10] <6>[6.615772] [drm] late_init of IP block ...
[2022-01-06 15:16:10] <6>[6.615801] [drm] late_init of IP block ...
[2022-01-06 15:16:10] <6>[6.615827] [drm] late_init of IP block ...
[2022-01-06 15:16:10] <3>[6.801790] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
[2022-01-06 15:16:10] <7>[6.806079] [drm:drm_minor_register [drm]]
[2022-01-06 15:16:10] <7>[6.806195] [drm:drm_minor_register [drm]] new minor registered 128
[2022-01-06 15:16:10] <7>[6.806223] [drm:drm_minor_register [drm]]
[2022-01-06 15:16:10] <7>[6.806289] [drm:drm_minor_register [drm]] new minor registered 0
[2022-01-06 15:16:10] <7>[6.806355] [drm:drm_sysfs_connector_add [drm]] adding "eDP-1" to sysfs
[2022-01-06 15:16:10] <7>[6.806424] [drm:drm_dp_aux_register_devnode [drm_kms_helper]] drm_dp_aux_dev: aux [AMDGPU DM aux hw bus 0]
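When sifting through a capture like the one above, the "<N>" marker in front of the kernel timestamp is the printk log level: 3 is an error, 6 is informational, 7 is debug. A small sketch in C for picking out the error lines, assuming the capture format shown in this message (a console timestamp followed by "<N>"). The helper names klog_level and klog_is_error are made up for this example.

```c
#include <assert.h>
#include <ctype.h>

/*
 * Return the kernel log level (0-7) from the first "<N>" marker in a line,
 * or -1 if the line carries no such marker.
 */
static int klog_level(const char *line)
{
	for (const char *p = line; p[0] != '\0'; p++) {
		/* Short-circuit evaluation keeps us from reading past the end. */
		if (p[0] == '<' && isdigit((unsigned char)p[1]) && p[2] == '>')
			return p[1] - '0';
	}
	return -1;
}

/* Levels 0 to 3 are emergency, alert, critical and error. */
static int klog_is_error(const char *line)
{
	int level = klog_level(line);

	return level >= 0 && level <= 3;
}
```

Run over the excerpt above, only the two dc_dmub_srv_wait_idle lines would be flagged. The "Unhandled interrupt src_id: 243" line is logged at level 7 and would not be.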
Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm
On Wed, Dec 29, 2021 at 12:34 PM Yann Dirson wrote:
>
> Alex wrote:
> > On Wed, Dec 29, 2021 at 11:59 AM Yann Dirson wrote:
> > >
> > > Alex wrote:
> > > > On Tue, Dec 21, 2021 at 6:09 PM Yann Dirson wrote:
> > > > >
> > > > > > - Original Message -
> > > > > > From: "Alex Deucher"
> > > > > > To: "Yann Dirson"
> > > > > > Cc: "Christian König", "amd-gfx list"
> > > > > > Sent: Tuesday, 21 December 2021 23:31:01
> > > > > > Subject: Re: Various problems trying to vga-passthrough a
> > > > > > Renoir iGPU to a xen/qubes-os hvm
> > > > > >
> > > > > > On Tue, Dec 21, 2021 at 5:12 PM Yann Dirson wrote:
> > > > > > >
> > > > > > > Alex wrote:
> > > > > > > >
> > > > > > > > On Sun, Dec 19, 2021 at 11:41 AM Yann Dirson wrote:
> > > > > > > > >
> > > > > > > > > Christian wrote:
> > > > > > > > > > On 19.12.21 at 17:00, Yann Dirson wrote:
> > > > > > > > > > > Alex wrote:
> > > > > > > > > > >> Thinking about this more, I think the problem might be
> > > > > > > > > > >> related to CPU access to "VRAM". APUs don't have
> > > > > > > > > > >> dedicated VRAM, they use a reserved carve out region at
> > > > > > > > > > >> the top of system memory. For CPU access to this memory,
> > > > > > > > > > >> we kmap the physical address of the carve out region of
> > > > > > > > > > >> system memory. You'll need to make sure that region is
> > > > > > > > > > >> accessible to the guest.
> > > > > > > > > > > So basically, the non-virt flow is: (video?) BIOS
> > > > > > > > > > > reserves memory, marks it as reserved in e820, stores
> > > > > > > > > > > the physaddr somewhere, which the GPU driver gets.
> > > > > > > > > > > Since I suppose this includes the framebuffer, this
> > > > > > > > > > > probably has to occur around the moment the driver calls
> > > > > > > > > > > drm_aperture_remove_conflicting_pci_framebuffers()
> > > > > > > > > > > (which happens before this hw init step), right?
> > > > > > > > > >
> > > > > > > > > > Well, that's partially correct. The efifb is using the
> > > > > > > > > > PCIe resources to access the framebuffer and as far as
> > > > > > > > > > I know we use that one to kick it out.
> > > > > > > > > >
> > > > > > > > > > The stolen memory we get over e820/registers is separate
> > > > > > > > > > to that.
> > > > > > >
> > > > > > > How is the stolen memory communicated to the driver?
> > > > > > > That host physical memory probably has to be mapped at
> > > > > > > the same guest physical address for the magic to work,
> > > > > > > right?
> > > > > >
> > > > > > Correct. The driver reads the physical location of that
> > > > > > memory from hardware registers. Removing this chunk of
> > > > > > code from gmc_v9_0.c will force the driver to use the
> > > > > > BAR, but I'm not sure if there are any other places in
> > > > > > the driver that make assumptions about using the
> > > > > > physical host address or not on APUs off hand.
> > > > > >
> > > > > > if ((adev->flags & AMD_IS_APU) ||
> > > > > >     (adev->gmc.xgmi.supported &&
> > > > > >      adev->gmc.xgmi.connected_to_cpu)) {
> > > > > >         adev->gmc.aper_base =
> > > > > >                 adev->gfxhub.funcs->get_mc_fb_offset(adev) +
> > > > > >                 adev->gmc.xgmi.physical_node_id *
> > > > > >                 adev->gmc.xgmi.node_segment_size;
> > > > > >         adev->gmc.aper_size = adev->gmc.real_vram_size;
> > > > > > }
> > > > > >
> > > > > > > > > > > ... which brings me to a point that's been puzzling me
> > > > > > > > > > > for some time, which is that as the hw init fails, the
> > > > > > > > > > > efifb driver is still using the framebuffer.
> > > > > > > > > >
> > > > > > > > > > No, it
Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm
Alex wrote:
> On Wed, Dec 29, 2021 at 11:59 AM Yann Dirson wrote:
> >
> > Alex wrote:
> > > On Tue, Dec 21, 2021 at 6:09 PM Yann Dirson wrote:
> > > >
> > > > > - Original Message -
> > > > > From: "Alex Deucher"
> > > > > To: "Yann Dirson"
> > > > > Cc: "Christian König", "amd-gfx list"
> > > > > Sent: Tuesday, 21 December 2021 23:31:01
> > > > > Subject: Re: Various problems trying to vga-passthrough a
> > > > > Renoir iGPU to a xen/qubes-os hvm
> > > > >
> > > > > On Tue, Dec 21, 2021 at 5:12 PM Yann Dirson wrote:
> > > > > >
> > > > > > Alex wrote:
> > > > > > >
> > > > > > > On Sun, Dec 19, 2021 at 11:41 AM Yann Dirson wrote:
> > > > > > > >
> > > > > > > > Christian wrote:
> > > > > > > > > On 19.12.21 at 17:00, Yann Dirson wrote:
> > > > > > > > > > Alex wrote:
> > > > > > > > > >> Thinking about this more, I think the problem might be
> > > > > > > > > >> related to CPU access to "VRAM". APUs don't have
> > > > > > > > > >> dedicated VRAM, they use a reserved carve out region at
> > > > > > > > > >> the top of system memory. For CPU access to this memory,
> > > > > > > > > >> we kmap the physical address of the carve out region of
> > > > > > > > > >> system memory. You'll need to make sure that region is
> > > > > > > > > >> accessible to the guest.
> > > > > > > > > > So basically, the non-virt flow is: (video?) BIOS
> > > > > > > > > > reserves memory, marks it as reserved in e820, stores
> > > > > > > > > > the physaddr somewhere, which the GPU driver gets.
> > > > > > > > > > Since I suppose this includes the framebuffer, this
> > > > > > > > > > probably has to occur around the moment the driver calls
> > > > > > > > > > drm_aperture_remove_conflicting_pci_framebuffers()
> > > > > > > > > > (which happens before this hw init step), right?
> > > > > > > > >
> > > > > > > > > Well, that's partially correct. The efifb is using the
> > > > > > > > > PCIe resources to access the framebuffer and as far as
> > > > > > > > > I know we use that one to kick it out.
> > > > > > > > >
> > > > > > > > > The stolen memory we get over e820/registers is separate
> > > > > > > > > to that.
> > > > > >
> > > > > > How is the stolen memory communicated to the driver?
> > > > > > That host physical memory probably has to be mapped at
> > > > > > the same guest physical address for the magic to work,
> > > > > > right?
> > > > >
> > > > > Correct. The driver reads the physical location of that
> > > > > memory from hardware registers. Removing this chunk of
> > > > > code from gmc_v9_0.c will force the driver to use the
> > > > > BAR, but I'm not sure if there are any other places in
> > > > > the driver that make assumptions about using the
> > > > > physical host address or not on APUs off hand.
> > > > >
> > > > > if ((adev->flags & AMD_IS_APU) ||
> > > > >     (adev->gmc.xgmi.supported &&
> > > > >      adev->gmc.xgmi.connected_to_cpu)) {
> > > > >         adev->gmc.aper_base =
> > > > >                 adev->gfxhub.funcs->get_mc_fb_offset(adev) +
> > > > >                 adev->gmc.xgmi.physical_node_id *
> > > > >                 adev->gmc.xgmi.node_segment_size;
> > > > >         adev->gmc.aper_size = adev->gmc.real_vram_size;
> > > > > }
> > > > >
> > > > > > > > > > ... which brings me to a point that's been puzzling me
> > > > > > > > > > for some time, which is that as the hw init fails, the
> > > > > > > > > > efifb driver is still using the framebuffer.
> > > > > > > > >
> > > > > > > > > No, it isn't. You are probably just still seeing the
> > > > > > > > > same screen.
> > > > > > > > >
> > > > > > > > > The issue is most likely that while efi was kicked out
> > > > > > > > > nobody re-programmed the display hardware to show
> > > > > > > > > something different.
Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm
On Wed, Dec 29, 2021 at 11:59 AM Yann Dirson wrote:
>
> Alex wrote:
> > On Tue, Dec 21, 2021 at 6:09 PM Yann Dirson wrote:
> > >
> > > > - Original Message -
> > > > From: "Alex Deucher"
> > > > To: "Yann Dirson"
> > > > Cc: "Christian König", "amd-gfx list"
> > > > Sent: Tuesday, 21 December 2021 23:31:01
> > > > Subject: Re: Various problems trying to vga-passthrough a
> > > > Renoir iGPU to a xen/qubes-os hvm
> > > >
> > > > On Tue, Dec 21, 2021 at 5:12 PM Yann Dirson wrote:
> > > > >
> > > > > Alex wrote:
> > > > > >
> > > > > > On Sun, Dec 19, 2021 at 11:41 AM Yann Dirson wrote:
> > > > > > >
> > > > > > > Christian wrote:
> > > > > > > > On 19.12.21 at 17:00, Yann Dirson wrote:
> > > > > > > > > Alex wrote:
> > > > > > > > >> Thinking about this more, I think the problem might be
> > > > > > > > >> related to CPU access to "VRAM". APUs don't have
> > > > > > > > >> dedicated VRAM, they use a reserved carve out region at
> > > > > > > > >> the top of system memory. For CPU access to this memory,
> > > > > > > > >> we kmap the physical address of the carve out region of
> > > > > > > > >> system memory. You'll need to make sure that region is
> > > > > > > > >> accessible to the guest.
> > > > > > > > > So basically, the non-virt flow is: (video?) BIOS
> > > > > > > > > reserves memory, marks it as reserved in e820, stores
> > > > > > > > > the physaddr somewhere, which the GPU driver gets.
> > > > > > > > > Since I suppose this includes the framebuffer, this
> > > > > > > > > probably has to occur around the moment the driver calls
> > > > > > > > > drm_aperture_remove_conflicting_pci_framebuffers()
> > > > > > > > > (which happens before this hw init step), right?
> > > > > > > >
> > > > > > > > Well, that's partially correct. The efifb is using the
> > > > > > > > PCIe resources to access the framebuffer and as far as
> > > > > > > > I know we use that one to kick it out.
> > > > > > > >
> > > > > > > > The stolen memory we get over e820/registers is separate
> > > > > > > > to that.
> > > > >
> > > > > How is the stolen memory communicated to the driver?
> > > > > That host physical memory probably has to be mapped at
> > > > > the same guest physical address for the magic to work,
> > > > > right?
> > > >
> > > > Correct. The driver reads the physical location of that
> > > > memory from hardware registers. Removing this chunk of
> > > > code from gmc_v9_0.c will force the driver to use the
> > > > BAR, but I'm not sure if there are any other places in
> > > > the driver that make assumptions about using the
> > > > physical host address or not on APUs off hand.
> > > >
> > > > if ((adev->flags & AMD_IS_APU) ||
> > > >     (adev->gmc.xgmi.supported &&
> > > >      adev->gmc.xgmi.connected_to_cpu)) {
> > > >         adev->gmc.aper_base =
> > > >                 adev->gfxhub.funcs->get_mc_fb_offset(adev) +
> > > >                 adev->gmc.xgmi.physical_node_id *
> > > >                 adev->gmc.xgmi.node_segment_size;
> > > >         adev->gmc.aper_size = adev->gmc.real_vram_size;
> > > > }
> > > >
> > > > > > > > > ... which brings me to a point that's been puzzling me
> > > > > > > > > for some time, which is that as the hw init fails, the
> > > > > > > > > efifb driver is still using the framebuffer.
> > > > > > > >
> > > > > > > > No, it isn't. You are probably just still seeing the
> > > > > > > > same screen.
> > > > > > > >
> > > > > > > > The issue is most likely that while efi was kicked out
> > > > > > > > nobody re-programmed the display hardware to show
> > > > > > > > something different.
> > > > > > > >
> > > > > > > > > Am I right in suspecting that efifb should get stripped
> > > > > > > > > of its ownership of the fb aperture first, and that if
> > > > > > > > > I don't get a black screen on hw_init failure that issue
> > > > > > > > > should be the first focus point?
> > > > > > > >
> > > > > > > > Your assumption with the black screen is incorrect.
> > > > > > > > Since the hardware works independently even if you
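The quoted gmc_v9_0.c fragment can be read as a choice between two CPU-visible apertures for VRAM. Below is a minimal sketch in C, assuming (as stated in the thread) that without this chunk the driver ends up using the PCI BAR. struct aperture_sketch, pick_aperture() and the force_bar switch are invented for illustration and are not amdgpu code; the field comments say which driver value each one stands in for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the fields the quoted chunk touches. */
struct aperture_sketch {
	bool is_apu;                /* adev->flags & AMD_IS_APU */
	bool xgmi_connected_to_cpu; /* xgmi.supported && xgmi.connected_to_cpu */
	uint64_t bar_base;          /* start of the PCI BAR (assumed default) */
	uint64_t bar_size;          /* length of the PCI BAR */
	uint64_t fb_offset;         /* what get_mc_fb_offset() would return */
	uint64_t xgmi_node_id;      /* xgmi.physical_node_id */
	uint64_t xgmi_segment_size; /* xgmi.node_segment_size */
	uint64_t real_vram_size;    /* gmc.real_vram_size */
	uint64_t aper_base;         /* result: base used for CPU access */
	uint64_t aper_size;         /* result: size used for CPU access */
};

/*
 * Start from the BAR, then apply the APU/XGMI override from the quoted
 * chunk unless force_bar is set. Setting force_bar mimics removing the
 * chunk, which is the experiment discussed in the thread.
 */
static void pick_aperture(struct aperture_sketch *a, bool force_bar)
{
	a->aper_base = a->bar_base;
	a->aper_size = a->bar_size;

	if (!force_bar && (a->is_apu || a->xgmi_connected_to_cpu)) {
		a->aper_base = a->fb_offset +
			       a->xgmi_node_id * a->xgmi_segment_size;
		a->aper_size = a->real_vram_size;
	}
}
```

On an APU with force_bar unset, CPU access goes straight to the carve-out at its host physical address, which only works in a guest if that range is mapped at the same guest physical address. With force_bar set, access goes through the BAR, which the hypervisor already maps for a passed-through device.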
Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm
Alex wrote:
> On Tue, Dec 21, 2021 at 6:09 PM Yann Dirson wrote:
> >
> > > - Original Message -
> > > From: "Alex Deucher"
> > > To: "Yann Dirson"
> > > Cc: "Christian König", "amd-gfx list"
> > > Sent: Tuesday, 21 December 2021 23:31:01
> > > Subject: Re: Various problems trying to vga-passthrough a
> > > Renoir iGPU to a xen/qubes-os hvm
> > >
> > > On Tue, Dec 21, 2021 at 5:12 PM Yann Dirson wrote:
> > > >
> > > > Alex wrote:
> > > > >
> > > > > On Sun, Dec 19, 2021 at 11:41 AM Yann Dirson wrote:
> > > > > >
> > > > > > Christian wrote:
> > > > > > > On 19.12.21 at 17:00, Yann Dirson wrote:
> > > > > > > > Alex wrote:
> > > > > > > >> Thinking about this more, I think the problem might be
> > > > > > > >> related to CPU access to "VRAM". APUs don't have
> > > > > > > >> dedicated VRAM, they use a reserved carve out region at
> > > > > > > >> the top of system memory. For CPU access to this memory,
> > > > > > > >> we kmap the physical address of the carve out region of
> > > > > > > >> system memory. You'll need to make sure that region is
> > > > > > > >> accessible to the guest.
> > > > > > > > So basically, the non-virt flow is: (video?) BIOS
> > > > > > > > reserves memory, marks it as reserved in e820, stores
> > > > > > > > the physaddr somewhere, which the GPU driver gets.
> > > > > > > > Since I suppose this includes the framebuffer, this
> > > > > > > > probably has to occur around the moment the driver calls
> > > > > > > > drm_aperture_remove_conflicting_pci_framebuffers()
> > > > > > > > (which happens before this hw init step), right?
> > > > > > >
> > > > > > > Well, that's partially correct. The efifb is using the
> > > > > > > PCIe resources to access the framebuffer and as far as
> > > > > > > I know we use that one to kick it out.
> > > > > > >
> > > > > > > The stolen memory we get over e820/registers is separate
> > > > > > > to that.
> > > >
> > > > How is the stolen memory communicated to the driver?
> > > > That host physical memory probably has to be mapped at
> > > > the same guest physical address for the magic to work,
> > > > right?
> > >
> > > Correct. The driver reads the physical location of that
> > > memory from hardware registers. Removing this chunk of
> > > code from gmc_v9_0.c will force the driver to use the
> > > BAR, but I'm not sure if there are any other places in
> > > the driver that make assumptions about using the
> > > physical host address or not on APUs off hand.
> > >
> > > if ((adev->flags & AMD_IS_APU) ||
> > >     (adev->gmc.xgmi.supported &&
> > >      adev->gmc.xgmi.connected_to_cpu)) {
> > >         adev->gmc.aper_base =
> > >                 adev->gfxhub.funcs->get_mc_fb_offset(adev) +
> > >                 adev->gmc.xgmi.physical_node_id *
> > >                 adev->gmc.xgmi.node_segment_size;
> > >         adev->gmc.aper_size = adev->gmc.real_vram_size;
> > > }
> > >
> > > > > > > > ... which brings me to a point that's been puzzling me
> > > > > > > > for some time, which is that as the hw init fails, the
> > > > > > > > efifb driver is still using the framebuffer.
> > > > > > >
> > > > > > > No, it isn't. You are probably just still seeing the
> > > > > > > same screen.
> > > > > > >
> > > > > > > The issue is most likely that while efi was kicked out
> > > > > > > nobody re-programmed the display hardware to show
> > > > > > > something different.
> > > > > > >
> > > > > > > > Am I right in suspecting that efifb should get stripped
> > > > > > > > of its ownership of the fb aperture first, and that if
> > > > > > > > I don't get a black screen on hw_init failure that issue
> > > > > > > > should be the first focus point?
> > > > > > >
> > > > > > > Your assumption with the black screen is incorrect.
> > > > > > > Since the hardware works independently, even if you kick
> > > > > > > out efi you still have the same screen content, you just
> > > > > > > can't update it anymore.
> > > > > >
> > > > > > It's not only that the screen keeps its contents, it's
> > > > > > that the dom0 happily continues updating it.
> > > > >
> > > > > If the hypervisor is
Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm
On Tue, Dec 21, 2021 at 6:09 PM Yann Dirson wrote: > > > > - Original message - > > From: "Alex Deucher" > > To: "Yann Dirson" > > Cc: "Christian König" , "amd-gfx list" > > > > Sent: Tuesday, 21 December 2021 23:31:01 > > Subject: Re: Various problems trying to vga-passthrough a Renoir iGPU to a > > xen/qubes-os hvm > > > > On Tue, Dec 21, 2021 at 5:12 PM Yann Dirson wrote: > > > > > > > > > Alex wrote: > > > > > > > > On Sun, Dec 19, 2021 at 11:41 AM Yann Dirson > > > > wrote: > > > > > > > > > > Christian wrote: > > > > > > On 19.12.21 at 17:00, Yann Dirson wrote: > > > > > > > Alex wrote: > > > > > > >> Thinking about this more, I think the problem might be > > > > > > >> related > > > > > > >> to > > > > > > >> CPU > > > > > > >> access to "VRAM". APUs don't have dedicated VRAM, they > > > > > > >> use a > > > > > > >> reserved > > > > > > >> carve out region at the top of system memory. For CPU > > > > > > >> access > > > > > > >> to > > > > > > >> this > > > > > > >> memory, we kmap the physical address of the carve out > > > > > > >> region > > > > > > >> of > > > > > > >> system > > > > > > >> memory. You'll need to make sure that region is > > > > > > >> accessible to > > > > > > >> the > > > > > > >> guest. > > > > > > > So basically, the non-virt flow is: (video?) BIOS > > > > > > > reserves > > > > > > > memory, marks it > > > > > > > as reserved in e820, stores the physaddr somewhere, which > > > > > > > the > > > > > > > GPU > > > > > > > driver gets. > > > > > > > Since I suppose this includes the framebuffer, this > > > > > > > probably > > > > > > > has to > > > > > > > occur around > > > > > > > the moment the driver calls > > > > > > > drm_aperture_remove_conflicting_pci_framebuffers() > > > > > > > (which happens before this hw init step), right ? > > > > > > > > > > > > Well, that's partially correct.
The efifb is using the PCIe > > > > > > resources > > > > > > to > > > > > > access the framebuffer and as far as I know we use that one > > > > > > to > > > > > > kick > > > > > > it out. > > > > > > > > > > > > The stolen memory we get over e820/registers is separate to > > > > > > that. > > > > > > How is the stolen memory communicated to the driver ? That host > > > physical > > > memory probably has to be mapped at the same guest physical address > > > for > > > the magic to work, right ? > > > > Correct. The driver reads the physical location of that memory from > > hardware registers. Removing this chunk of code from gmc_v9_0.c will > > force the driver to use the BAR, but I'm not sure if there are any > > other places in the driver that make assumptions about using the > > physical host address or not on APUs off hand. > > > > if ((adev->flags & AMD_IS_APU) || > > (adev->gmc.xgmi.supported && > > adev->gmc.xgmi.connected_to_cpu)) { > > adev->gmc.aper_base = > > adev->gfxhub.funcs->get_mc_fb_offset(adev) + > > adev->gmc.xgmi.physical_node_id * > > adev->gmc.xgmi.node_segment_size; > > adev->gmc.aper_size = adev->gmc.real_vram_size; > > } > > > > > > > > > > > > > > > > > > > > > > ... which brings me to a point that's been puzzling me for > > > > > > > some > > > > > > > time, which is > > > > > > > that as the hw init fails, the efifb driver is still using > > > > > > > the > > > > > > > framebuffer. > > > > > > > > > > > > No, it isn't. You are probably just still seeing the same > > > > > > screen. > > > > > > > > > > > > The issue is most likely that while efi was kicked out nobody > > > > > > re-programmed the display hardware to show something > > > > > > different. 
> > > > > > > > > > > > > Am I right in suspecting that efifb should get stripped of > > > > > > > its > > > > > > > ownership of the > > > > > > > fb aperture first, and that if I don't get a black screen > > > > > > > on > > > > > > > hw_init failure > > > > > > > that issue should be the first focus point ? > > > > > > > > > > > > You assumption with the black screen is incorrect. Since the > > > > > > hardware > > > > > > works independent even if you kick out efi you still have the > > > > > > same > > > > > > screen content, you just can't update it anymore. > > > > > > > > > > It's not only that the screen keeps its contents, it's that the > > > > > dom0 > > > > > happily continues updating it. > > > > > > > > If the hypevisor is using efifb, then yes that could be a problem > > > > as > > > > the hypervisor could be writing to the efifb resources which ends > > > > up > > > > writing to the same physical memory. That applies to any GPU on > > > > a > > > > UEFI system. You'll need to make sure efifb is not in use in the > > > > hypervisor. > > > > > > That remark evokes several things to me. First one is that every > > > time > > > I've tried booting with efifb disabled in dom0, there was no > > > visible > > > improvements in the guest driver
Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm
- Original message - > From: "Alex Deucher" > To: "Yann Dirson" > Cc: "Christian König" , "amd-gfx list" > > Sent: Tuesday, 21 December 2021 23:31:01 > Subject: Re: Various problems trying to vga-passthrough a Renoir iGPU to a > xen/qubes-os hvm > > On Tue, Dec 21, 2021 at 5:12 PM Yann Dirson wrote: > > > > > > Alex wrote: > > > > > > On Sun, Dec 19, 2021 at 11:41 AM Yann Dirson > > > wrote: > > > > > > > > Christian wrote: > > > > > On 19.12.21 at 17:00, Yann Dirson wrote: > > > > > > Alex wrote: > > > > > >> Thinking about this more, I think the problem might be > > > > > >> related > > > > > >> to > > > > > >> CPU > > > > > >> access to "VRAM". APUs don't have dedicated VRAM, they > > > > > >> use a > > > > > >> reserved > > > > > >> carve out region at the top of system memory. For CPU > > > > > >> access > > > > > >> to > > > > > >> this > > > > > >> memory, we kmap the physical address of the carve out > > > > > >> region > > > > > >> of > > > > > >> system > > > > > >> memory. You'll need to make sure that region is > > > > > >> accessible to > > > > > >> the > > > > > >> guest. > > > > > > So basically, the non-virt flow is: (video?) BIOS > > > > > > reserves > > > > > > memory, marks it > > > > > > as reserved in e820, stores the physaddr somewhere, which > > > > > > the > > > > > > GPU > > > > > > driver gets. > > > > > > Since I suppose this includes the framebuffer, this > > > > > > probably > > > > > > has to > > > > > > occur around > > > > > > the moment the driver calls > > > > > > drm_aperture_remove_conflicting_pci_framebuffers() > > > > > > (which happens before this hw init step), right ? > > > > > > > > > > Well, that's partially correct. The efifb is using the PCIe > > > > > resources > > > > > to > > > > > access the framebuffer and as far as I know we use that one > > > > > to > > > > > kick > > > > > it out. > > > > > > > > > > The stolen memory we get over e820/registers is separate to > > > > > that.
> > > > How is the stolen memory communicated to the driver ? That host > > physical > > memory probably has to be mapped at the same guest physical address > > for > > the magic to work, right ? > > Correct. The driver reads the physical location of that memory from > hardware registers. Removing this chunk of code from gmc_v9_0.c will > force the driver to use the BAR, but I'm not sure if there are any > other places in the driver that make assumptions about using the > physical host address or not on APUs off hand. > > if ((adev->flags & AMD_IS_APU) || > (adev->gmc.xgmi.supported && > adev->gmc.xgmi.connected_to_cpu)) { > adev->gmc.aper_base = > adev->gfxhub.funcs->get_mc_fb_offset(adev) + > adev->gmc.xgmi.physical_node_id * > adev->gmc.xgmi.node_segment_size; > adev->gmc.aper_size = adev->gmc.real_vram_size; > } > > > > > > > > > > > > > > > > ... which brings me to a point that's been puzzling me for > > > > > > some > > > > > > time, which is > > > > > > that as the hw init fails, the efifb driver is still using > > > > > > the > > > > > > framebuffer. > > > > > > > > > > No, it isn't. You are probably just still seeing the same > > > > > screen. > > > > > > > > > > The issue is most likely that while efi was kicked out nobody > > > > > re-programmed the display hardware to show something > > > > > different. > > > > > > > > > > > Am I right in suspecting that efifb should get stripped of > > > > > > its > > > > > > ownership of the > > > > > > fb aperture first, and that if I don't get a black screen > > > > > > on > > > > > > hw_init failure > > > > > > that issue should be the first focus point ? > > > > > > > > > > You assumption with the black screen is incorrect. Since the > > > > > hardware > > > > > works independent even if you kick out efi you still have the > > > > > same > > > > > screen content, you just can't update it anymore. 
> > > > > > > > It's not only that the screen keeps its contents, it's that the > > > > dom0 > > > > happily continues updating it. > > > > > > If the hypevisor is using efifb, then yes that could be a problem > > > as > > > the hypervisor could be writing to the efifb resources which ends > > > up > > > writing to the same physical memory. That applies to any GPU on > > > a > > > UEFI system. You'll need to make sure efifb is not in use in the > > > hypervisor. > > > > That remark evokes several things to me. First one is that every > > time > > I've tried booting with efifb disabled in dom0, there was no > > visible > > improvements in the guest driver - i.i. I really have to dig how > > vram mapping > > is performed and check things are as expected anyway. > > Ultimately you end up at the same physical memory. efifb uses the > PCI > BAR which points to the same physical memory that the driver directly > maps. > > > > > The other is that, when dom0 cannot use efifb,
Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm
On Tue, Dec 21, 2021 at 5:12 PM Yann Dirson wrote:
>
> Alex wrote:
> > On Sun, Dec 19, 2021 at 11:41 AM Yann Dirson wrote:
> > > Christian wrote:
> > > > On 19.12.21 at 17:00, Yann Dirson wrote:
> > > > > Alex wrote:
> > > > >> Thinking about this more, I think the problem might be related to
> > > > >> CPU access to "VRAM". APUs don't have dedicated VRAM, they use a
> > > > >> reserved carve out region at the top of system memory. For CPU
> > > > >> access to this memory, we kmap the physical address of the carve
> > > > >> out region of system memory. You'll need to make sure that region
> > > > >> is accessible to the guest.
> > > > > So basically, the non-virt flow is: (video?) BIOS reserves memory,
> > > > > marks it as reserved in e820, stores the physaddr somewhere, which
> > > > > the GPU driver gets.
> > > > > Since I suppose this includes the framebuffer, this probably has to
> > > > > occur around the moment the driver calls
> > > > > drm_aperture_remove_conflicting_pci_framebuffers()
> > > > > (which happens before this hw init step), right ?
> > > >
> > > > Well, that's partially correct. The efifb is using the PCIe resources
> > > > to access the framebuffer and as far as I know we use that one to
> > > > kick it out.
> > > >
> > > > The stolen memory we get over e820/registers is separate to that.
>
> How is the stolen memory communicated to the driver ? That host physical
> memory probably has to be mapped at the same guest physical address for
> the magic to work, right ?

Correct. The driver reads the physical location of that memory from
hardware registers. Removing this chunk of code from gmc_v9_0.c will
force the driver to use the BAR, but offhand I'm not sure whether other
places in the driver assume the physical host address is used on APUs.

        if ((adev->flags & AMD_IS_APU) ||
            (adev->gmc.xgmi.supported &&
             adev->gmc.xgmi.connected_to_cpu)) {
                adev->gmc.aper_base =
                        adev->gfxhub.funcs->get_mc_fb_offset(adev) +
                        adev->gmc.xgmi.physical_node_id *
                        adev->gmc.xgmi.node_segment_size;
                adev->gmc.aper_size = adev->gmc.real_vram_size;
        }

> > > > > ... which brings me to a point that's been puzzling me for some
> > > > > time, which is that as the hw init fails, the efifb driver is still
> > > > > using the framebuffer.
> > > >
> > > > No, it isn't. You are probably just still seeing the same screen.
> > > >
> > > > The issue is most likely that while efi was kicked out nobody
> > > > re-programmed the display hardware to show something different.
> > > >
> > > > > Am I right in suspecting that efifb should get stripped of its
> > > > > ownership of the fb aperture first, and that if I don't get a black
> > > > > screen on hw_init failure that issue should be the first focus
> > > > > point ?
> > > >
> > > > Your assumption with the black screen is incorrect. Since the
> > > > hardware works independently, even if you kick out efi you still have
> > > > the same screen content, you just can't update it anymore.
> > >
> > > It's not only that the screen keeps its contents, it's that the dom0
> > > happily continues updating it.
> >
> > If the hypervisor is using efifb, then yes that could be a problem as
> > the hypervisor could be writing to the efifb resources which ends up
> > writing to the same physical memory. That applies to any GPU on a
> > UEFI system. You'll need to make sure efifb is not in use in the
> > hypervisor.
>
> That remark evokes several things to me. First one is that every time
> I've tried booting with efifb disabled in dom0, there was no visible
> improvement in the guest driver - i.e. I really have to dig into how vram
> mapping is performed and check things are as expected anyway.

Ultimately you end up at the same physical memory. efifb uses the PCI
BAR which points to the same physical memory that the driver directly
maps.

> The other is that, when dom0 cannot use efifb, entering a LUKS key is
> suddenly less user-friendly. But in theory I'd think we could overcome
> this by letting dom0 use efifb until ready to start the guest, a simple
> driver unbind at the right moment should be expected to work, right ?
> Going further and allowing the guest to use efifb on its own could
> possibly be more tricky (starting with a different state?) but does
> not seem to sound completely outlandish either - or does it ?

efifb just takes whatever hardware state the GOP driver in the pre-OS
environment left the GPU in. Once you have a driver loaded in the OS,
that state is gone so I don't see much value in using efifb once you
have a real driver in the mix. If you want a console
Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm
Alex wrote: > > On Sun, Dec 19, 2021 at 11:41 AM Yann Dirson wrote: > > > > Christian wrote: > > > On 19.12.21 at 17:00, Yann Dirson wrote: > > > > Alex wrote: > > > >> Thinking about this more, I think the problem might be related > > > >> to > > > >> CPU > > > >> access to "VRAM". APUs don't have dedicated VRAM, they use a > > > >> reserved > > > >> carve out region at the top of system memory. For CPU access > > > >> to > > > >> this > > > >> memory, we kmap the physical address of the carve out region > > > >> of > > > >> system > > > >> memory. You'll need to make sure that region is accessible to > > > >> the > > > >> guest. > > > > So basically, the non-virt flow is: (video?) BIOS reserves > > > > memory, marks it > > > > as reserved in e820, stores the physaddr somewhere, which the > > > > GPU > > > > driver gets. > > > > Since I suppose this includes the framebuffer, this probably > > > > has to > > > > occur around > > > > the moment the driver calls > > > > drm_aperture_remove_conflicting_pci_framebuffers() > > > > (which happens before this hw init step), right ? > > > > > > Well, that's partially correct. The efifb is using the PCIe > > > resources > > > to > > > access the framebuffer and as far as I know we use that one to > > > kick > > > it out. > > > > > > The stolen memory we get over e820/registers is separate to that. How is the stolen memory communicated to the driver ? That host physical memory probably has to be mapped at the same guest physical address for the magic to work, right ? > > > > > > > ... which brings me to a point that's been puzzling me for some > > > > time, which is > > > > that as the hw init fails, the efifb driver is still using the > > > > framebuffer. > > > > > > No, it isn't. You are probably just still seeing the same screen. > > > > > > The issue is most likely that while efi was kicked out nobody > > > re-programmed the display hardware to show something different.
> > > > > > > Am I right in suspecting that efifb should get stripped of its > > > > ownership of the > > > > fb aperture first, and that if I don't get a black screen on > > > > hw_init failure > > > > that issue should be the first focus point ? > > > > > > Your assumption with the black screen is incorrect. Since the > > > hardware > > > works independently, even if you kick out efi you still have the > > > same > > > screen content, you just can't update it anymore. > > > > It's not only that the screen keeps its contents, it's that the > > dom0 > > happily continues updating it. > > If the hypervisor is using efifb, then yes that could be a problem as > the hypervisor could be writing to the efifb resources which ends up > writing to the same physical memory. That applies to any GPU on a > UEFI system. You'll need to make sure efifb is not in use in the > hypervisor. That remark brings several things to mind. The first is that every time I've tried booting with efifb disabled in dom0, there was no visible improvement in the guest driver - i.e. I really have to dig into how vram mapping is performed and check things are as expected anyway. The other is that, when dom0 cannot use efifb, entering a LUKS key is suddenly less user-friendly. But in theory I'd think we could overcome this by letting dom0 use efifb until ready to start the guest; a simple driver unbind at the right moment should be expected to work, right ? Going further and allowing the guest to use efifb on its own could possibly be more tricky (starting with a different state?) but does not seem to sound completely outlandish either - or does it ? > > Alex > > > > > > > But putting efi aside what Alex pointed out pretty much breaks > > > your > > > neck trying to forward the device. You maybe could try to hack > > > the > > > driver to use the PCIe BAR for framebuffer access, but that might > > > be > > > quite a bit slower. > > > > > > Regards, > > > Christian.
> > > > > > > > > > >> Alex > > > >> > > > >> On Mon, Dec 13, 2021 at 3:29 PM Alex Deucher > > > >> > > > >> wrote: > > > >>> On Sun, Dec 12, 2021 at 5:19 PM Yann Dirson > > > >>> wrote: > > > Alex wrote: > > > > On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson > > > > > > > > wrote: > > > >> Hi Alex, > > > >> > > > >>> We have not validated virtualization of our integrated > > > >>> GPUs. I > > > >>> don't > > > >>> know that it will work at all. We had done a bit of > > > >>> testing but > > > >>> ran > > > >>> into the same issues with the PSP, but never had a chance > > > >>> to > > > >>> debug > > > >>> further because this feature is not productized. > > > >> ... > > > >>> You need a functional PSP to get the GPU driver up and > > > >>> running. > > > >> Ah, thanks for the hint :) > > > >> > > > >> I guess that if I want to have any chance to get the PSP > > > >> working > > > >> I'm > > > >> going to need more details on it. A quick search some > > >
Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm
On Sun, Dec 19, 2021 at 11:41 AM Yann Dirson wrote: > > Christian wrote: > > On 19.12.21 at 17:00, Yann Dirson wrote: > > > Alex wrote: > > >> Thinking about this more, I think the problem might be related to > > >> CPU > > >> access to "VRAM". APUs don't have dedicated VRAM, they use a > > >> reserved > > >> carve out region at the top of system memory. For CPU access to > > >> this > > >> memory, we kmap the physical address of the carve out region of > > >> system > > >> memory. You'll need to make sure that region is accessible to the > > >> guest. > > > So basically, the non-virt flow is: (video?) BIOS reserves > > > memory, marks it > > > as reserved in e820, stores the physaddr somewhere, which the GPU > > > driver gets. > > > Since I suppose this includes the framebuffer, this probably has to > > > occur around > > > the moment the driver calls > > > drm_aperture_remove_conflicting_pci_framebuffers() > > > (which happens before this hw init step), right ? > > > > Well, that's partially correct. The efifb is using the PCIe resources > > to > > access the framebuffer and as far as I know we use that one to kick > > it out. > > > > The stolen memory we get over e820/registers is separate to that. > > > > > ... which brings me to a point that's been puzzling me for some > > > time, which is > > > that as the hw init fails, the efifb driver is still using the > > > framebuffer. > > > > No, it isn't. You are probably just still seeing the same screen. > > > > The issue is most likely that while efi was kicked out nobody > > re-programmed the display hardware to show something different. > > > > > Am I right in suspecting that efifb should get stripped of its > > > ownership of the > > > fb aperture first, and that if I don't get a black screen on > > > hw_init failure > > > that issue should be the first focus point ? > > > > Your assumption with the black screen is incorrect.
Since the hardware > > works independently, even if you kick out efi you still have the same > > screen content, you just can't update it anymore. > > It's not only that the screen keeps its contents, it's that the dom0 > happily continues updating it. If the hypervisor is using efifb, then yes that could be a problem as the hypervisor could be writing to the efifb resources which ends up writing to the same physical memory. That applies to any GPU on a UEFI system. You'll need to make sure efifb is not in use in the hypervisor. Alex > > > But putting efi aside what Alex pointed out pretty much breaks your > > neck trying to forward the device. You maybe could try to hack the > > driver to use the PCIe BAR for framebuffer access, but that might be > > quite a bit slower. > > > > Regards, > > Christian. > > > > > > > >> Alex > > >> > > >> On Mon, Dec 13, 2021 at 3:29 PM Alex Deucher > > >> > > >> wrote: > > >>> On Sun, Dec 12, 2021 at 5:19 PM Yann Dirson > > >>> wrote: > > Alex wrote: > > > On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson > > > wrote: > > >> Hi Alex, > > >> > > >>> We have not validated virtualization of our integrated > > >>> GPUs. I > > >>> don't > > >>> know that it will work at all. We had done a bit of > > >>> testing but > > >>> ran > > >>> into the same issues with the PSP, but never had a chance > > >>> to > > >>> debug > > >>> further because this feature is not productized. > > >> ... > > >>> You need a functional PSP to get the GPU driver up and > > >>> running. > > >> Ah, thanks for the hint :) > > >> > > >> I guess that if I want to have any chance to get the PSP > > >> working > > >> I'm > > >> going to need more details on it. A quick search some time > > >> ago > > >> mostly > > >> brought reverse-engineering work, rather than official AMD > > >> doc. > > >> Are > > >> there some AMD resources I missed ? > > > The driver code is pretty much it.
> > Let's try to shed some more light on how things work, taking as > > excuse > > psp_v12_0_ring_create(). > > > > First, register access through [RW]REG32_SOC15() is implemented > > in > > terms of __[RW]REG32_SOC15_RLC__(), which is basically a > > [RW]REG32(), > > except it has to be more complex in the SR-IOV case. > > Has the RLC anything to do with SR-IOV ? > > >>> When running the driver on a SR-IOV virtual function (VF), some > > >>> registers are not available directly via the VF's MMIO aperture > > >>> so > > >>> they need to go through the RLC. For bare metal or passthrough > > >>> this > > >>> is not relevant. > > >>> > > It accesses registers in the MMIO range of the MP0 IP, and the > > "MP0" > > name correlates highly with MMIO accesses in PSP-handling code. > > Is "MP0" another name for PSP (and "MP1" for SMU) ? The MP0 > > version > > >>> Yes. > > >>> > > reported at v11.0.3 by discovery seems to contradict the use of > >
Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm
Christian wrote: > On 19.12.21 at 17:00, Yann Dirson wrote: > > Alex wrote: > >> Thinking about this more, I think the problem might be related to > >> CPU > >> access to "VRAM". APUs don't have dedicated VRAM, they use a > >> reserved > >> carve out region at the top of system memory. For CPU access to > >> this > >> memory, we kmap the physical address of the carve out region of > >> system > >> memory. You'll need to make sure that region is accessible to the > >> guest. > > So basically, the non-virt flow is: (video?) BIOS reserves > > memory, marks it > > as reserved in e820, stores the physaddr somewhere, which the GPU > > driver gets. > > Since I suppose this includes the framebuffer, this probably has to > > occur around > > the moment the driver calls > > drm_aperture_remove_conflicting_pci_framebuffers() > > (which happens before this hw init step), right ? > > Well, that's partially correct. The efifb is using the PCIe resources > to > access the framebuffer and as far as I know we use that one to kick > it out. > > The stolen memory we get over e820/registers is separate to that. > > > ... which brings me to a point that's been puzzling me for some > > time, which is > > that as the hw init fails, the efifb driver is still using the > > framebuffer. > > No, it isn't. You are probably just still seeing the same screen. > > The issue is most likely that while efi was kicked out nobody > re-programmed the display hardware to show something different. > > > Am I right in suspecting that efifb should get stripped of its > > ownership of the > > fb aperture first, and that if I don't get a black screen on > > hw_init failure > > that issue should be the first focus point ? > > Your assumption with the black screen is incorrect. Since the hardware > works independently, even if you kick out efi you still have the same > screen content, you just can't update it anymore.
It's not only that the screen keeps its contents, it's that the dom0 happily continues updating it. > But putting efi asside what Alex pointed out pretty much breaks your > neck trying to forward the device. You maybe could try to hack the > driver to use the PCIe BAR for framebuffer access, but that might be > quite a bit slower. > > Regards, > Christian. > > > > >> Alex > >> > >> On Mon, Dec 13, 2021 at 3:29 PM Alex Deucher > >> > >> wrote: > >>> On Sun, Dec 12, 2021 at 5:19 PM Yann Dirson > >>> wrote: > Alex wrote: > > On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson > > wrote: > >> Hi Alex, > >> > >>> We have not validated virtualization of our integrated > >>> GPUs. I > >>> don't > >>> know that it will work at all. We had done a bit of > >>> testing but > >>> ran > >>> into the same issues with the PSP, but never had a chance > >>> to > >>> debug > >>> further because this feature is not productized. > >> ... > >>> You need a functional PSP to get the GPU driver up and > >>> running. > >> Ah, thanks for the hint :) > >> > >> I guess that if I want to have any chance to get the PSP > >> working > >> I'm > >> going to need more details on it. A quick search some time > >> ago > >> mostly > >> brought reverse-engineering work, rather than official AMD > >> doc. > >> Are > >> there some AMD resources I missed ? > > The driver code is pretty much it. > Let's try to shed some more light on how things work, taking as > excuse > psp_v12_0_ring_create(). > > First, register access through [RW]REG32_SOC15() is implemented > in > terms of __[RW]REG32_SOC15_RLC__(), which is basically a > [RW]REG32(), > except it has to be more complex in the SR-IOV case. > Has the RLC anything to do with SR-IOV ? > >>> When running the driver on a SR-IOV virtual function (VF), some > >>> registers are not available directly via the VF's MMIO aperture > >>> so > >>> they need to go through the RLC. For bare metal or passthrough > >>> this > >>> is not relevant. 
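[Editor's sketch] The point about SR-IOV register access can be pictured roughly as follows. This is only an illustration in plain C, not driver code: `struct dev_sketch`, `read_via_rlc()` and `read_reg()` are hypothetical names, the MMIO aperture is replaced by an array, and the "RLC" path merely counts requests. The one idea taken from the thread is that the indirect path matters only on a virtual function, and is irrelevant on bare metal or passthrough.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NUM_REGS 8u

/* Hypothetical device model: none of these names come from amdgpu. */
struct dev_sketch {
    bool     is_sriov_vf;                  /* running on an SR-IOV virtual function? */
    uint32_t mmio[SKETCH_NUM_REGS];        /* stands in for the MMIO aperture */
    bool     vf_blocked[SKETCH_NUM_REGS];  /* registers a VF cannot reach directly */
    unsigned rlc_requests;                 /* accesses routed through the "RLC" */
};

/* Stand-in for asking the RLC to perform the access on the VF's behalf. */
uint32_t read_via_rlc(struct dev_sketch *d, size_t reg)
{
    d->rlc_requests++;
    return d->mmio[reg];
}

/* Direct read unless we are a VF and the register is not directly reachable. */
uint32_t read_reg(struct dev_sketch *d, size_t reg)
{
    assert(d && reg < SKETCH_NUM_REGS);
    if (d->is_sriov_vf && d->vf_blocked[reg])
        return read_via_rlc(d, reg);
    return d->mmio[reg];
}
```

With `is_sriov_vf` left false, which models the bare metal and passthrough cases, `read_reg()` never takes the indirect path, whatever `vf_blocked` says.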
> >>> > It accesses registers in the MMIO range of the MP0 IP, and the > "MP0" > name correlates highly with MMIO accesses in PSP-handling code. > Is "MP0" another name for PSP (and "MP1" for SMU) ? The MP0 > version > >>> Yes. > >>> > reported at v11.0.3 by discovery seems to contradict the use of > v12.0 > for RENOIR as set by soc15_set_ip_blocks(), or do I miss > something ? > >>> Typo in the ip discovery table on renoir. > >>> > More generally (and mostly out of curiosity while we're at it), > do we > have a way to match IPs listed at discovery time with the ones > used > in the driver ? > >>> In general, barring typos, the code is shared at the major > >>> version > >>> level. The actual code may or may not need changes to handle > >>> minor > >>> revision changes in an IP. The driver maps the IP versions from > >>> the > >>> ip discovery
Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm
On 19.12.21 at 17:00, Yann Dirson wrote:
> Alex wrote:
>> Thinking about this more, I think the problem might be related to CPU
>> access to "VRAM". APUs don't have dedicated VRAM, they use a reserved
>> carve out region at the top of system memory. For CPU access to this
>> memory, we kmap the physical address of the carve out region of system
>> memory. You'll need to make sure that region is accessible to the
>> guest.
> So basically, the non-virt flow is: (video?) BIOS reserves memory,
> marks it as reserved in e820, stores the physaddr somewhere, which the
> GPU driver gets.
> Since I suppose this includes the framebuffer, this probably has to
> occur around the moment the driver calls
> drm_aperture_remove_conflicting_pci_framebuffers()
> (which happens before this hw init step), right ?

Well, that's partially correct. The efifb is using the PCIe resources to
access the framebuffer and as far as I know we use that one to kick it
out.

The stolen memory we get over e820/registers is separate to that.

> ... which brings me to a point that's been puzzling me for some time,
> which is that as the hw init fails, the efifb driver is still using
> the framebuffer.

No, it isn't. You are probably just still seeing the same screen.

The issue is most likely that while efi was kicked out nobody
re-programmed the display hardware to show something different.

> Am I right in suspecting that efifb should get stripped of its
> ownership of the fb aperture first, and that if I don't get a black
> screen on hw_init failure that issue should be the first focus point ?

Your assumption with the black screen is incorrect. Since the hardware
works independently, even if you kick out efi you still have the same
screen content, you just can't update it anymore.

But putting efi aside, what Alex pointed out pretty much breaks your
neck trying to forward the device. You maybe could try to hack the
driver to use the PCIe BAR for framebuffer access, but that might be
quite a bit slower.

Regards,
Christian.
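[Editor's sketch] The two CPU access paths discussed here, the physical address of the carve-out versus the PCIe BAR, can be sketched as below. This is a simplified illustration and not the amdgpu implementation: `struct gmc_sketch`, its fields and `pick_cpu_aperture()` are hypothetical stand-ins for the `aper_base`/`aper_size` assignment in the gmc_v9_0.c excerpt quoted elsewhere in the thread, and the `force_bar` flag merely models "removing that chunk of code".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the fields the thread mentions. */
struct gmc_sketch {
    bool     is_apu;          /* mirrors the AMD_IS_APU flag test */
    uint64_t pci_bar_base;    /* CPU address of the VRAM PCI BAR */
    uint64_t pci_bar_size;
    uint64_t carveout_base;   /* physical address the driver reads from hw registers */
    uint64_t real_vram_size;
    uint64_t aper_base;       /* what the CPU ends up mapping for "VRAM" access */
    uint64_t aper_size;
};

/*
 * Default to the PCI BAR; on an APU, switch to the carve-out's physical
 * address unless the caller forces the BAR (the workaround from the thread).
 */
void pick_cpu_aperture(struct gmc_sketch *g, bool force_bar)
{
    assert(g);
    g->aper_base = g->pci_bar_base;
    g->aper_size = g->pci_bar_size;
    if (g->is_apu && !force_bar) {
        g->aper_base = g->carveout_base;
        g->aper_size = g->real_vram_size;
    }
}
```

In a passthrough guest the BAR is the address the hypervisor is expected to map, whereas the carve-out address read from hardware registers is a host physical address; that is presumably why forcing the BAR path was suggested as a workaround.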
Alex On Mon, Dec 13, 2021 at 3:29 PM Alex Deucher wrote: On Sun, Dec 12, 2021 at 5:19 PM Yann Dirson wrote: Alex wrote: On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson wrote: Hi Alex, We have not validated virtualization of our integrated GPUs. I don't know that it will work at all. We had done a bit of testing but ran into the same issues with the PSP, but never had a chance to debug further because this feature is not productized. ... You need a functional PSP to get the GPU driver up and running. Ah, thanks for the hint :) I guess that if I want to have any chance to get the PSP working I'm going to need more details on it. A quick search some time ago mostly brought reverse-engineering work, rather than official AMD doc. Are there some AMD resources I missed ? The driver code is pretty much it. Let's try to shed some more light on how things work, taking as excuse psp_v12_0_ring_create(). First, register access through [RW]REG32_SOC15() is implemented in terms of __[RW]REG32_SOC15_RLC__(), which is basically a [RW]REG32(), except it has to be more complex in the SR-IOV case. Has the RLC anything to do with SR-IOV ? When running the driver on a SR-IOV virtual function (VF), some registers are not available directly via the VF's MMIO aperture so they need to go through the RLC. For bare metal or passthrough this is not relevant. It accesses registers in the MMIO range of the MP0 IP, and the "MP0" name correlates highly with MMIO accesses in PSP-handling code. Is "MP0" another name for PSP (and "MP1" for SMU) ? The MP0 version Yes. reported at v11.0.3 by discovery seems to contradict the use of v12.0 for RENOIR as set by soc15_set_ip_blocks(), or do I miss something ? Typo in the ip discovery table on renoir. More generally (and mostly out of curiosity while we're at it), do we have a way to match IPs listed at discovery time with the ones used in the driver ? In general, barring typos, the code is shared at the major version level. 
The actual code may or may not need changes to handle minor revision changes in an IP. The driver maps the IP versions from the ip discovery table to the code contained in the driver. --- As for the register names, maybe we could have a short explanation of how they are structured ? Eg. mmMP0_SMN_C2PMSG_69: that seems to be a MMIO register named "C2PMSG_69" in the "MP0" IP, but I'm not sure of the "SMN" part -- that could refer to the "System Management Network", described in [0] as an internal bus. Are we accessing this register through this SMN ? These registers are just mailboxes for the PSP firmware. All of the C2PMSG registers functionality is defined by the PSP firmware. They are basically scratch registers used to communicate between the driver and the PSP firmware. On APUs, the PSP is shared with the CPU and the rest of the platform. The GPU driver just interacts with it for a few specific tasks: 1. Loading Trusted Applications (e.g., trusted firmware
Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm
Alex wrote: > Thinking about this more, I think the problem might be related to CPU > access to "VRAM". APUs don't have dedicated VRAM, they use a > reserved > carve out region at the top of system memory. For CPU access to this > memory, we kmap the physical address of the carve out region of > system > memory. You'll need to make sure that region is accessible to the > guest. So basically, the non-virt flow is is: (video?) BIOS reserves memory, marks it as reserved in e820, stores the physaddr somewhere, which the GPU driver gets. Since I suppose this includes the framebuffer, this probably has to occur around the moment the driver calls drm_aperture_remove_conflicting_pci_framebuffers() (which happens before this hw init step), right ? ... which brings me to a point that's been puzzling me for some time, which is that as the hw init fails, the efifb driver is still using the framebuffer. Am I right in suspecting that efifb should get stripped of its ownership of the fb aperture first, and that if I don't get a black screen on hw_init failure that issue should be the first focus point ? > > Alex > > On Mon, Dec 13, 2021 at 3:29 PM Alex Deucher > wrote: > > > > On Sun, Dec 12, 2021 at 5:19 PM Yann Dirson > > wrote: > > > > > > Alex wrote: > > > > On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson > > > > wrote: > > > > > > > > > > Hi Alex, > > > > > > > > > > > We have not validated virtualization of our integrated > > > > > > GPUs. I > > > > > > don't > > > > > > know that it will work at all. We had done a bit of > > > > > > testing but > > > > > > ran > > > > > > into the same issues with the PSP, but never had a chance > > > > > > to > > > > > > debug > > > > > > further because this feature is not productized. > > > > > ... > > > > > > You need a functional PSP to get the GPU driver up and > > > > > > running. 
> > > > > > > > > > Ah, thanks for the hint :) > > > > > > > > > > I guess that if I want to have any chance to get the PSP > > > > > working > > > > > I'm > > > > > going to need more details on it. A quick search some time > > > > > ago > > > > > mostly > > > > > brought reverse-engineering work, rather than official AMD > > > > > doc. > > > > > Are > > > > > there some AMD resources I missed ? > > > > > > > > The driver code is pretty much it. > > > > > > Let's try to shed some more light on how things work, taking as > > > excuse > > > psp_v12_0_ring_create(). > > > > > > First, register access through [RW]REG32_SOC15() is implemented > > > in > > > terms of __[RW]REG32_SOC15_RLC__(), which is basically a > > > [RW]REG32(), > > > except it has to be more complex in the SR-IOV case. > > > Has the RLC anything to do with SR-IOV ? > > > > When running the driver on a SR-IOV virtual function (VF), some > > registers are not available directly via the VF's MMIO aperture so > > they need to go through the RLC. For bare metal or passthrough > > this > > is not relevant. > > > > > > > > It accesses registers in the MMIO range of the MP0 IP, and the > > > "MP0" > > > name correlates highly with MMIO accesses in PSP-handling code. > > > Is "MP0" another name for PSP (and "MP1" for SMU) ? The MP0 > > > version > > > > Yes. > > > > > reported at v11.0.3 by discovery seems to contradict the use of > > > v12.0 > > > for RENOIR as set by soc15_set_ip_blocks(), or do I miss > > > something ? > > > > Typo in the ip discovery table on renoir. > > > > > > > > More generally (and mostly out of curiosity while we're at it), > > > do we > > > have a way to match IPs listed at discovery time with the ones > > > used > > > in the driver ? > > > > In general, barring typos, the code is shared at the major version > > level. The actual code may or may not need changes to handle minor > > revision changes in an IP. 
The driver maps the IP versions from > > the > > ip discovery table to the code contained in the driver. > > > > > > > > --- > > > > > > As for the register names, maybe we could have a short > > > explanation of > > > how they are structured ? Eg. mmMP0_SMN_C2PMSG_69: that seems to > > > be > > > a MMIO register named "C2PMSG_69" in the "MP0" IP, but I'm not > > > sure > > > of the "SMN" part -- that could refer to the "System Management > > > Network", > > > described in [0] as an internal bus. Are we accessing this > > > register > > > through this SMN ? > > > > These registers are just mailboxes for the PSP firmware. All of > > the > > C2PMSG registers functionality is defined by the PSP firmware. > > They > > are basically scratch registers used to communicate between the > > driver > > and the PSP firmware. > > > > > > > > > > > > On APUs, the PSP is shared with > > > > the CPU and the rest of the platform. The GPU driver just > > > > interacts > > > > with it for a few specific tasks: > > > > 1. Loading Trusted Applications (e.g., trusted firmware > > > > applications > > > > that run on the PSP for specific functionality, e.g., HDCP and > > > > content > > > > protection, etc.) >
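The non-virt flow described above (firmware reserves the carve out, marks it as reserved in e820, and the driver later learns its physical address) suggests a simple sanity check when debugging a guest. The sketch below is only an illustration under stated assumptions: the E820Entry type, the carve_out_is_reserved() helper and every address are made up for the example, and nothing here is taken from the amdgpu driver or from Xen.

```python
from typing import Iterable, NamedTuple


class E820Entry(NamedTuple):
    """One entry of an e820-style memory map (hypothetical representation)."""
    start: int  # first byte of the range
    end: int    # last byte of the range, inclusive
    kind: str   # e.g. "usable" or "reserved"


def carve_out_is_reserved(entries: Iterable[E820Entry], base: int, size: int) -> bool:
    """Return True if [base, base + size) lies entirely inside one reserved entry."""
    if size <= 0:
        raise ValueError("size must be positive")
    last = base + size - 1
    return any(
        e.kind == "reserved" and e.start <= base and last <= e.end
        for e in entries
    )


# Made-up memory map, for illustration only.
EXAMPLE_MAP = [
    E820Entry(0x00000000, 0x0009FFFF, "usable"),
    E820Entry(0x00100000, 0xBFFFFFFF, "usable"),
    E820Entry(0xC0000000, 0xDFFFFFFF, "reserved"),
]
```

Running the same check against the guest's memory map, with the base address the guest driver reads back from the hardware, would show whether the region the driver expects is present in the guest at all; whether it is also mapped to the right host memory is a separate question that this sketch does not cover.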
Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm
Thinking about this more, I think the problem might be related to CPU access to "VRAM". APUs don't have dedicated VRAM, they use a reserved carve out region at the top of system memory. For CPU access to this memory, we kmap the physical address of the carve out region of system memory. You'll need to make sure that region is accessible to the guest. Alex On Mon, Dec 13, 2021 at 3:29 PM Alex Deucher wrote: > > On Sun, Dec 12, 2021 at 5:19 PM Yann Dirson wrote: > > > > Alex wrote: > > > On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson wrote: > > > > > > > > Hi Alex, > > > > > > > > > We have not validated virtualization of our integrated GPUs. I > > > > > don't > > > > > know that it will work at all. We had done a bit of testing but > > > > > ran > > > > > into the same issues with the PSP, but never had a chance to > > > > > debug > > > > > further because this feature is not productized. > > > > ... > > > > > You need a functional PSP to get the GPU driver up and running. > > > > > > > > Ah, thanks for the hint :) > > > > > > > > I guess that if I want to have any chance to get the PSP working > > > > I'm > > > > going to need more details on it. A quick search some time ago > > > > mostly > > > > brought reverse-engineering work, rather than official AMD doc. > > > > Are > > > > there some AMD resources I missed ? > > > > > > The driver code is pretty much it. > > > > Let's try to shed some more light on how things work, taking as excuse > > psp_v12_0_ring_create(). > > > > First, register access through [RW]REG32_SOC15() is implemented in > > terms of __[RW]REG32_SOC15_RLC__(), which is basically a [RW]REG32(), > > except it has to be more complex in the SR-IOV case. > > Has the RLC anything to do with SR-IOV ? > > When running the driver on a SR-IOV virtual function (VF), some > registers are not available directly via the VF's MMIO aperture so > they need to go through the RLC. For bare metal or passthrough this > is not relevant. 
> > > > > It accesses registers in the MMIO range of the MP0 IP, and the "MP0" > > name correlates highly with MMIO accesses in PSP-handling code. > > Is "MP0" another name for PSP (and "MP1" for SMU) ? The MP0 version > > Yes. > > > reported at v11.0.3 by discovery seems to contradict the use of v12.0 > > for RENOIR as set by soc15_set_ip_blocks(), or do I miss something ? > > Typo in the ip discovery table on renoir. > > > > > More generally (and mostly out of curiosity while we're at it), do we > > have a way to match IPs listed at discovery time with the ones used > > in the driver ? > > In general, barring typos, the code is shared at the major version > level. The actual code may or may not need changes to handle minor > revision changes in an IP. The driver maps the IP versions from the > ip discovery table to the code contained in the driver. > > > > > --- > > > > As for the register names, maybe we could have a short explanation of > > how they are structured ? Eg. mmMP0_SMN_C2PMSG_69: that seems to be > > a MMIO register named "C2PMSG_69" in the "MP0" IP, but I'm not sure > > of the "SMN" part -- that could refer to the "System Management Network", > > described in [0] as an internal bus. Are we accessing this register > > through this SMN ? > > These registers are just mailboxes for the PSP firmware. All of the > C2PMSG registers functionality is defined by the PSP firmware. They > are basically scratch registers used to communicate between the driver > and the PSP firmware. > > > > > > > > On APUs, the PSP is shared with > > > the CPU and the rest of the platform. The GPU driver just interacts > > > with it for a few specific tasks: > > > 1. Loading Trusted Applications (e.g., trusted firmware applications > > > that run on the PSP for specific functionality, e.g., HDCP and > > > content > > > protection, etc.) > > > 2. Validating and loading firmware for other engines on the SoC. > > > This > > > is required to use those engines. 
> > > > Trying to understand in more details how we start the PSP up, I noticed > > that psp_v12_0 has support for loading a sOS firmware, but never calls > > init_sos_microcode() - and anyway there is no sos firmware for renoir > > and green_sardine, which seem to be the only ASICs with this PSP version. > > Is it something that's just not been completely wired up yet ? > > On APUs, the PSP is shared with the CPU so the PSP firmware is part of > the sbios image. The driver doesn't load it. We only load it on > dGPUs where the driver is responsible for the chip initialization. > > > > > That also rings a bell, that we have nothing about Secure OS in the doc > > yet (not even the acronym in the glossary). > > > > > > > I'm not too familiar with the PSP's path to memory from the GPU > > > perspective. IIRC, most memory used by the PSP goes through carve > > > out > > > "vram" on APUs so it should work, but I would double check if there > > > are any system memory allocations that used to interact with the PSP > > >
Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm
On Sun, Dec 12, 2021 at 5:19 PM Yann Dirson wrote: > > Alex wrote: > > On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson wrote: > > > > > > Hi Alex, > > > > > > > We have not validated virtualization of our integrated GPUs. I > > > > don't > > > > know that it will work at all. We had done a bit of testing but > > > > ran > > > > into the same issues with the PSP, but never had a chance to > > > > debug > > > > further because this feature is not productized. > > > ... > > > > You need a functional PSP to get the GPU driver up and running. > > > > > > Ah, thanks for the hint :) > > > > > > I guess that if I want to have any chance to get the PSP working > > > I'm > > > going to need more details on it. A quick search some time ago > > > mostly > > > brought reverse-engineering work, rather than official AMD doc. > > > Are > > > there some AMD resources I missed ? > > > > The driver code is pretty much it. > > Let's try to shed some more light on how things work, taking as excuse > psp_v12_0_ring_create(). > > First, register access through [RW]REG32_SOC15() is implemented in > terms of __[RW]REG32_SOC15_RLC__(), which is basically a [RW]REG32(), > except it has to be more complex in the SR-IOV case. > Has the RLC anything to do with SR-IOV ? When running the driver on a SR-IOV virtual function (VF), some registers are not available directly via the VF's MMIO aperture so they need to go through the RLC. For bare metal or passthrough this is not relevant. > > It accesses registers in the MMIO range of the MP0 IP, and the "MP0" > name correlates highly with MMIO accesses in PSP-handling code. > Is "MP0" another name for PSP (and "MP1" for SMU) ? The MP0 version Yes. > reported at v11.0.3 by discovery seems to contradict the use of v12.0 > for RENOIR as set by soc15_set_ip_blocks(), or do I miss something ? Typo in the ip discovery table on renoir. 
> > More generally (and mostly out of curiosity while we're at it), do we > have a way to match IPs listed at discovery time with the ones used > in the driver ? In general, barring typos, the code is shared at the major version level. The actual code may or may not need changes to handle minor revision changes in an IP. The driver maps the IP versions from the ip discovery table to the code contained in the driver. > > --- > > As for the register names, maybe we could have a short explanation of > how they are structured ? Eg. mmMP0_SMN_C2PMSG_69: that seems to be > a MMIO register named "C2PMSG_69" in the "MP0" IP, but I'm not sure > of the "SMN" part -- that could refer to the "System Management Network", > described in [0] as an internal bus. Are we accessing this register > through this SMN ? These registers are just mailboxes for the PSP firmware. All of the C2PMSG registers functionality is defined by the PSP firmware. They are basically scratch registers used to communicate between the driver and the PSP firmware. > > > > On APUs, the PSP is shared with > > the CPU and the rest of the platform. The GPU driver just interacts > > with it for a few specific tasks: > > 1. Loading Trusted Applications (e.g., trusted firmware applications > > that run on the PSP for specific functionality, e.g., HDCP and > > content > > protection, etc.) > > 2. Validating and loading firmware for other engines on the SoC. > > This > > is required to use those engines. > > Trying to understand in more details how we start the PSP up, I noticed > that psp_v12_0 has support for loading a sOS firmware, but never calls > init_sos_microcode() - and anyway there is no sos firmware for renoir > and green_sardine, which seem to be the only ASICs with this PSP version. > Is it something that's just not been completely wired up yet ? On APUs, the PSP is shared with the CPU so the PSP firmware is part of the sbios image. The driver doesn't load it. 
We only load it on dGPUs where the driver is responsible for the chip initialization. > > That also rings a bell, that we have nothing about Secure OS in the doc > yet (not even the acronym in the glossary). > > > > I'm not too familiar with the PSP's path to memory from the GPU > > perspective. IIRC, most memory used by the PSP goes through carve > > out > > "vram" on APUs so it should work, but I would double check if there > > are any system memory allocations that used to interact with the PSP > > and see if changing them to vram helps. It does work with the IOMMU > > enabled on bare metal, so it should work in passthrough as well in > > theory. > > I can see a single case in the PSP code where GTT is used instead of > vram: to create fw_pri_bo when SR-IOV is not used (and there has > to be a reason, since the SR-IOV code path does use vram). > Changing it to vram does not make a difference, but then the > only bo that seems to be used at that point is the one for the psp ring, > which is allocated in vram, so I'm not too much surprised. > > Maybe I should double-check bo_create calls to hunt for more ? We looked
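The major-version matching mentioned in this message can be sketched as a lookup keyed on (IP name, major version). This is a minimal illustration, not the driver's actual logic: the HANDLERS table, parse_version() and pick_handler() are hypothetical names, and the two table entries are only labels chosen to mirror the MP0 v11.0.3 versus v12.0 question discussed above.

```python
from typing import Dict, Optional, Tuple

# Hypothetical table: (IP name, major version) -> label of the code handling it.
HANDLERS: Dict[Tuple[str, int], str] = {
    ("MP0", 11): "psp_v11_0",
    ("MP0", 12): "psp_v12_0",
}


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse 'v11.0.3' or '11.0.3' into (major, minor, revision)."""
    major, minor, rev = (int(part) for part in text.lstrip("v").split("."))
    return major, minor, rev


def pick_handler(ip: str, version: str) -> Optional[str]:
    """Select a handler by IP name and major version only; minor/revision are ignored."""
    major, _minor, _rev = parse_version(version)
    return HANDLERS.get((ip, major))
```

With a lookup like this, a discovery table that reports the wrong major version selects the wrong code, which is why the typo in the Renoir table matters for anything that trusts the discovered version.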
Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm
Alex wrote: > On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson wrote: > > > > Hi Alex, > > > > > We have not validated virtualization of our integrated GPUs. I > > > don't > > > know that it will work at all. We had done a bit of testing but > > > ran > > > into the same issues with the PSP, but never had a chance to > > > debug > > > further because this feature is not productized. > > ... > > > You need a functional PSP to get the GPU driver up and running. > > > > Ah, thanks for the hint :) > > > > I guess that if I want to have any chance to get the PSP working > > I'm > > going to need more details on it. A quick search some time ago > > mostly > > brought reverse-engineering work, rather than official AMD doc. > > Are > > there some AMD resources I missed ? > > The driver code is pretty much it. Let's try to shed some more light on how things work, taking as excuse psp_v12_0_ring_create(). First, register access through [RW]REG32_SOC15() is implemented in terms of __[RW]REG32_SOC15_RLC__(), which is basically a [RW]REG32(), except it has to be more complex in the SR-IOV case. Has the RLC anything to do with SR-IOV ? It accesses registers in the MMIO range of the MP0 IP, and the "MP0" name correlates highly with MMIO accesses in PSP-handling code. Is "MP0" another name for PSP (and "MP1" for SMU) ? The MP0 version reported at v11.0.3 by discovery seems to contradict the use of v12.0 for RENOIR as set by soc15_set_ip_blocks(), or do I miss something ? More generally (and mostly out of curiosity while we're at it), do we have a way to match IPs listed at discovery time with the ones used in the driver ? --- As for the register names, maybe we could have a short explanation of how they are structured ? Eg. mmMP0_SMN_C2PMSG_69: that seems to be a MMIO register named "C2PMSG_69" in the "MP0" IP, but I'm not sure of the "SMN" part -- that could refer to the "System Management Network", described in [0] as an internal bus. Are we accessing this register through this SMN ? 
> On APUs, the PSP is shared with > the CPU and the rest of the platform. The GPU driver just interacts > with it for a few specific tasks: > 1. Loading Trusted Applications (e.g., trusted firmware applications > that run on the PSP for specific functionality, e.g., HDCP and > content > protection, etc.) > 2. Validating and loading firmware for other engines on the SoC. > This > is required to use those engines. Trying to understand in more details how we start the PSP up, I noticed that psp_v12_0 has support for loading a sOS firmware, but never calls init_sos_microcode() - and anyway there is no sos firmware for renoir and green_sardine, which seem to be the only ASICs with this PSP version. Is it something that's just not been completely wired up yet ? That also rings a bell, that we have nothing about Secure OS in the doc yet (not even the acronym in the glossary). > I'm not too familiar with the PSP's path to memory from the GPU > perspective. IIRC, most memory used by the PSP goes through carve > out > "vram" on APUs so it should work, but I would double check if there > are any system memory allocations that used to interact with the PSP > and see if changing them to vram helps. It does work with the IOMMU > enabled on bare metal, so it should work in passthrough as well in > theory. I can see a single case in the PSP code where GTT is used instead of vram: to create fw_pri_bo when SR-IOV is not used (and there has to be a reason, since the SR-IOV code path does use vram). Changing it to vram does not make a difference, but then the only bo that seems to be used at that point is the one for the psp ring, which is allocated in vram, so I'm not too much surprised. Maybe I should double-check bo_create calls to hunt for more ? [0] https://github.com/PSPReverse/psp-docs/blob/master/masterthesis-eichner-psp-2020.pdf
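As a reading aid for the register-naming question above, the sketch below splits a name such as mmMP0_SMN_C2PMSG_69 into the parts guessed at in this message: an "mm" prefix, the IP name, a middle component, the register name and a trailing index. It only encodes that guess. RegisterName and split_register_name() are hypothetical helpers, not anything defined by the amdgpu driver, and the heuristic will misread names that do not follow this shape.

```python
from typing import NamedTuple, Optional


class RegisterName(NamedTuple):
    ip: str               # hardware IP the register belongs to, e.g. "MP0"
    infix: Optional[str]  # middle component, e.g. "SMN" (meaning unconfirmed)
    name: str             # register name proper, e.g. "C2PMSG"
    index: Optional[int]  # trailing number, e.g. 69


def split_register_name(symbol: str) -> RegisterName:
    """Heuristically split an 'mm'-prefixed register symbol into its parts."""
    if not symbol.startswith("mm"):
        raise ValueError("expected a name starting with 'mm'")
    parts = symbol[2:].split("_")
    if len(parts) < 2:
        raise ValueError("expected at least an IP and a register name")
    ip, rest = parts[0], parts[1:]
    index = None
    if len(rest) > 1 and rest[-1].isdigit():
        index = int(rest.pop())                     # trailing numeric component
    infix = rest.pop(0) if len(rest) > 1 else None  # first of several remaining parts
    return RegisterName(ip, infix, "_".join(rest), index)
```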
Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm
On Wed, Dec 8, 2021 at 5:50 PM Yann Dirson wrote: > > Hi Alex, > > > > > On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson wrote: > > > > > > Hi Alex, > > > > > > > We have not validated virtualization of our integrated GPUs. I > > > > don't > > > > know that it will work at all. We had done a bit of testing but > > > > ran > > > > into the same issues with the PSP, but never had a chance to > > > > debug > > > > further because this feature is not productized. > > > ... > > > > You need a functional PSP to get the GPU driver up and running. > > > > > > Ah, thanks for the hint :) > > > > > > I guess that if I want to have any chance to get the PSP working > > > I'm > > > going to need more details on it. A quick search some time ago > > > mostly > > > brought reverse-engineering work, rather than official AMD doc. > > > Are > > > there some AMD resources I missed ? > > > > The driver code is pretty much it. On APUs, the PSP is shared with > > the CPU and the rest of the platform. The GPU driver just interacts > > with it for a few specific tasks: > > 1. Loading Trusted Applications (e.g., trusted firmware applications > > that run on the PSP for specific functionality, e.g., HDCP and > > content > > protection, etc.) > > 2. Validating and loading firmware for other engines on the SoC. > > This > > is required to use those engines. > > After some digging, if I understand correctly, the PSP is the 3rd IP > getting its hw_init() called. First comes soc15_common, then vega10_ih. > > - soc15_common_init_hw does some writes through nbio_v7.0 functions, > but does not query the hw to check before or after > - vega10_init_hw does some register reads as part of its work, but once > it has written it does not check either > > So PSP is the first one to check that "soc15" (I'm still not sure what > this one represents, really) is in fact alive and well. > > Can't we check earlier that the chip is really listening to us ? 
Each SoC is made up of hardware blocks that provide various different functionality. They are mostly independent and mostly initialized independently. I'm not sure what you would want to check. In your case, I don't think it's an issue of the chip not being functional overall, but rather a problem specific to the failing block somehow related to being in a virtualized environment. Alex > > > > > I'm not too familiar with the PSP's path to memory from the GPU > > perspective. IIRC, most memory used by the PSP goes through carve > > out > > "vram" on APUs so it should work, but I would double check if there > > are any system memory allocations that used to interact with the PSP > > and see if changing them to vram helps. It does work with the IOMMU > > enabled on bare metal, so it should work in passthrough as well in > > theory. > > > > Alex > > > > > > > > > > Best regards, > > > -- > > > Yann > >
Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm
Hi Alex, > > On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson wrote: > > > > Hi Alex, > > > > > We have not validated virtualization of our integrated GPUs. I > > > don't > > > know that it will work at all. We had done a bit of testing but > > > ran > > > into the same issues with the PSP, but never had a chance to > > > debug > > > further because this feature is not productized. > > ... > > > You need a functional PSP to get the GPU driver up and running. > > > > Ah, thanks for the hint :) > > > > I guess that if I want to have any chance to get the PSP working > > I'm > > going to need more details on it. A quick search some time ago > > mostly > > brought reverse-engineering work, rather than official AMD doc. > > Are > > there some AMD resources I missed ? > > The driver code is pretty much it. On APUs, the PSP is shared with > the CPU and the rest of the platform. The GPU driver just interacts > with it for a few specific tasks: > 1. Loading Trusted Applications (e.g., trusted firmware applications > that run on the PSP for specific functionality, e.g., HDCP and > content > protection, etc.) > 2. Validating and loading firmware for other engines on the SoC. > This > is required to use those engines. After some digging, if I understand correctly, the PSP is the 3rd IP getting its hw_init() called. First comes soc15_common, then vega10_ih. - soc15_common_init_hw does some writes through nbio_v7.0 functions, but does not query the hw to check before or after - vega10_init_hw does some register reads as part of its work, but once it has written it does not check either So PSP is the first one to check that "soc15" (I'm still not sure what this one represents, really) is in fact alive and well. Can't we check earlier that the chip is really listening to us ? > > I'm not too familiar with the PSP's path to memory from the GPU > perspective. 
IIRC, most memory used by the PSP goes through carve > out > "vram" on APUs so it should work, but I would double check if there > are any system memory allocations that used to interact with the PSP > and see if changing them to vram helps. It does work with the IOMMU > enabled on bare metal, so it should work in passthrough as well in > theory. > > Alex > > > > > > Best regards, > > -- > > Yann >
Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm
On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson wrote:
>
> Hi Alex,
>
> > We have not validated virtualization of our integrated GPUs. I don't
> > know that it will work at all. We had done a bit of testing but ran
> > into the same issues with the PSP, but never had a chance to debug
> > further because this feature is not productized.
> ...
> > You need a functional PSP to get the GPU driver up and running.
>
> Ah, thanks for the hint :)
>
> I guess that if I want to have any chance to get the PSP working I'm
> going to need more details on it. A quick search some time ago mostly
> brought reverse-engineering work, rather than official AMD doc. Are
> there some AMD resources I missed ?

The driver code is pretty much it. On APUs, the PSP is shared with the CPU and the rest of the platform. The GPU driver just interacts with it for a few specific tasks:
1. Loading Trusted Applications (e.g., trusted firmware applications that run on the PSP for specific functionality, e.g., HDCP and content protection, etc.)
2. Validating and loading firmware for other engines on the SoC. This is required to use those engines.

I'm not too familiar with the PSP's path to memory from the GPU perspective. IIRC, most memory used by the PSP goes through carve out "vram" on APUs so it should work, but I would double check if there are any system memory allocations that used to interact with the PSP and see if changing them to vram helps. It does work with the IOMMU enabled on bare metal, so it should work in passthrough as well in theory.

Alex

>
> Best regards,
> --
> Yann
Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm
Hi Alex,

> We have not validated virtualization of our integrated GPUs. I don't
> know that it will work at all. We had done a bit of testing but ran
> into the same issues with the PSP, but never had a chance to debug
> further because this feature is not productized.
...
> You need a functional PSP to get the GPU driver up and running.

Ah, thanks for the hint :)

I guess that if I want to have any chance to get the PSP working I'm going to need more details on it. A quick search some time ago mostly brought reverse-engineering work, rather than official AMD doc. Are there some AMD resources I missed ?

Best regards,
--
Yann
Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm
On Sat, Nov 27, 2021 at 11:28 AM wrote: > > Hello, > > Xen passthrough of a boot GPU those days (at least in the small QubesOS world) > is mostly tested/documented for Intel iGPUs (or I missed something). > I've been trying to do that with a Renoir GPU (for context, the goal is > to have a xen domU dedicated to the GUI [3]). I won't go into all the details > of my successive attempts in this email, various (relative) progress reports > are > available at [0] (there are other things to be investigated listed there, but > at least some of them can possibly wait). And I have surely missed more than > a couple of key points. We have not validated virtualization of our integrated GPUs. I don't know that it will work at all. We had done a bit of testing but ran into the same issues with the PSP, but never had a chance to debug further because this feature is not productized. > > Summary of the setup: > - GPU protected from dom0 driver using pci-stub (gets access to the GPU via > efifb > until hopefully the GUI domain seizes it) > - host is Xen 4.14, dom0 uses Linux 5.14 (Qubes' kernel-latest) > - guest is a Xen HVM with running in a stub domain, launched through > libvirt/libxl > - hackish enablement of the IGD passthrough codepaths through > - libxl PCI VID hack: > https://github.com/ydirson/xen/commit/4c9d4cb5c3dc1282ba83f17d15072c197b60281c > - qemu BDF hack: > https://github.com/ydirson/qemu/commit/6a165467e25864f1ae17390a44a9c1425ba67aed > > The first problem encountered, i.e. that the guest amdgpu driver was not able > to access the PCI expansion ROM, I have hacked around for now by letting the > driver load as firmware a copy of the ROMdriver [1] - this was a 5.14.15 > kernel > with the QubesOS patches (all reachable from this commit). 
> > Doing this seems to make the driver happy on this aspect, but several issues > now become visible, and after some digging I feel some insights from people > familiar with the code gets really necessary :) > > The first problems are shown below as [T0], my interpretation being: > 1. Xorg aborts (audit: type=1701) -- should find a way to get more details, > but >that is surely not the root cause of what follows > 2. a PSP command fails -- I cannot find any AMD documentation on how PSP > works, >that could possibly help > 3. the PSP fails to load some firmware as part of its own init -- here I'm > quite >uncomfortable, I thought of the PSP as being distinct from the cpu cores > and >gpu, but here it appears as a disting IP *within* the gpu. I also failed > to >find any detailed description of the whole stuff and their interactions. > 4. following this failure the driver finishes (while initialization was still >ongoing) You need a functional PSP to get the GPU driver up and running. > 5. then vcn_v2_0_sw_fini() triggers a bad memory access, which appeared to be >while dereferencing adev->vcn.inst->fw_shared_cpu_addr. > > After adding traces on the individual IPs init/fini [2] showed that the vcn > sw_init was indeed run, and likely initialized this pointer. Any idea how > it became invalid ? One track I briefly followed was that some of the IP > init appears to be asynchronous (the failure in PSP init occurs after later > IPs get initialized), but that pointer seems to be initialized early and > synchronously by VCN sw_init. > > > Then, to workaround the problem with PSP not being able to initialized, I used > fw_load_type=0 to use direct loading (and noted that fw_load_type=1, > advertised > as loading firmware using SMU, just does not do anything in the code). That will not work on modern GPUs. The PSP is required for firmware loading. Without firmware the various engines on the GPU (GFX, compute, VCN) won't work. 
>
> This, using 5.15.4 at this time, resulted in trace [T1]. The error
> surfacing now is "ring kiq_2.1.0 test failed" with a timeout. I had to
> dig through the kernel commit messages to discover that KIQ is a Kernel
> Interface Queue, and there are various other acronyms around this (e.g.
> "eop", whose introduction seems older than the landing of the driver in
> the kernel) which really make it hard to be efficient at understanding
> the code. Will gladly be enlightened :)
>
> And this also ends with the VCN sw_fini going fireworks, and a quick look
> at the assembler seems to hint that although the code changed a bit, it
> is still the same statement crashing.
>
> Also noticed that ip_block_mask=0xfff7 to disable the PSP on this ASIC
> will do slightly different things, but end up with the same errors.
>
>
> I will gladly take any suggestion, pointers to additional information,
> etc :)

PSP is fundamental to the operation of the GPU.

Alex

>
> Best regards,
> --
> Yann
>
>
> [0] https://forum.qubes-os.org/t/amd-igpu-passthrough-attempt/6766/
> [1]
> https://github.com/ydirson/linux/commit/4ca50829aa44b29e8428328e913a0546568bf1c0
> [2]
>
Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm
Hello,

Xen passthrough of a boot GPU these days (at least in the small QubesOS
world) is mostly tested/documented for Intel iGPUs (or I missed something).
I've been trying to do that with a Renoir GPU (for context, the goal is
to have a xen domU dedicated to the GUI [3]). I won't go into all the
details of my successive attempts in this email; various (relative)
progress reports are available at [0] (there are other things to be
investigated listed there, but at least some of them can possibly wait).
And I have surely missed more than a couple of key points.

Summary of the setup:
- GPU protected from dom0 driver using pci-stub (gets access to the GPU
  via efifb until hopefully the GUI domain seizes it)
- host is Xen 4.14, dom0 uses Linux 5.14 (Qubes' kernel-latest)
- guest is a Xen HVM with running in a stub domain, launched through
  libvirt/libxl
- hackish enablement of the IGD passthrough codepaths through
  - libxl PCI VID hack:
    https://github.com/ydirson/xen/commit/4c9d4cb5c3dc1282ba83f17d15072c197b60281c
  - qemu BDF hack:
    https://github.com/ydirson/qemu/commit/6a165467e25864f1ae17390a44a9c1425ba67aed

The first problem encountered, i.e. that the guest amdgpu driver was not
able to access the PCI expansion ROM, I have hacked around for now by
letting the driver load as firmware a copy of the ROM (driver [1] - this
was a 5.14.15 kernel with the QubesOS patches, all reachable from this
commit).

Doing this seems to make the driver happy on this aspect, but several
issues now become visible, and after some digging I feel some insights
from people familiar with the code are really necessary :)

The first problems are shown below as [T0], my interpretation being:
1. Xorg aborts (audit: type=1701) -- should find a way to get more
   details, but that is surely not the root cause of what follows
2. a PSP command fails -- I cannot find any AMD documentation on how PSP
   works, that could possibly help
3. the PSP fails to load some firmware as part of its own init -- here
   I'm quite uncomfortable, I thought of the PSP as being distinct from
   the cpu cores and gpu, but here it appears as a distinct IP *within*
   the gpu. I also failed to find any detailed description of the whole
   stuff and their interactions.
4. following this failure the driver finishes (while initialization was
   still ongoing)
5. then vcn_v2_0_sw_fini() triggers a bad memory access, which appeared
   to be while dereferencing adev->vcn.inst->fw_shared_cpu_addr.

Adding traces on the individual IPs init/fini [2] showed that the vcn
sw_init was indeed run, and likely initialized this pointer. Any idea how
it became invalid ? One track I briefly followed was that some of the IP
init appears to be asynchronous (the failure in PSP init occurs after
later IPs get initialized), but that pointer seems to be initialized early
and synchronously by VCN sw_init.

Then, to work around the problem with PSP not being able to initialize,
I used fw_load_type=0 to use direct loading (and noted that
fw_load_type=1, advertised as loading firmware using SMU, just does not
do anything in the code).

This, using 5.15.4 at this time, resulted in trace [T1]. The error
surfacing now is "ring kiq_2.1.0 test failed" with a timeout. I had to
dig through the kernel commit messages to discover that KIQ is a Kernel
Interface Queue, and there are various other acronyms around this (e.g.
"eop", whose introduction seems older than the landing of the driver in
the kernel) which really make it hard to be efficient at understanding
the code. Will gladly be enlightened :)

And this also ends with the VCN sw_fini going fireworks, and a quick look
at the assembler seems to hint that although the code changed a bit, it
is still the same statement crashing.

Also noticed that ip_block_mask=0xfff7 to disable the PSP on this ASIC
will do slightly different things, but end up with the same errors.
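
For readers unfamiliar with the ip_block_mask parameter: it is a bitmask in
which a set bit keeps the corresponding IP block enabled, so 0xfff7 clears a
single bit. The sketch below only decodes the mask. The helper name
disabled_ip_blocks and the 16-bit width are illustrative assumptions, and
which IP block sits at a given index depends on the ASIC (the message above
reports that it is the PSP on this one).

```python
def disabled_ip_blocks(mask, num_blocks=16):
    """Return the indices of the IP blocks whose bit is cleared in `mask`.

    `num_blocks` is an assumption made for this illustration: only the low
    16 bits are inspected, matching the width of the value 0xfff7.
    """
    return [i for i in range(num_blocks) if not (mask >> i) & 1]


if __name__ == "__main__":
    # 0xfff7 is 0b1111_1111_1111_0111: every bit set except bit 3.
    print(disabled_ip_blocks(0xfff7))
```

The index-to-block mapping is not fixed across ASICs, so the number printed
here should be checked against the IP block list the driver reports for the
device in question.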

I will gladly take any suggestion, pointers to additional information,
etc :)

Best regards,
--
Yann


[0] https://forum.qubes-os.org/t/amd-igpu-passthrough-attempt/6766/
[1] https://github.com/ydirson/linux/commit/4ca50829aa44b29e8428328e913a0546568bf1c0
[2] https://github.com/ydirson/linux/commit/87004f9542b9a80b4fb838697312778cf47e4146
[3] https://www.qubes-os.org/news/2020/03/18/gui-domain/#gpu-passthrough-the-perfect-world-desktop-solution

[T0]
[2021-11-23 21:05:52] [4.297684] amdgpu :00:05.0: amdgpu: Fetched VBIOS from firmware file
[2021-11-23 21:05:52] [4.297709] amdgpu: ATOM BIOS: 113-RENOIR-025
[2021-11-23 21:05:52] [4.302046] [drm] VCN decode is enabled in VM mode
[2021-11-23 21:05:52] [4.302066] [drm] VCN encode is enabled in VM mode
[2021-11-23 21:05:52] [4.302078] [drm] JPEG decode is enabled in VM mode
[2021-11-23 21:05:52] [4.302144] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[2021-11-23 21:05:52] [4.302181] amdgpu :00:05.0: amdgpu: VRAM: 512M 0x00F4 -