Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2022-01-09 Thread Yann Dirson
Alex wrote:
> On Thu, Jan 6, 2022 at 10:38 AM Yann Dirson  wrote:
> >
> > Alex wrote:
> > > > How is the stolen memory communicated to the driver ?  That
> > > > host
> > > > physical
> > > > memory probably has to be mapped at the same guest physical
> > > > address
> > > > for
> > > > the magic to work, right ?
> > >
> > > Correct.  The driver reads the physical location of that memory
> > > from
> > > hardware registers.  Removing this chunk of code from gmc_v9_0.c
> > > will
> > > force the driver to use the BAR,
> >
> > That would only be a workaround for a missing mapping of stolen
> > memory to the guest, right ?
> 
> 
> Correct. That will use the PCI BAR rather than the underlying
> physical
> memory for CPU access to the carve out region.
> 
> >
> >
> > > but I'm not sure if there are any
> > > other places in the driver that make assumptions about using the
> > > physical host address or not on APUs off hand.
> >
> > gmc_v9_0_vram_gtt_location() updates vm_manager.vram_base_offset
> > from
> > the same value.  I'm not sure I understand why in this case there
> > is
> > no reason to use the BAR while there are some in
> > gmc_v9_0_mc_init().
> >
> > vram_base_offset then gets used in several places:
> >
> > * amdgpu_gmc_init_pdb0, that seems likely enough to be problematic,
> >   right ?
> > >   As a sidenote the XGMI offset added earlier gets subtracted
> >   here to deduce vram base addr
> >   (a couple of new acronyms there: PDB, PDE -- page directory
> >   base/entry?)
> >
> > * amdgpu_ttm_map_buffer, amdgpu_vm_bo_update_mapping: those seem to
> > be
> >   as problematic
> >
> > * amdgpu_gmc_vram_mc2pa: until I got there I had assumed MC could
> > stand for
> >   "memory controller", but then "MC address of buffer" makes me
> >   doubt
> >
> >
> 
> MC = memory controller (as in graphics memory controller).
> 
> These are GPU addresses not CPU addresses so they should be fine.
> 
> > >
> > > if ((adev->flags & AMD_IS_APU) ||
> > > (adev->gmc.xgmi.supported &&
> > >  adev->gmc.xgmi.connected_to_cpu)) {
> > > adev->gmc.aper_base =
> > > adev->gfxhub.funcs->get_mc_fb_offset(adev)
> > > +
> > > adev->gmc.xgmi.physical_node_id *
> > > adev->gmc.xgmi.node_segment_size;
> > > adev->gmc.aper_size = adev->gmc.real_vram_size;
> > > }
> >
> >
> > Now for the test... it does indeed seem to go much further, I even
> > lose the dom0's efifb to that black screen, hopefully showing the
> > driver started to set up the hardware.  Will probably still have to
> > hunt down whether it still tries to use efifb afterwards (can't see
> > why it would not, TBH, given the previous behaviour where it kept
> > using it after the guest failed to start).
> >
> > The log shows many details about TMR loading
> >
> > Then as expected:
> >
> > [2022-01-06 15:16:09] <6>[5.844589] amdgpu :00:05.0:
> > amdgpu: RAP: optional rap ta ucode is not available
> > [2022-01-06 15:16:09] <6>[5.844619] amdgpu :00:05.0:
> > amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
> > [2022-01-06 15:16:09] <7>[5.844639]
> > [drm:amdgpu_device_init.cold [amdgpu]] hw_init (phase2) of IP
> > block ...
> > [2022-01-06 15:16:09] <6>[5.845515] amdgpu :00:05.0:
> > amdgpu: SMU is initialized successfully!
> >
> >
> > not sure about that unhandled interrupt (and a bit worried about
> > messed-up logs):
> >
> > [2022-01-06 15:16:09] <7>[6.010681] amdgpu :00:05.0:
> > [drm:amdgpu_ring_test_hel[2022-01-06 15:16:10] per [amdgpu]] ring
> > test on sdma0 succeeded
> > [2022-01-06 15:16:10] <7>[6.010831] [drm:amdgpu_ih_process
> > [amdgpu]] amdgpu_ih_process: rptr 0, wptr 32
> > [2022-01-06 15:16:10] <7>[6.011002] [drm:amdgpu_irq_dispatch
> > [amdgpu]] Unhandled interrupt src_id: 243
> >
> >
> > then comes a first error:
> >
> > [2022-01-06 15:16:10] <6>[6.011785] [drm] Display Core
> > initialized with v3.2.149!
> > [2022-01-06 15:16:10] <6>[6.012714] [drm] DMUB hardware
> > initialized: version=0x0101001C
> > [2022-01-06 15:16:10] <3>[6.228263] [drm:dc_dmub_srv_wait_idle
> > [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
> > [2022-01-06 15:16:10] <7>[6.229125]
> > [drm:amdgpu_dm_init.isra.0.cold [amdgpu]] amdgpu: freesync_module
> > init done 76c7b459.
> > [2022-01-06 15:16:10] <7>[6.229677]
> > [drm:amdgpu_dm_init.isra.0.cold [amdgpu]] amdgpu: hdcp_workqueue
> > init done 87e28b47.
> > [2022-01-06 15:16:10] <7>[6.229979]
> > [drm:amdgpu_dm_init.isra.0.cold [amdgpu]]
> > amdgpu_dm_connector_init()
> >
> > ... which we can see again several times later, though the driver
> > seems to get far enough to finish init:
> >
> > [2022-01-06 15:16:10] <6>[6.615615] [drm] late_init of IP block
> > ...
> > [2022-01-06 15:16:10] <6>[6.615772] [drm] late_init of IP block
> > ...
> > [2022-01-06 15:16:10] 

Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2022-01-06 Thread Alex Deucher
On Thu, Jan 6, 2022 at 10:38 AM Yann Dirson  wrote:
>
> Alex wrote:
> > > How is the stolen memory communicated to the driver ?  That host
> > > physical
> > > memory probably has to be mapped at the same guest physical address
> > > for
> > > the magic to work, right ?
> >
> > Correct.  The driver reads the physical location of that memory from
> > hardware registers.  Removing this chunk of code from gmc_v9_0.c will
> > force the driver to use the BAR,
>
> That would only be a workaround for a missing mapping of stolen
> memory to the guest, right ?


Correct. That will use the PCI BAR rather than the underlying physical
memory for CPU access to the carve out region.
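
For context, here is roughly what sits just above that chunk in gmc_v9_0_mc_init(), paraphrased from memory of a 5.1x kernel (exact lines may differ): the aperture is first set from the PCI BAR, and the APU/XGMI branch quoted further down then overrides it, so dropping that branch leaves the BAR-based values in place.

    /* default CPU aperture: the VRAM PCI BAR (BAR 0) */
    adev->gmc.aper_base = pci_resource_start(adev->pdev, 0);
    adev->gmc.aper_size = pci_resource_len(adev->pdev, 0);

    /* ... followed (under #ifdef CONFIG_X86_64, if I remember correctly) by
     * the APU/XGMI branch quoted below, which overrides these two fields
     * with the carve out's host physical address and size */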

>
>
> > but I'm not sure if there are any
> > other places in the driver that make assumptions about using the
> > physical host address or not on APUs off hand.
>
> gmc_v9_0_vram_gtt_location() updates vm_manager.vram_base_offset from
> the same value.  I'm not sure I understand why in this case there is
> no reason to use the BAR while there are some in gmc_v9_0_mc_init().
>
> vram_base_offset then gets used in several places:
>
> * amdgpu_gmc_init_pdb0, that seems likely enough to be problematic,
>   right ?
>   As a sidenote the XGMI offset added earlier gets subtracted
>   here to deduce vram base addr
>   (a couple of new acronyms there: PDB, PDE -- page directory base/entry?)
>
> * amdgpu_ttm_map_buffer, amdgpu_vm_bo_update_mapping: those seem to be
>   as problematic
>
> * amdgpu_gmc_vram_mc2pa: until I got there I had assumed MC could stand for
>   "memory controller", but then "MC address of buffer" makes me doubt
>
>

MC = memory controller (as in graphics memory controller).

These are GPU addresses not CPU addresses so they should be fine.
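
As a rough illustration of the translation involved (paraphrasing amdgpu_gmc.c from memory, so treat it as a sketch rather than the exact code), the "MC address" of a vram buffer is turned back into a physical address using vram_base_offset:

uint64_t amdgpu_gmc_vram_mc2pa(struct amdgpu_device *adev, uint64_t mc_addr)
{
	/* MC (GPU-internal) address -> physical address of the backing
	 * memory, using the vram_base_offset discussed above */
	return mc_addr - adev->gmc.vram_start + adev->vm_manager.vram_base_offset;
}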

> >
> > if ((adev->flags & AMD_IS_APU) ||
> > (adev->gmc.xgmi.supported &&
> >  adev->gmc.xgmi.connected_to_cpu)) {
> > adev->gmc.aper_base =
> > adev->gfxhub.funcs->get_mc_fb_offset(adev) +
> > adev->gmc.xgmi.physical_node_id *
> > adev->gmc.xgmi.node_segment_size;
> > adev->gmc.aper_size = adev->gmc.real_vram_size;
> > }
>
>
> Now for the test... it does indeed seem to go much further, I even
> lose the dom0's efifb to that black screen, hopefully showing the
> driver started to set up the hardware.  Will probably still have to
> hunt down whether it still tries to use efifb afterwards (can't see
> why it would not, TBH, given the previous behaviour where it kept
> using it after the guest failed to start).
>
> The log shows many details about TMR loading
>
> Then as expected:
>
> [2022-01-06 15:16:09] <6>[5.844589] amdgpu :00:05.0: amdgpu: RAP: 
> optional rap ta ucode is not available
> [2022-01-06 15:16:09] <6>[5.844619] amdgpu :00:05.0: amdgpu: 
> SECUREDISPLAY: securedisplay ta ucode is not available
> [2022-01-06 15:16:09] <7>[5.844639] [drm:amdgpu_device_init.cold 
> [amdgpu]] hw_init (phase2) of IP block ...
> [2022-01-06 15:16:09] <6>[5.845515] amdgpu :00:05.0: amdgpu: SMU is 
> initialized successfully!
>
>
> not sure about that unhandled interrupt (and a bit worried about messed-up 
> logs):
>
> [2022-01-06 15:16:09] <7>[6.010681] amdgpu :00:05.0: 
> [drm:amdgpu_ring_test_hel[2022-01-06 15:16:10] per [amdgpu]] ring test on 
> sdma0 succeeded
> [2022-01-06 15:16:10] <7>[6.010831] [drm:amdgpu_ih_process [amdgpu]] 
> amdgpu_ih_process: rptr 0, wptr 32
> [2022-01-06 15:16:10] <7>[6.011002] [drm:amdgpu_irq_dispatch [amdgpu]] 
> Unhandled interrupt src_id: 243
>
>
> then comes a first error:
>
> [2022-01-06 15:16:10] <6>[6.011785] [drm] Display Core initialized with 
> v3.2.149!
> [2022-01-06 15:16:10] <6>[6.012714] [drm] DMUB hardware initialized: 
> version=0x0101001C
> [2022-01-06 15:16:10] <3>[6.228263] [drm:dc_dmub_srv_wait_idle [amdgpu]] 
> *ERROR* Error waiting for DMUB idle: status=3
> [2022-01-06 15:16:10] <7>[6.229125] [drm:amdgpu_dm_init.isra.0.cold 
> [amdgpu]] amdgpu: freesync_module init done 76c7b459.
> [2022-01-06 15:16:10] <7>[6.229677] [drm:amdgpu_dm_init.isra.0.cold 
> [amdgpu]] amdgpu: hdcp_workqueue init done 87e28b47.
> [2022-01-06 15:16:10] <7>[6.229979] [drm:amdgpu_dm_init.isra.0.cold 
> [amdgpu]] amdgpu_dm_connector_init()
>
> ... which we can see again several times later, though the driver seems
> to get far enough to finish init:
>
> [2022-01-06 15:16:10] <6>[6.615615] [drm] late_init of IP block ...
> [2022-01-06 15:16:10] <6>[6.615772] [drm] late_init of IP block 
> ...
> [2022-01-06 15:16:10] <6>[6.615801] [drm] late_init of IP block 
> ...
> [2022-01-06 15:16:10] <6>[6.615827] [drm] late_init of IP block ...
> [2022-01-06 15:16:10] <3>[6.801790] [drm:dc_dmub_srv_wait_idle [amdgpu]] 
> *ERROR* Error waiting for DMUB idle: status=3
> [2022-01-06 15:16:10] <7>[6.806079] [drm:drm_minor_register [drm]]
> [2022-01-06 

Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2022-01-06 Thread Yann Dirson
Alex wrote:
> > How is the stolen memory communicated to the driver ?  That host
> > physical
> > memory probably has to be mapped at the same guest physical address
> > for
> > the magic to work, right ?
> 
> Correct.  The driver reads the physical location of that memory from
> hardware registers.  Removing this chunk of code from gmc_v9_0.c will
> force the driver to use the BAR,

That would only be a workaround for a missing mapping of stolen
memory to the guest, right ?


> but I'm not sure if there are any
> other places in the driver that make assumptions about using the
> physical host address or not on APUs off hand.

gmc_v9_0_vram_gtt_location() updates vm_manager.vram_base_offset from
the same value.  I'm not sure I understand why in this case there is
no reason to use the BAR while there are some in gmc_v9_0_mc_init().
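
For reference, the lines I mean look roughly like this (paraphrasing gmc_v9_0.c from memory, the exact code may differ):

	/* base offset of vram pages */
	adev->vm_manager.vram_base_offset = adev->gfxhub.funcs->get_mc_fb_offset(adev);

	/* add the xgmi offset of the physical node */
	adev->vm_manager.vram_base_offset +=
		adev->gmc.xgmi.physical_node_id * adev->gmc.xgmi.node_segment_size;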

vram_base_offset then gets used in several places:

* amdgpu_gmc_init_pdb0, that seems likely enough to be problematic,
  right ?
  As a sidenote the XGMI offset added earlier gets subtracted
  here to deduce the vram base addr (rough sketch after this list)
  (a couple of new acronyms there: PDB, PDE -- page directory base/entry?)

* amdgpu_ttm_map_buffer, amdgpu_vm_bo_update_mapping: those seem to be
  as problematic

* amdgpu_gmc_vram_mc2pa: until I got there I had assumed MC could stand for 
  "memory controller", but then "MC address of buffer" makes me doubt


> 
> if ((adev->flags & AMD_IS_APU) ||
> (adev->gmc.xgmi.supported &&
>  adev->gmc.xgmi.connected_to_cpu)) {
> adev->gmc.aper_base =
> adev->gfxhub.funcs->get_mc_fb_offset(adev) +
> adev->gmc.xgmi.physical_node_id *
> adev->gmc.xgmi.node_segment_size;
> adev->gmc.aper_size = adev->gmc.real_vram_size;
> }


Now for the test... it does indeed seem to go much further, I even
lose the dom0's efifb to that black screen, hopefully showing the
driver started to set up the hardware.  Will probably still have to
hunt down whether it still tries to use efifb afterwards (can't see
why it would not, TBH, given the previous behaviour where it kept
using it after the guest failed to start).

The log shows many details about TMR loading

Then as expected:

[2022-01-06 15:16:09] <6>[5.844589] amdgpu :00:05.0: amdgpu: RAP: 
optional rap ta ucode is not available
[2022-01-06 15:16:09] <6>[5.844619] amdgpu :00:05.0: amdgpu: 
SECUREDISPLAY: securedisplay ta ucode is not available
[2022-01-06 15:16:09] <7>[5.844639] [drm:amdgpu_device_init.cold [amdgpu]] 
hw_init (phase2) of IP block ...
[2022-01-06 15:16:09] <6>[5.845515] amdgpu :00:05.0: amdgpu: SMU is 
initialized successfully!


not sure about that unhandled interrupt (and a bit worried about messed-up 
logs):

[2022-01-06 15:16:09] <7>[6.010681] amdgpu :00:05.0: 
[drm:amdgpu_ring_test_hel[2022-01-06 15:16:10] per [amdgpu]] ring test on sdma0 
succeeded
[2022-01-06 15:16:10] <7>[6.010831] [drm:amdgpu_ih_process [amdgpu]] 
amdgpu_ih_process: rptr 0, wptr 32
[2022-01-06 15:16:10] <7>[6.011002] [drm:amdgpu_irq_dispatch [amdgpu]] 
Unhandled interrupt src_id: 243


then comes a first error:

[2022-01-06 15:16:10] <6>[6.011785] [drm] Display Core initialized with 
v3.2.149!
[2022-01-06 15:16:10] <6>[6.012714] [drm] DMUB hardware initialized: 
version=0x0101001C
[2022-01-06 15:16:10] <3>[6.228263] [drm:dc_dmub_srv_wait_idle [amdgpu]] 
*ERROR* Error waiting for DMUB idle: status=3
[2022-01-06 15:16:10] <7>[6.229125] [drm:amdgpu_dm_init.isra.0.cold 
[amdgpu]] amdgpu: freesync_module init done 76c7b459.
[2022-01-06 15:16:10] <7>[6.229677] [drm:amdgpu_dm_init.isra.0.cold 
[amdgpu]] amdgpu: hdcp_workqueue init done 87e28b47.
[2022-01-06 15:16:10] <7>[6.229979] [drm:amdgpu_dm_init.isra.0.cold 
[amdgpu]] amdgpu_dm_connector_init()

... which we can see again several times later, though the driver seems
to get far enough to finish init:

[2022-01-06 15:16:10] <6>[6.615615] [drm] late_init of IP block ...
[2022-01-06 15:16:10] <6>[6.615772] [drm] late_init of IP block 
...
[2022-01-06 15:16:10] <6>[6.615801] [drm] late_init of IP block 
...
[2022-01-06 15:16:10] <6>[6.615827] [drm] late_init of IP block ...
[2022-01-06 15:16:10] <3>[6.801790] [drm:dc_dmub_srv_wait_idle [amdgpu]] 
*ERROR* Error waiting for DMUB idle: status=3
[2022-01-06 15:16:10] <7>[6.806079] [drm:drm_minor_register [drm]] 
[2022-01-06 15:16:10] <7>[6.806195] [drm:drm_minor_register [drm]] new 
minor registered 128
[2022-01-06 15:16:10] <7>[6.806223] [drm:drm_minor_register [drm]] 
[2022-01-06 15:16:10] <7>[6.806289] [drm:drm_minor_register [drm]] new 
minor registered 0
[2022-01-06 15:16:10] <7>[6.806355] [drm:drm_sysfs_connector_add [drm]] 
adding "eDP-1" to sysfs
[2022-01-06 15:16:10] <7>[6.806424] [drm:drm_dp_aux_register_devnode 
[drm_kms_helper]] drm_dp_aux_dev: aux [AMDGPU DM aux hw bus 0] 

Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-12-29 Thread Alex Deucher
On Wed, Dec 29, 2021 at 12:34 PM Yann Dirson  wrote:
>
> Alex wrote:
> > On Wed, Dec 29, 2021 at 11:59 AM Yann Dirson  wrote:
> > >
> > > Alex wrote:
> > > > On Tue, Dec 21, 2021 at 6:09 PM Yann Dirson 
> > > > wrote:
> > > > >
> > > > >
> > > > >
> > > > > - Original Message -
> > > > > > From: "Alex Deucher" 
> > > > > > To: "Yann Dirson" 
> > > > > > Cc: "Christian König" ,
> > > > > > "amd-gfx list" 
> > > > > > Sent: Tuesday, December 21, 2021 23:31:01
> > > > > > Subject: Re: Various problems trying to vga-passthrough a
> > > > > > Renoir
> > > > > > iGPU to a xen/qubes-os hvm
> > > > > >
> > > > > > On Tue, Dec 21, 2021 at 5:12 PM Yann Dirson 
> > > > > > wrote:
> > > > > > >
> > > > > > >
> > > > > > > Alex wrote:
> > > > > > > >
> > > > > > > > On Sun, Dec 19, 2021 at 11:41 AM Yann Dirson
> > > > > > > > 
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Christian wrote:
> > > > > > > > > > Am 19.12.21 um 17:00 schrieb Yann Dirson:
> > > > > > > > > > > Alex wrote:
> > > > > > > > > > >> Thinking about this more, I think the problem
> > > > > > > > > > >> might be
> > > > > > > > > > >> related
> > > > > > > > > > >> to
> > > > > > > > > > >> CPU
> > > > > > > > > > >> access to "VRAM".  APUs don't have dedicated VRAM,
> > > > > > > > > > >> they
> > > > > > > > > > >> use a
> > > > > > > > > > >> reserved
> > > > > > > > > > >> carve out region at the top of system memory.  For
> > > > > > > > > > >> CPU
> > > > > > > > > > >> access
> > > > > > > > > > >> to
> > > > > > > > > > >> this
> > > > > > > > > > >> memory, we kmap the physical address of the carve
> > > > > > > > > > >> out
> > > > > > > > > > >> region
> > > > > > > > > > >> of
> > > > > > > > > > >> system
> > > > > > > > > > >> memory.  You'll need to make sure that region is
> > > > > > > > > > >> accessible to
> > > > > > > > > > >> the
> > > > > > > > > > >> guest.
> > > > > > > > > > > So basically, the non-virt flow is: (video?)
> > > > > > > > > > > BIOS
> > > > > > > > > > > reserves
> > > > > > > > > > > memory, marks it
> > > > > > > > > > > as reserved in e820, stores the physaddr somewhere,
> > > > > > > > > > > which
> > > > > > > > > > > the
> > > > > > > > > > > GPU
> > > > > > > > > > > driver gets.
> > > > > > > > > > > Since I suppose this includes the framebuffer, this
> > > > > > > > > > > probably
> > > > > > > > > > > has to
> > > > > > > > > > > occur around
> > > > > > > > > > > the moment the driver calls
> > > > > > > > > > > drm_aperture_remove_conflicting_pci_framebuffers()
> > > > > > > > > > > (which happens before this hw init step), right ?
> > > > > > > > > >
> > > > > > > > > > Well, that's partially correct. The efifb is using the
> > > > > > > > > > PCIe
> > > > > > > > > > resources
> > > > > > > > > > to
> > > > > > > > > > access the framebuffer and as far as I know we use
> > > > > > > > > > that
> > > > > > > > > > one
> > > > > > > > > > to
> > > > > > > > > > kick
> > > > > > > > > > it out.
> > > > > > > > > >
> > > > > > > > > > The stolen memory we get over e820/registers is
> > > > > > > > > > separate
> > > > > > > > > > to
> > > > > > > > > > that.
> > > > > > >
> > > > > > > How is the stolen memory communicated to the driver ?  That
> > > > > > > host
> > > > > > > physical
> > > > > > > memory probably has to be mapped at the same guest physical
> > > > > > > address
> > > > > > > for
> > > > > > > the magic to work, right ?
> > > > > >
> > > > > > Correct.  The driver reads the physical location of that
> > > > > > memory
> > > > > > from
> > > > > > hardware registers.  Removing this chunk of code from
> > > > > > gmc_v9_0.c
> > > > > > will
> > > > > > force the driver to use the BAR, but I'm not sure if there
> > > > > > are
> > > > > > any
> > > > > > other places in the driver that make assumptions about using
> > > > > > the
> > > > > > physical host address or not on APUs off hand.
> > > > > >
> > > > > > if ((adev->flags & AMD_IS_APU) ||
> > > > > > (adev->gmc.xgmi.supported &&
> > > > > >  adev->gmc.xgmi.connected_to_cpu)) {
> > > > > > adev->gmc.aper_base =
> > > > > > adev->gfxhub.funcs->get_mc_fb_offset(adev)
> > > > > > +
> > > > > > adev->gmc.xgmi.physical_node_id *
> > > > > > adev->gmc.xgmi.node_segment_size;
> > > > > > adev->gmc.aper_size =
> > > > > > adev->gmc.real_vram_size;
> > > > > > }
> > > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > > > >
> > > > > > > > > > > ... which brings me to a point that's been puzzling
> > > > > > > > > > > me
> > > > > > > > > > > for
> > > > > > > > > > > some
> > > > > > > > > > > time, which is
> > > > > > > > > > > that as the hw init fails, the efifb driver is
> > > > > > > > > > > still
> > > > > > > > > > > using
> > > > > > > > > > > the
> > > > > > > > > > > framebuffer.
> > > > > > > > > >
> > > > > > > > > > No, it 

Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-12-29 Thread Yann Dirson
Alex wrote:
> On Wed, Dec 29, 2021 at 11:59 AM Yann Dirson  wrote:
> >
> > Alex wrote:
> > > On Tue, Dec 21, 2021 at 6:09 PM Yann Dirson 
> > > wrote:
> > > >
> > > >
> > > >
> > > > - Original Message -
> > > > > From: "Alex Deucher" 
> > > > > To: "Yann Dirson" 
> > > > > Cc: "Christian König" ,
> > > > > "amd-gfx list" 
> > > > > Sent: Tuesday, December 21, 2021 23:31:01
> > > > > Subject: Re: Various problems trying to vga-passthrough a
> > > > > Renoir
> > > > > iGPU to a xen/qubes-os hvm
> > > > >
> > > > > On Tue, Dec 21, 2021 at 5:12 PM Yann Dirson 
> > > > > wrote:
> > > > > >
> > > > > >
> > > > > > Alex wrote:
> > > > > > >
> > > > > > > On Sun, Dec 19, 2021 at 11:41 AM Yann Dirson
> > > > > > > 
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Christian wrote:
> > > > > > > > > Am 19.12.21 um 17:00 schrieb Yann Dirson:
> > > > > > > > > > Alex wrote:
> > > > > > > > > >> Thinking about this more, I think the problem
> > > > > > > > > >> might be
> > > > > > > > > >> related
> > > > > > > > > >> to
> > > > > > > > > >> CPU
> > > > > > > > > >> access to "VRAM".  APUs don't have dedicated VRAM,
> > > > > > > > > >> they
> > > > > > > > > >> use a
> > > > > > > > > >> reserved
> > > > > > > > > >> carve out region at the top of system memory.  For
> > > > > > > > > >> CPU
> > > > > > > > > >> access
> > > > > > > > > >> to
> > > > > > > > > >> this
> > > > > > > > > >> memory, we kmap the physical address of the carve
> > > > > > > > > >> out
> > > > > > > > > >> region
> > > > > > > > > >> of
> > > > > > > > > >> system
> > > > > > > > > >> memory.  You'll need to make sure that region is
> > > > > > > > > >> accessible to
> > > > > > > > > >> the
> > > > > > > > > >> guest.
> > > > > > > > > > So basically, the non-virt flow is: (video?)
> > > > > > > > > > BIOS
> > > > > > > > > > reserves
> > > > > > > > > > memory, marks it
> > > > > > > > > > as reserved in e820, stores the physaddr somewhere,
> > > > > > > > > > which
> > > > > > > > > > the
> > > > > > > > > > GPU
> > > > > > > > > > driver gets.
> > > > > > > > > > Since I suppose this includes the framebuffer, this
> > > > > > > > > > probably
> > > > > > > > > > has to
> > > > > > > > > > occur around
> > > > > > > > > > the moment the driver calls
> > > > > > > > > > drm_aperture_remove_conflicting_pci_framebuffers()
> > > > > > > > > > (which happens before this hw init step), right ?
> > > > > > > > >
> > > > > > > > > Well, that's partially correct. The efifb is using the
> > > > > > > > > PCIe
> > > > > > > > > resources
> > > > > > > > > to
> > > > > > > > > access the framebuffer and as far as I know we use
> > > > > > > > > that
> > > > > > > > > one
> > > > > > > > > to
> > > > > > > > > kick
> > > > > > > > > it out.
> > > > > > > > >
> > > > > > > > > The stolen memory we get over e820/registers is
> > > > > > > > > separate
> > > > > > > > > to
> > > > > > > > > that.
> > > > > >
> > > > > > How is the stolen memory communicated to the driver ?  That
> > > > > > host
> > > > > > physical
> > > > > > memory probably has to be mapped at the same guest physical
> > > > > > address
> > > > > > for
> > > > > > the magic to work, right ?
> > > > >
> > > > > Correct.  The driver reads the physical location of that
> > > > > memory
> > > > > from
> > > > > hardware registers.  Removing this chunk of code from
> > > > > gmc_v9_0.c
> > > > > will
> > > > > force the driver to use the BAR, but I'm not sure if there
> > > > > are
> > > > > any
> > > > > other places in the driver that make assumptions about using
> > > > > the
> > > > > physical host address or not on APUs off hand.
> > > > >
> > > > > if ((adev->flags & AMD_IS_APU) ||
> > > > > (adev->gmc.xgmi.supported &&
> > > > >  adev->gmc.xgmi.connected_to_cpu)) {
> > > > > adev->gmc.aper_base =
> > > > > adev->gfxhub.funcs->get_mc_fb_offset(adev)
> > > > > +
> > > > > adev->gmc.xgmi.physical_node_id *
> > > > > adev->gmc.xgmi.node_segment_size;
> > > > > adev->gmc.aper_size =
> > > > > adev->gmc.real_vram_size;
> > > > > }
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > > > > >
> > > > > > > > > > ... which brings me to a point that's been puzzling
> > > > > > > > > > me
> > > > > > > > > > for
> > > > > > > > > > some
> > > > > > > > > > time, which is
> > > > > > > > > > that as the hw init fails, the efifb driver is
> > > > > > > > > > still
> > > > > > > > > > using
> > > > > > > > > > the
> > > > > > > > > > framebuffer.
> > > > > > > > >
> > > > > > > > > No, it isn't. You are probably just still seeing the
> > > > > > > > > same
> > > > > > > > > screen.
> > > > > > > > >
> > > > > > > > > The issue is most likely that while efi was kicked
> > > > > > > > > out
> > > > > > > > > nobody
> > > > > > > > > re-programmed the display hardware to show something
> > > > > > > > > different.

Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-12-29 Thread Alex Deucher
On Wed, Dec 29, 2021 at 11:59 AM Yann Dirson  wrote:
>
> Alex wrote:
> > On Tue, Dec 21, 2021 at 6:09 PM Yann Dirson  wrote:
> > >
> > >
> > >
> > > - Original Message -
> > > > From: "Alex Deucher" 
> > > > To: "Yann Dirson" 
> > > > Cc: "Christian König" ,
> > > > "amd-gfx list" 
> > > > Sent: Tuesday, December 21, 2021 23:31:01
> > > > Subject: Re: Various problems trying to vga-passthrough a Renoir
> > > > iGPU to a xen/qubes-os hvm
> > > >
> > > > On Tue, Dec 21, 2021 at 5:12 PM Yann Dirson 
> > > > wrote:
> > > > >
> > > > >
> > > > > Alex wrote:
> > > > > >
> > > > > > On Sun, Dec 19, 2021 at 11:41 AM Yann Dirson
> > > > > > 
> > > > > > wrote:
> > > > > > >
> > > > > > > Christian wrote:
> > > > > > > > Am 19.12.21 um 17:00 schrieb Yann Dirson:
> > > > > > > > > Alex wrote:
> > > > > > > > >> Thinking about this more, I think the problem might be
> > > > > > > > >> related
> > > > > > > > >> to
> > > > > > > > >> CPU
> > > > > > > > >> access to "VRAM".  APUs don't have dedicated VRAM,
> > > > > > > > >> they
> > > > > > > > >> use a
> > > > > > > > >> reserved
> > > > > > > > >> carve out region at the top of system memory.  For CPU
> > > > > > > > >> access
> > > > > > > > >> to
> > > > > > > > >> this
> > > > > > > > >> memory, we kmap the physical address of the carve out
> > > > > > > > >> region
> > > > > > > > >> of
> > > > > > > > >> system
> > > > > > > > >> memory.  You'll need to make sure that region is
> > > > > > > > >> accessible to
> > > > > > > > >> the
> > > > > > > > >> guest.
> > > > > > > > > So basically, the non-virt flow is: (video?) BIOS
> > > > > > > > > reserves
> > > > > > > > > memory, marks it
> > > > > > > > > as reserved in e820, stores the physaddr somewhere,
> > > > > > > > > which
> > > > > > > > > the
> > > > > > > > > GPU
> > > > > > > > > driver gets.
> > > > > > > > > Since I suppose this includes the framebuffer, this
> > > > > > > > > probably
> > > > > > > > > has to
> > > > > > > > > occur around
> > > > > > > > > the moment the driver calls
> > > > > > > > > drm_aperture_remove_conflicting_pci_framebuffers()
> > > > > > > > > (which happens before this hw init step), right ?
> > > > > > > >
> > > > > > > > Well, that's partially correct. The efifb is using the PCIe
> > > > > > > > resources
> > > > > > > > to
> > > > > > > > access the framebuffer and as far as I know we use that
> > > > > > > > one
> > > > > > > > to
> > > > > > > > kick
> > > > > > > > it out.
> > > > > > > >
> > > > > > > > The stolen memory we get over e820/registers is separate
> > > > > > > > to
> > > > > > > > that.
> > > > >
> > > > > How is the stolen memory communicated to the driver ?  That
> > > > > host
> > > > > physical
> > > > > memory probably has to be mapped at the same guest physical
> > > > > address
> > > > > for
> > > > > the magic to work, right ?
> > > >
> > > > Correct.  The driver reads the physical location of that memory
> > > > from
> > > > hardware registers.  Removing this chunk of code from gmc_v9_0.c
> > > > will
> > > > force the driver to use the BAR, but I'm not sure if there are
> > > > any
> > > > other places in the driver that make assumptions about using the
> > > > physical host address or not on APUs off hand.
> > > >
> > > > if ((adev->flags & AMD_IS_APU) ||
> > > > (adev->gmc.xgmi.supported &&
> > > >  adev->gmc.xgmi.connected_to_cpu)) {
> > > > adev->gmc.aper_base =
> > > > adev->gfxhub.funcs->get_mc_fb_offset(adev)
> > > > +
> > > > adev->gmc.xgmi.physical_node_id *
> > > > adev->gmc.xgmi.node_segment_size;
> > > > adev->gmc.aper_size = adev->gmc.real_vram_size;
> > > > }
> > > >
> > > >
> > > >
> > > > >
> > > > > > > >
> > > > > > > > > ... which brings me to a point that's been puzzling me
> > > > > > > > > for
> > > > > > > > > some
> > > > > > > > > time, which is
> > > > > > > > > that as the hw init fails, the efifb driver is still
> > > > > > > > > using
> > > > > > > > > the
> > > > > > > > > framebuffer.
> > > > > > > >
> > > > > > > > No, it isn't. You are probably just still seeing the same
> > > > > > > > screen.
> > > > > > > >
> > > > > > > > The issue is most likely that while efi was kicked out
> > > > > > > > nobody
> > > > > > > > re-programmed the display hardware to show something
> > > > > > > > different.
> > > > > > > >
> > > > > > > > > Am I right in suspecting that efifb should get stripped
> > > > > > > > > of
> > > > > > > > > its
> > > > > > > > > ownership of the
> > > > > > > > > fb aperture first, and that if I don't get a black
> > > > > > > > > screen
> > > > > > > > > on
> > > > > > > > > hw_init failure
> > > > > > > > > that issue should be the first focus point ?
> > > > > > > >
> > > > > > > > Your assumption with the black screen is incorrect. Since
> > > > > > > > the
> > > > > > > > hardware
> > > > > > > > works independently even if you 

Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-12-29 Thread Yann Dirson
Alex wrote:
> On Tue, Dec 21, 2021 at 6:09 PM Yann Dirson  wrote:
> >
> >
> >
> > - Original Message -
> > > From: "Alex Deucher" 
> > > To: "Yann Dirson" 
> > > Cc: "Christian König" ,
> > > "amd-gfx list" 
> > > Sent: Tuesday, December 21, 2021 23:31:01
> > > Subject: Re: Various problems trying to vga-passthrough a Renoir
> > > iGPU to a xen/qubes-os hvm
> > >
> > > On Tue, Dec 21, 2021 at 5:12 PM Yann Dirson 
> > > wrote:
> > > >
> > > >
> > > > Alex wrote:
> > > > >
> > > > > On Sun, Dec 19, 2021 at 11:41 AM Yann Dirson
> > > > > 
> > > > > wrote:
> > > > > >
> > > > > > Christian wrote:
> > > > > > > Am 19.12.21 um 17:00 schrieb Yann Dirson:
> > > > > > > > Alex wrote:
> > > > > > > >> Thinking about this more, I think the problem might be
> > > > > > > >> related
> > > > > > > >> to
> > > > > > > >> CPU
> > > > > > > >> access to "VRAM".  APUs don't have dedicated VRAM,
> > > > > > > >> they
> > > > > > > >> use a
> > > > > > > >> reserved
> > > > > > > >> carve out region at the top of system memory.  For CPU
> > > > > > > >> access
> > > > > > > >> to
> > > > > > > >> this
> > > > > > > >> memory, we kmap the physical address of the carve out
> > > > > > > >> region
> > > > > > > >> of
> > > > > > > >> system
> > > > > > > >> memory.  You'll need to make sure that region is
> > > > > > > >> accessible to
> > > > > > > >> the
> > > > > > > >> guest.
> > > > > > > > So basically, the non-virt flow is: (video?) BIOS
> > > > > > > > reserves
> > > > > > > > memory, marks it
> > > > > > > > as reserved in e820, stores the physaddr somewhere,
> > > > > > > > which
> > > > > > > > the
> > > > > > > > GPU
> > > > > > > > driver gets.
> > > > > > > > Since I suppose this includes the framebuffer, this
> > > > > > > > probably
> > > > > > > > has to
> > > > > > > > occur around
> > > > > > > > the moment the driver calls
> > > > > > > > drm_aperture_remove_conflicting_pci_framebuffers()
> > > > > > > > (which happens before this hw init step), right ?
> > > > > > >
> > > > > > > Well, that's partially correct. The efifb is using the PCIe
> > > > > > > resources
> > > > > > > to
> > > > > > > access the framebuffer and as far as I know we use that
> > > > > > > one
> > > > > > > to
> > > > > > > kick
> > > > > > > it out.
> > > > > > >
> > > > > > > The stolen memory we get over e820/registers is separate
> > > > > > > to
> > > > > > > that.
> > > >
> > > > How is the stolen memory communicated to the driver ?  That
> > > > host
> > > > physical
> > > > memory probably has to be mapped at the same guest physical
> > > > address
> > > > for
> > > > the magic to work, right ?
> > >
> > > Correct.  The driver reads the physical location of that memory
> > > from
> > > hardware registers.  Removing this chunk of code from gmc_v9_0.c
> > > will
> > > force the driver to use the BAR, but I'm not sure if there are
> > > any
> > > other places in the driver that make assumptions about using the
> > > physical host address or not on APUs off hand.
> > >
> > > if ((adev->flags & AMD_IS_APU) ||
> > > (adev->gmc.xgmi.supported &&
> > >  adev->gmc.xgmi.connected_to_cpu)) {
> > > adev->gmc.aper_base =
> > > adev->gfxhub.funcs->get_mc_fb_offset(adev)
> > > +
> > > adev->gmc.xgmi.physical_node_id *
> > > adev->gmc.xgmi.node_segment_size;
> > > adev->gmc.aper_size = adev->gmc.real_vram_size;
> > > }
> > >
> > >
> > >
> > > >
> > > > > > >
> > > > > > > > ... which brings me to a point that's been puzzling me
> > > > > > > > for
> > > > > > > > some
> > > > > > > > time, which is
> > > > > > > > that as the hw init fails, the efifb driver is still
> > > > > > > > using
> > > > > > > > the
> > > > > > > > framebuffer.
> > > > > > >
> > > > > > > No, it isn't. You are probably just still seeing the same
> > > > > > > screen.
> > > > > > >
> > > > > > > The issue is most likely that while efi was kicked out
> > > > > > > nobody
> > > > > > > re-programmed the display hardware to show something
> > > > > > > different.
> > > > > > >
> > > > > > > > Am I right in suspecting that efifb should get stripped
> > > > > > > > of
> > > > > > > > its
> > > > > > > > ownership of the
> > > > > > > > fb aperture first, and that if I don't get a black
> > > > > > > > screen
> > > > > > > > on
> > > > > > > > hw_init failure
> > > > > > > > that issue should be the first focus point ?
> > > > > > >
> > > > > > > Your assumption with the black screen is incorrect. Since
> > > > > > > the
> > > > > > > hardware
> > > > > > > works independently even if you kick out efi you still have
> > > > > > > the
> > > > > > > same
> > > > > > > screen content, you just can't update it anymore.
> > > > > >
> > > > > > It's not only that the screen keeps its contents, it's that
> > > > > > the
> > > > > > dom0
> > > > > > happily continues updating it.
> > > > >
> > > > > If the hypervisor is 

Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-12-22 Thread Alex Deucher
On Tue, Dec 21, 2021 at 6:09 PM Yann Dirson  wrote:
>
>
>
> - Original Message -
> > From: "Alex Deucher" 
> > To: "Yann Dirson" 
> > Cc: "Christian König" , "amd-gfx list" 
> > 
> > Sent: Tuesday, December 21, 2021 23:31:01
> > Subject: Re: Various problems trying to vga-passthrough a Renoir iGPU to a 
> > xen/qubes-os hvm
> >
> > On Tue, Dec 21, 2021 at 5:12 PM Yann Dirson  wrote:
> > >
> > >
> > > Alex wrote:
> > > >
> > > > On Sun, Dec 19, 2021 at 11:41 AM Yann Dirson 
> > > > wrote:
> > > > >
> > > > > Christian wrote:
> > > > > > Am 19.12.21 um 17:00 schrieb Yann Dirson:
> > > > > > > Alex wrote:
> > > > > > >> Thinking about this more, I think the problem might be
> > > > > > >> related
> > > > > > >> to
> > > > > > >> CPU
> > > > > > >> access to "VRAM".  APUs don't have dedicated VRAM, they
> > > > > > >> use a
> > > > > > >> reserved
> > > > > > >> carve out region at the top of system memory.  For CPU
> > > > > > >> access
> > > > > > >> to
> > > > > > >> this
> > > > > > >> memory, we kmap the physical address of the carve out
> > > > > > >> region
> > > > > > >> of
> > > > > > >> system
> > > > > > >> memory.  You'll need to make sure that region is
> > > > > > >> accessible to
> > > > > > >> the
> > > > > > >> guest.
> > > > > > > So basically, the non-virt flow is: (video?) BIOS
> > > > > > > reserves
> > > > > > > memory, marks it
> > > > > > > as reserved in e820, stores the physaddr somewhere, which
> > > > > > > the
> > > > > > > GPU
> > > > > > > driver gets.
> > > > > > > Since I suppose this includes the framebuffer, this
> > > > > > > probably
> > > > > > > has to
> > > > > > > occur around
> > > > > > > the moment the driver calls
> > > > > > > drm_aperture_remove_conflicting_pci_framebuffers()
> > > > > > > (which happens before this hw init step), right ?
> > > > > >
> > > > > > Well, that's partially correct. The efifb is using the PCIe
> > > > > > resources
> > > > > > to
> > > > > > access the framebuffer and as far as I know we use that one
> > > > > > to
> > > > > > kick
> > > > > > it out.
> > > > > >
> > > > > > The stolen memory we get over e820/registers is separate to
> > > > > > that.
> > >
> > > How is the stolen memory communicated to the driver ?  That host
> > > physical
> > > memory probably has to be mapped at the same guest physical address
> > > for
> > > the magic to work, right ?
> >
> > Correct.  The driver reads the physical location of that memory from
> > hardware registers.  Removing this chunk of code from gmc_v9_0.c will
> > force the driver to use the BAR, but I'm not sure if there are any
> > other places in the driver that make assumptions about using the
> > physical host address or not on APUs off hand.
> >
> > if ((adev->flags & AMD_IS_APU) ||
> > (adev->gmc.xgmi.supported &&
> >  adev->gmc.xgmi.connected_to_cpu)) {
> > adev->gmc.aper_base =
> > adev->gfxhub.funcs->get_mc_fb_offset(adev) +
> > adev->gmc.xgmi.physical_node_id *
> > adev->gmc.xgmi.node_segment_size;
> > adev->gmc.aper_size = adev->gmc.real_vram_size;
> > }
> >
> >
> >
> > >
> > > > > >
> > > > > > > ... which brings me to a point that's been puzzling me for
> > > > > > > some
> > > > > > > time, which is
> > > > > > > that as the hw init fails, the efifb driver is still using
> > > > > > > the
> > > > > > > framebuffer.
> > > > > >
> > > > > > No, it isn't. You are probably just still seeing the same
> > > > > > screen.
> > > > > >
> > > > > > The issue is most likely that while efi was kicked out nobody
> > > > > > re-programmed the display hardware to show something
> > > > > > different.
> > > > > >
> > > > > > > Am I right in suspecting that efifb should get stripped of
> > > > > > > its
> > > > > > > ownership of the
> > > > > > > fb aperture first, and that if I don't get a black screen
> > > > > > > on
> > > > > > > hw_init failure
> > > > > > > that issue should be the first focus point ?
> > > > > >
> > > > > > Your assumption with the black screen is incorrect. Since the
> > > > > > hardware
> > > > > > works independently even if you kick out efi you still have the
> > > > > > same
> > > > > > screen content, you just can't update it anymore.
> > > > >
> > > > > It's not only that the screen keeps its contents, it's that the
> > > > > dom0
> > > > > happily continues updating it.
> > > >
> > > > If the hypervisor is using efifb, then yes that could be a problem
> > > > as
> > > > the hypervisor could be writing to the efifb resources which ends
> > > > up
> > > > writing to the same physical memory.  That applies to any GPU on
> > > > a
> > > > UEFI system.  You'll need to make sure efifb is not in use in the
> > > > hypervisor.
> > >
> > > That remark evokes several things to me.  First one is that every
> > > time
> > > I've tried booting with efifb disabled in dom0, there were no
> > > visible
> > > improvements in the guest driver 

Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-12-21 Thread Yann Dirson



- Original Message -
> From: "Alex Deucher" 
> To: "Yann Dirson" 
> Cc: "Christian König" , "amd-gfx list" 
> 
> Sent: Tuesday, December 21, 2021 23:31:01
> Subject: Re: Various problems trying to vga-passthrough a Renoir iGPU to a 
> xen/qubes-os hvm
> 
> On Tue, Dec 21, 2021 at 5:12 PM Yann Dirson  wrote:
> >
> >
> > Alex wrote:
> > >
> > > On Sun, Dec 19, 2021 at 11:41 AM Yann Dirson 
> > > wrote:
> > > >
> > > > Christian wrote:
> > > > > Am 19.12.21 um 17:00 schrieb Yann Dirson:
> > > > > > Alex wrote:
> > > > > >> Thinking about this more, I think the problem might be
> > > > > >> related
> > > > > >> to
> > > > > >> CPU
> > > > > >> access to "VRAM".  APUs don't have dedicated VRAM, they
> > > > > >> use a
> > > > > >> reserved
> > > > > >> carve out region at the top of system memory.  For CPU
> > > > > >> access
> > > > > >> to
> > > > > >> this
> > > > > >> memory, we kmap the physical address of the carve out
> > > > > >> region
> > > > > >> of
> > > > > >> system
> > > > > >> memory.  You'll need to make sure that region is
> > > > > >> accessible to
> > > > > >> the
> > > > > >> guest.
> > > > > > So basically, the non-virt flow is: (video?) BIOS
> > > > > > reserves
> > > > > > memory, marks it
> > > > > > as reserved in e820, stores the physaddr somewhere, which
> > > > > > the
> > > > > > GPU
> > > > > > driver gets.
> > > > > > Since I suppose this includes the framebuffer, this
> > > > > > probably
> > > > > > has to
> > > > > > occur around
> > > > > > the moment the driver calls
> > > > > > drm_aperture_remove_conflicting_pci_framebuffers()
> > > > > > (which happens before this hw init step), right ?
> > > > >
> > > > > Well, that's partially correct. The efifb is using the PCIe
> > > > > resources
> > > > > to
> > > > > access the framebuffer and as far as I know we use that one
> > > > > to
> > > > > kick
> > > > > it out.
> > > > >
> > > > > The stolen memory we get over e820/registers is separate to
> > > > > that.
> >
> > How is the stolen memory communicated to the driver ?  That host
> > physical
> > memory probably has to be mapped at the same guest physical address
> > for
> > the magic to work, right ?
> 
> Correct.  The driver reads the physical location of that memory from
> hardware registers.  Removing this chunk of code from gmc_v9_0.c will
> force the driver to use the BAR, but I'm not sure if there are any
> other places in the driver that make assumptions about using the
> physical host address or not on APUs off hand.
> 
> if ((adev->flags & AMD_IS_APU) ||
> (adev->gmc.xgmi.supported &&
>  adev->gmc.xgmi.connected_to_cpu)) {
> adev->gmc.aper_base =
> adev->gfxhub.funcs->get_mc_fb_offset(adev) +
> adev->gmc.xgmi.physical_node_id *
> adev->gmc.xgmi.node_segment_size;
> adev->gmc.aper_size = adev->gmc.real_vram_size;
> }
> 
> 
> 
> >
> > > > >
> > > > > > ... which brings me to a point that's been puzzling me for
> > > > > > some
> > > > > > time, which is
> > > > > > that as the hw init fails, the efifb driver is still using
> > > > > > the
> > > > > > framebuffer.
> > > > >
> > > > > No, it isn't. You are probably just still seeing the same
> > > > > screen.
> > > > >
> > > > > The issue is most likely that while efi was kicked out nobody
> > > > > re-programmed the display hardware to show something
> > > > > different.
> > > > >
> > > > > > Am I right in suspecting that efifb should get stripped of
> > > > > > its
> > > > > > ownership of the
> > > > > > fb aperture first, and that if I don't get a black screen
> > > > > > on
> > > > > > hw_init failure
> > > > > > that issue should be the first focus point ?
> > > > >
> > > > > Your assumption with the black screen is incorrect. Since the
> > > > > hardware
> > > > > works independently even if you kick out efi you still have the
> > > > > same
> > > > > screen content, you just can't update it anymore.
> > > >
> > > > It's not only that the screen keeps its contents, it's that the
> > > > dom0
> > > > happily continues updating it.
> > >
> > > If the hypervisor is using efifb, then yes that could be a problem
> > > as
> > > the hypervisor could be writing to the efifb resources which ends
> > > up
> > > writing to the same physical memory.  That applies to any GPU on
> > > a
> > > UEFI system.  You'll need to make sure efifb is not in use in the
> > > hypervisor.
> >
> > That remark evokes several things to me.  First one is that every
> > time
> > I've tried booting with efifb disabled in dom0, there were no
> > visible
> > improvements in the guest driver - i.e. I really have to dig into how
> > vram mapping
> > is performed and check things are as expected anyway.
> 
> Ultimately you end up at the same physical memory.  efifb uses the
> PCI
> BAR which points to the same physical memory that the driver directly
> maps.
> 
> >
> > The other is that, when dom0 cannot use efifb, 

Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-12-21 Thread Alex Deucher
On Tue, Dec 21, 2021 at 5:12 PM Yann Dirson  wrote:
>
>
> Alex wrote:
> >
> > On Sun, Dec 19, 2021 at 11:41 AM Yann Dirson  wrote:
> > >
> > > Christian wrote:
> > > > Am 19.12.21 um 17:00 schrieb Yann Dirson:
> > > > > Alex wrote:
> > > > >> Thinking about this more, I think the problem might be related
> > > > >> to
> > > > >> CPU
> > > > >> access to "VRAM".  APUs don't have dedicated VRAM, they use a
> > > > >> reserved
> > > > >> carve out region at the top of system memory.  For CPU access
> > > > >> to
> > > > >> this
> > > > >> memory, we kmap the physical address of the carve out region
> > > > >> of
> > > > >> system
> > > > >> memory.  You'll need to make sure that region is accessible to
> > > > >> the
> > > > >> guest.
> > > > > So basically, the non-virt flow is: (video?) BIOS reserves
> > > > > memory, marks it
> > > > > as reserved in e820, stores the physaddr somewhere, which the
> > > > > GPU
> > > > > driver gets.
> > > > > Since I suppose this includes the framebuffer, this probably
> > > > > has to
> > > > > occur around
> > > > > the moment the driver calls
> > > > > drm_aperture_remove_conflicting_pci_framebuffers()
> > > > > (which happens before this hw init step), right ?
> > > >
> > > > Well, that's partially correct. The efifb is using the PCIe
> > > > resources
> > > > to
> > > > access the framebuffer and as far as I know we use that one to
> > > > kick
> > > > it out.
> > > >
> > > > The stolen memory we get over e820/registers is separate to that.
>
> How is the stolen memory communicated to the driver ?  That host physical
> memory probably has to be mapped at the same guest physical address for
> the magic to work, right ?

Correct.  The driver reads the physical location of that memory from
hardware registers.  Removing this chunk of code from gmc_v9_0.c will
force the driver to use the BAR, but I'm not sure if there are any
other places in the driver that make assumptions about using the
physical host address or not on APUs off hand.

if ((adev->flags & AMD_IS_APU) ||
(adev->gmc.xgmi.supported &&
 adev->gmc.xgmi.connected_to_cpu)) {
adev->gmc.aper_base =
adev->gfxhub.funcs->get_mc_fb_offset(adev) +
adev->gmc.xgmi.physical_node_id *
adev->gmc.xgmi.node_segment_size;
adev->gmc.aper_size = adev->gmc.real_vram_size;
}
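
The register read behind this is the get_mc_fb_offset() callback; roughly, quoting gfxhub_v1_0.c from memory (treat it as a sketch, the exact register and ASIC variant used for Renoir may differ):

u64 gfxhub_v1_0_get_mc_fb_offset(struct amdgpu_device *adev)
{
	/* host physical base of the stolen/carve out region, in 16MB units */
	return (u64)RREG32_SOC15(GC, 0, mmMC_VM_FB_OFFSET) << 24;
}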



>
> > > >
> > > > > ... which brings me to a point that's been puzzling me for some
> > > > > time, which is
> > > > > that as the hw init fails, the efifb driver is still using the
> > > > > framebuffer.
> > > >
> > > > No, it isn't. You are probably just still seeing the same screen.
> > > >
> > > > The issue is most likely that while efi was kicked out nobody
> > > > re-programmed the display hardware to show something different.
> > > >
> > > > > Am I right in suspecting that efifb should get stripped of its
> > > > > ownership of the
> > > > > fb aperture first, and that if I don't get a black screen on
> > > > > hw_init failure
> > > > > that issue should be the first focus point ?
> > > >
> > > > Your assumption with the black screen is incorrect. Since the
> > > > hardware
> > > > works independently even if you kick out efi you still have the
> > > > same
> > > > screen content, you just can't update it anymore.
> > >
> > > It's not only that the screen keeps its contents, it's that the
> > > dom0
> > > happily continues updating it.
> >
> > If the hypervisor is using efifb, then yes that could be a problem as
> > the hypervisor could be writing to the efifb resources which ends up
> > writing to the same physical memory.  That applies to any GPU on a
> > UEFI system.  You'll need to make sure efifb is not in use in the
> > hypervisor.
>
> That remark evokes several things to me.  First one is that every time
> I've tried booting with efifb disabled in dom0, there were no visible
> improvements in the guest driver - i.e. I really have to dig into how vram mapping
> is performed and check things are as expected anyway.

Ultimately you end up at the same physical memory.  efifb uses the PCI
BAR which points to the same physical memory that the driver directly
maps.
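
Concretely, the driver's CPU mapping is created from whatever gmc.aper_base ended up being (the carve out's physical address on an APU, the BAR on a dGPU); roughly, from amdgpu_ttm_init() as I remember it (sketch only, the exact code and mapping flags may differ):

#ifdef CONFIG_64BIT
	/* CPU-visible mapping of "VRAM": aper_base was chosen in
	 * gmc_v9_0_mc_init() (carve out physical address vs. PCI BAR) */
	adev->mman.aper_base_kaddr = ioremap_wc(adev->gmc.aper_base,
						adev->gmc.visible_vram_size);
#endif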

>
> The other is that, when dom0 cannot use efifb, entering a LUKS key is
> suddenly less user-friendly.  But in theory I'd think we could overcome
> this by letting dom0 use efifb until ready to start the guest; a simple
> driver unbind at the right moment should be expected to work, right ?
> Going further and allowing the guest to use efifb on its own could
> possibly be more tricky (starting with a different state?) but does
> not seem to sound completely outlandish either - or does it ?
>

efifb just takes whatever hardware state the GOP driver in the pre-OS
environment left the GPU in.  Once you have a driver loaded in the OS,
that state is gone so I don't see much value in using efifb once you
have a real driver in the mix.  If you want a console 

Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-12-21 Thread Yann Dirson


Alex wrote:
> 
> On Sun, Dec 19, 2021 at 11:41 AM Yann Dirson  wrote:
> >
> > Christian wrote:
> > > Am 19.12.21 um 17:00 schrieb Yann Dirson:
> > > > Alex wrote:
> > > >> Thinking about this more, I think the problem might be related
> > > >> to
> > > >> CPU
> > > >> access to "VRAM".  APUs don't have dedicated VRAM, they use a
> > > >> reserved
> > > >> carve out region at the top of system memory.  For CPU access
> > > >> to
> > > >> this
> > > >> memory, we kmap the physical address of the carve out region
> > > >> of
> > > >> system
> > > >> memory.  You'll need to make sure that region is accessible to
> > > >> the
> > > >> guest.
> > > > So basically, the non-virt flow is: (video?) BIOS reserves
> > > > memory, marks it
> > > > as reserved in e820, stores the physaddr somewhere, which the
> > > > GPU
> > > > driver gets.
> > > > Since I suppose this includes the framebuffer, this probably
> > > > has to
> > > > occur around
> > > > the moment the driver calls
> > > > drm_aperture_remove_conflicting_pci_framebuffers()
> > > > (which happens before this hw init step), right ?
> > >
> > > Well, that's partially correct. The efifb is using the PCIe
> > > resources
> > > to
> > > access the framebuffer and as far as I know we use that one to
> > > kick
> > > it out.
> > >
> > > The stolen memory we get over e820/registers is separate to that.

How is the stolen memory communicated to the driver ?  That host physical
memory probably has to be mapped at the same guest physical address for
the magic to work, right ?

> > >
> > > > ... which brings me to a point that's been puzzling me for some
> > > > time, which is
> > > > that as the hw init fails, the efifb driver is still using the
> > > > framebuffer.
> > >
> > > No, it isn't. You are probably just still seeing the same screen.
> > >
> > > The issue is most likely that while efi was kicked out nobody
> > > re-programmed the display hardware to show something different.
> > >
> > > > Am I right in suspecting that efifb should get stripped of its
> > > > ownership of the
> > > > fb aperture first, and that if I don't get a black screen on
> > > > hw_init failure
> > > > that issue should be the first focus point ?
> > >
> > > Your assumption with the black screen is incorrect. Since the
> > > hardware
> > > works independently even if you kick out efi you still have the
> > > same
> > > screen content, you just can't update it anymore.
> >
> > It's not only that the screen keeps its contents, it's that the
> > dom0
> > happily continues updating it.
> 
> If the hypervisor is using efifb, then yes that could be a problem as
> the hypervisor could be writing to the efifb resources which ends up
> writing to the same physical memory.  That applies to any GPU on a
> UEFI system.  You'll need to make sure efifb is not in use in the
> hypervisor.

That remark evokes several things to me.  First one is that every time
I've tried booting with efifb disabled in dom0, there were no visible
improvements in the guest driver - i.e. I really have to dig into how vram mapping
is performed and check things are as expected anyway.
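
As a first step I could just dump what the guest driver computed and compare it against the host side; something like this hypothetical debug print (not in the driver, field names from memory):

	/* hypothetical debug aid: log the aperture and vram base the guest
	 * driver ended up with, to compare with the host e820/BAR layout */
	dev_info(adev->dev,
		 "aper_base=0x%llx aper_size=0x%llx vram_base_offset=0x%llx\n",
		 (u64)adev->gmc.aper_base, (u64)adev->gmc.aper_size,
		 adev->vm_manager.vram_base_offset);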

The other is that, when dom0 cannot use efifb, entering a LUKS key is
suddenly less user-friendly.  But in theory I'd think we could overcome
this by letting dom0 use efifb until ready to start the guest; a simple
driver unbind at the right moment should be expected to work, right ?
Going further and allowing the guest to use efifb on its own could
possibly be more tricky (starting with a different state?) but does
not seem to sound completely outlandish either - or does it ?

> 
> Alex
> 
> 
> >
> > > But putting efi aside, what Alex pointed out pretty much breaks
> > > your
> > > neck trying to forward the device. You maybe could try to hack
> > > the
> > > driver to use the PCIe BAR for framebuffer access, but that might
> > > be
> > > quite a bit slower.
> > >
> > > Regards,
> > > Christian.
> > >
> > > >
> > > >> Alex
> > > >>
> > > >> On Mon, Dec 13, 2021 at 3:29 PM Alex Deucher
> > > >> 
> > > >> wrote:
> > > >>> On Sun, Dec 12, 2021 at 5:19 PM Yann Dirson 
> > > >>> wrote:
> > >  Alex wrote:
> > > > On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson
> > > > 
> > > > wrote:
> > > >> Hi Alex,
> > > >>
> > > >>> We have not validated virtualization of our integrated
> > > >>> GPUs.  I
> > > >>> don't
> > > >>> know that it will work at all.  We had done a bit of
> > > >>> testing but
> > > >>> ran
> > > >>> into the same issues with the PSP, but never had a chance
> > > >>> to
> > > >>> debug
> > > >>> further because this feature is not productized.
> > > >> ...
> > > >>> You need a functional PSP to get the GPU driver up and
> > > >>> running.
> > > >> Ah, thanks for the hint :)
> > > >>
> > > >> I guess that if I want to have any chance to get the PSP
> > > >> working
> > > >> I'm
> > > >> going to need more details on it.  A quick search some
> > > 

Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-12-21 Thread Alex Deucher
On Sun, Dec 19, 2021 at 11:41 AM Yann Dirson  wrote:
>
> Christian wrote:
> > Am 19.12.21 um 17:00 schrieb Yann Dirson:
> > > Alex wrote:
> > >> Thinking about this more, I think the problem might be related to
> > >> CPU
> > >> access to "VRAM".  APUs don't have dedicated VRAM, they use a
> > >> reserved
> > >> carve out region at the top of system memory.  For CPU access to
> > >> this
> > >> memory, we kmap the physical address of the carve out region of
> > >> system
> > >> memory.  You'll need to make sure that region is accessible to the
> > >> guest.
> > > So basically, the non-virt flow is: (video?) BIOS reserves
> > > memory, marks it
> > > as reserved in e820, stores the physaddr somewhere, which the GPU
> > > driver gets.
> > > Since I suppose this includes the framebuffer, this probably has to
> > > occur around
> > > the moment the driver calls
> > > drm_aperture_remove_conflicting_pci_framebuffers()
> > > (which happens before this hw init step), right ?
> >
> > Well, that's partially correct. The efifb is using the PCIe resources
> > to
> > access the framebuffer and as far as I know we use that one to kick
> > it out.
> >
> > The stolen memory we get over e820/registers is separate to that.
> >
> > > ... which brings me to a point that's been puzzling me for some
> > > time, which is
> > > that as the hw init fails, the efifb driver is still using the
> > > framebuffer.
> >
> > No, it isn't. You are probably just still seeing the same screen.
> >
> > The issue is most likely that while efi was kicked out nobody
> > re-programmed the display hardware to show something different.
> >
> > > Am I right in suspecting that efifb should get stripped of its
> > > ownership of the
> > > fb aperture first, and that if I don't get a black screen on
> > > hw_init failure
> > > that issue should be the first focus point ?
> >
> > Your assumption with the black screen is incorrect. Since the hardware
> > works independently even if you kick out efi you still have the same
> > screen content, you just can't update it anymore.
>
> It's not only that the screen keeps its contents, it's that the dom0
> happily continues updating it.

If the hypervisor is using efifb, then yes that could be a problem as
the hypervisor could be writing to the efifb resources which ends up
writing to the same physical memory.  That applies to any GPU on a
UEFI system.  You'll need to make sure efifb is not in use in the
hypervisor.

Alex


>
> > But putting efi aside, what Alex pointed out pretty much breaks your
> > neck trying to forward the device. You maybe could try to hack the
> > driver to use the PCIe BAR for framebuffer access, but that might be
> > quite a bit slower.
> >
> > Regards,
> > Christian.
> >
> > >
> > >> Alex
> > >>
> > >> On Mon, Dec 13, 2021 at 3:29 PM Alex Deucher
> > >> 
> > >> wrote:
> > >>> On Sun, Dec 12, 2021 at 5:19 PM Yann Dirson 
> > >>> wrote:
> >  Alex wrote:
> > > On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson 
> > > wrote:
> > >> Hi Alex,
> > >>
> > >>> We have not validated virtualization of our integrated
> > >>> GPUs.  I
> > >>> don't
> > >>> know that it will work at all.  We had done a bit of
> > >>> testing but
> > >>> ran
> > >>> into the same issues with the PSP, but never had a chance
> > >>> to
> > >>> debug
> > >>> further because this feature is not productized.
> > >> ...
> > >>> You need a functional PSP to get the GPU driver up and
> > >>> running.
> > >> Ah, thanks for the hint :)
> > >>
> > >> I guess that if I want to have any chance to get the PSP
> > >> working
> > >> I'm
> > >> going to need more details on it.  A quick search some time
> > >> ago
> > >> mostly
> > >> brought reverse-engineering work, rather than official AMD
> > >> doc.
> > >>   Are
> > >> there some AMD resources I missed ?
> > > The driver code is pretty much it.
> >  Let's try to shed some more light on how things work, taking as
> >  excuse
> >  psp_v12_0_ring_create().
> > 
> >  First, register access through [RW]REG32_SOC15() is implemented
> >  in
> >  terms of __[RW]REG32_SOC15_RLC__(), which is basically a
> >  [RW]REG32(),
> >  except it has to be more complex in the SR-IOV case.
> >  Has the RLC anything to do with SR-IOV ?
> > >>> When running the driver on a SR-IOV virtual function (VF), some
> > >>> registers are not available directly via the VF's MMIO aperture
> > >>> so
> > >>> they need to go through the RLC.  For bare metal or passthrough
> > >>> this
> > >>> is not relevant.
> > >>>
> >  It accesses registers in the MMIO range of the MP0 IP, and the
> >  "MP0"
> >  name correlates highly with MMIO accesses in PSP-handling code.
> >  Is "MP0" another name for PSP (and "MP1" for SMU) ?  The MP0
> >  version
> > >>> Yes.
> > >>>
> >  reported at v11.0.3 by discovery seems to contradict the use of
> >  

Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-12-19 Thread Yann Dirson
Christian wrote:
> Am 19.12.21 um 17:00 schrieb Yann Dirson:
> > Alex wrote:
> >> Thinking about this more, I think the problem might be related to
> >> CPU
> >> access to "VRAM".  APUs don't have dedicated VRAM, they use a
> >> reserved
> >> carve out region at the top of system memory.  For CPU access to
> >> this
> >> memory, we kmap the physical address of the carve out region of
> >> system
> >> memory.  You'll need to make sure that region is accessible to the
> >> guest.
> > So basically, the non-virt flow is: (video?) BIOS reserves
> > memory, marks it
> > as reserved in e820, stores the physaddr somewhere, which the GPU
> > driver gets.
> > Since I suppose this includes the framebuffer, this probably has to
> > occur around
> > the moment the driver calls
> > drm_aperture_remove_conflicting_pci_framebuffers()
> > (which happens before this hw init step), right ?
> 
> Well, that's partially correct. The efifb is using the PCIe resources
> to
> access the framebuffer and as far as I know we use that one to kick
> it out.
> 
> The stolen memory we get over e820/registers is separate to that.
> 
> > ... which brings me to a point that's been puzzling me for some
> > time, which is
> > that as the hw init fails, the efifb driver is still using the
> > framebuffer.
> 
> No, it isn't. You are probably just still seeing the same screen.
> 
> The issue is most likely that while efi was kicked out nobody
> re-programmed the display hardware to show something different.
> 
> > Am I right in suspecting that efifb should get stripped of its
> > ownership of the
> > fb aperture first, and that if I don't get a black screen on
> > hw_init failure
> > that issue should be the first focus point ?
> 
> Your assumption about the black screen is incorrect. Since the hardware
> works independently, even if you kick out efi you still have the same
> screen content; you just can't update it anymore.

It's not only that the screen keeps its contents, it's that the dom0
happily continues updating it.

> But putting efi aside, what Alex pointed out pretty much breaks your
> neck trying to forward the device. You could maybe try to hack the
> driver to use the PCIe BAR for framebuffer access, but that might be
> quite a bit slower.
> 
> Regards,
> Christian.
> 
> >
> >> Alex
> >>
> >> On Mon, Dec 13, 2021 at 3:29 PM Alex Deucher
> >> 
> >> wrote:
> >>> On Sun, Dec 12, 2021 at 5:19 PM Yann Dirson 
> >>> wrote:
>  Alex wrote:
> > On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson 
> > wrote:
> >> Hi Alex,
> >>
> >>> We have not validated virtualization of our integrated
> >>> GPUs.  I
> >>> don't
> >>> know that it will work at all.  We had done a bit of
> >>> testing but
> >>> ran
> >>> into the same issues with the PSP, but never had a chance
> >>> to
> >>> debug
> >>> further because this feature is not productized.
> >> ...
> >>> You need a functional PSP to get the GPU driver up and
> >>> running.
> >> Ah, thanks for the hint :)
> >>
> >> I guess that if I want to have any chance to get the PSP
> >> working
> >> I'm
> >> going to need more details on it.  A quick search some time
> >> ago
> >> mostly
> >> brought reverse-engineering work, rather than official AMD
> >> doc.
> >>   Are
> >> there some AMD resources I missed ?
> > The driver code is pretty much it.
>  Let's try to shed some more light on how things work, taking as
>  excuse
>  psp_v12_0_ring_create().
> 
>  First, register access through [RW]REG32_SOC15() is implemented
>  in
>  terms of __[RW]REG32_SOC15_RLC__(), which is basically a
>  [RW]REG32(),
>  except it has to be more complex in the SR-IOV case.
>  Has the RLC anything to do with SR-IOV ?
> >>> When running the driver on a SR-IOV virtual function (VF), some
> >>> registers are not available directly via the VF's MMIO aperture
> >>> so
> >>> they need to go through the RLC.  For bare metal or passthrough
> >>> this
> >>> is not relevant.
> >>>
>  It accesses registers in the MMIO range of the MP0 IP, and the
>  "MP0"
>  name correlates highly with MMIO accesses in PSP-handling code.
>  Is "MP0" another name for PSP (and "MP1" for SMU) ?  The MP0
>  version
> >>> Yes.
> >>>
>  reported at v11.0.3 by discovery seems to contradict the use of
>  v12.0
>  for RENOIR as set by soc15_set_ip_blocks(), or do I miss
>  something ?
> >>> Typo in the ip discovery table on renoir.
> >>>
>  More generally (and mostly out of curiosity while we're at it),
>  do we
>  have a way to match IPs listed at discovery time with the ones
>  used
>  in the driver ?
> >>> In general, barring typos, the code is shared at the major
> >>> version
> >>> level.  The actual code may or may not need changes to handle
> >>> minor
> >>> revision changes in an IP.  The driver maps the IP versions from
> >>> the
> >>> ip discovery 

Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-12-19 Thread Christian König

Am 19.12.21 um 17:00 schrieb Yann Dirson:

Alex wrote:

Thinking about this more, I think the problem might be related to CPU
access to "VRAM".  APUs don't have dedicated VRAM, they use a
reserved
carve out region at the top of system memory.  For CPU access to this
memory, we kmap the physical address of the carve out region of
system
memory.  You'll need to make sure that region is accessible to the
guest.

So basically, the non-virt flow is: (video?) BIOS reserves memory, marks it
as reserved in e820, stores the physaddr somewhere, which the GPU driver gets.
Since I suppose this includes the framebuffer, this probably has to occur around
the moment the driver calls drm_aperture_remove_conflicting_pci_framebuffers()
(which happens before this hw init step), right ?


Well, that's partially correct. The efifb is using the PCIe resources to
access the framebuffer and as far as I know we use that one to kick it out.


The stolen memory we get over e820/registers is separate to that.


... which brings me to a point that's been puzzling me for some time, which is
that as the hw init fails, the efifb driver is still using the framebuffer.


No, it isn't. You are probably just still seeing the same screen.

The issue is most likely that while efi was kicked out nobody 
re-programmed the display hardware to show something different.



Am I right in suspecting that efifb should get stripped of its ownership of the
fb aperture first, and that if I don't get a black screen on hw_init failure
that issue should be the first focus point ?


Your assumption about the black screen is incorrect. Since the hardware
works independently, even if you kick out efi you still have the same
screen content; you just can't update it anymore.


But putting efi aside, what Alex pointed out pretty much breaks your
neck trying to forward the device. You could maybe try to hack the
driver to use the PCIe BAR for framebuffer access, but that might be 
quite a bit slower.


Regards,
Christian.




Alex

On Mon, Dec 13, 2021 at 3:29 PM Alex Deucher 
wrote:

On Sun, Dec 12, 2021 at 5:19 PM Yann Dirson 
wrote:

Alex wrote:

On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson 
wrote:

Hi Alex,


We have not validated virtualization of our integrated
GPUs.  I
don't
know that it will work at all.  We had done a bit of
testing but
ran
into the same issues with the PSP, but never had a chance
to
debug
further because this feature is not productized.

...

You need a functional PSP to get the GPU driver up and
running.

Ah, thanks for the hint :)

I guess that if I want to have any chance to get the PSP
working
I'm
going to need more details on it.  A quick search some time
ago
mostly
brought reverse-engineering work, rather than official AMD
doc.
  Are
there some AMD resources I missed ?

The driver code is pretty much it.

Let's try to shed some more light on how things work, taking as
excuse
psp_v12_0_ring_create().

First, register access through [RW]REG32_SOC15() is implemented
in
terms of __[RW]REG32_SOC15_RLC__(), which is basically a
[RW]REG32(),
except it has to be more complex in the SR-IOV case.
Has the RLC anything to do with SR-IOV ?

When running the driver on a SR-IOV virtual function (VF), some
registers are not available directly via the VF's MMIO aperture so
they need to go through the RLC.  For bare metal or passthrough
this
is not relevant.


It accesses registers in the MMIO range of the MP0 IP, and the
"MP0"
name correlates highly with MMIO accesses in PSP-handling code.
Is "MP0" another name for PSP (and "MP1" for SMU) ?  The MP0
version

Yes.


reported at v11.0.3 by discovery seems to contradict the use of
v12.0
for RENOIR as set by soc15_set_ip_blocks(), or do I miss
something ?

Typo in the ip discovery table on renoir.


More generally (and mostly out of curiosity while we're at it),
do we
have a way to match IPs listed at discovery time with the ones
used
in the driver ?

In general, barring typos, the code is shared at the major version
level.  The actual code may or may not need changes to handle minor
revision changes in an IP.  The driver maps the IP versions from
the
ip discovery table to the code contained in the driver.


---

As for the register names, maybe we could have a short
explanation of
how they are structured ?  Eg. mmMP0_SMN_C2PMSG_69: that seems to
be
a MMIO register named "C2PMSG_69" in the "MP0" IP, but I'm not
sure
of the "SMN" part -- that could refer to the "System Management
Network",
described in [0] as an internal bus.  Are we accessing this
register
through this SMN ?

These registers are just mailboxes for the PSP firmware.  All of
the
C2PMSG registers functionality is defined by the PSP firmware.
  They
are basically scratch registers used to communicate between the
driver
and the PSP firmware.




  On APUs, the PSP is shared with
the CPU and the rest of the platform.  The GPU driver just
interacts
with it for a few specific tasks:
1. Loading Trusted Applications (e.g., trusted firmware

Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-12-19 Thread Yann Dirson
Alex wrote:
> Thinking about this more, I think the problem might be related to CPU
> access to "VRAM".  APUs don't have dedicated VRAM, they use a
> reserved
> carve out region at the top of system memory.  For CPU access to this
> memory, we kmap the physical address of the carve out region of
> system
> memory.  You'll need to make sure that region is accessible to the
> guest.

So basically, the non-virt flow is: (video?) BIOS reserves memory, marks it
as reserved in e820, stores the physaddr somewhere, which the GPU driver gets.
Since I suppose this includes the framebuffer, this probably has to occur around
the moment the driver calls drm_aperture_remove_conflicting_pci_framebuffers()
(which happens before this hw init step), right ?

... which brings me to a point that's been puzzling me for some time, which is
that as the hw init fails, the efifb driver is still using the framebuffer.

Am I right in suspecting that efifb should get stripped of its ownership of the
fb aperture first, and that if I don't get a black screen on hw_init failure
that issue should be the first focus point ?

> 
> Alex
> 
> On Mon, Dec 13, 2021 at 3:29 PM Alex Deucher 
> wrote:
> >
> > On Sun, Dec 12, 2021 at 5:19 PM Yann Dirson 
> > wrote:
> > >
> > > Alex wrote:
> > > > On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson 
> > > > wrote:
> > > > >
> > > > > Hi Alex,
> > > > >
> > > > > > We have not validated virtualization of our integrated
> > > > > > GPUs.  I
> > > > > > don't
> > > > > > know that it will work at all.  We had done a bit of
> > > > > > testing but
> > > > > > ran
> > > > > > into the same issues with the PSP, but never had a chance
> > > > > > to
> > > > > > debug
> > > > > > further because this feature is not productized.
> > > > > ...
> > > > > > You need a functional PSP to get the GPU driver up and
> > > > > > running.
> > > > >
> > > > > Ah, thanks for the hint :)
> > > > >
> > > > > I guess that if I want to have any chance to get the PSP
> > > > > working
> > > > > I'm
> > > > > going to need more details on it.  A quick search some time
> > > > > ago
> > > > > mostly
> > > > > brought reverse-engineering work, rather than official AMD
> > > > > doc.
> > > > >  Are
> > > > > there some AMD resources I missed ?
> > > >
> > > > The driver code is pretty much it.
> > >
> > > Let's try to shed some more light on how things work, taking as
> > > excuse
> > > psp_v12_0_ring_create().
> > >
> > > First, register access through [RW]REG32_SOC15() is implemented
> > > in
> > > terms of __[RW]REG32_SOC15_RLC__(), which is basically a
> > > [RW]REG32(),
> > > except it has to be more complex in the SR-IOV case.
> > > Has the RLC anything to do with SR-IOV ?
> >
> > When running the driver on a SR-IOV virtual function (VF), some
> > registers are not available directly via the VF's MMIO aperture so
> > they need to go through the RLC.  For bare metal or passthrough
> > this
> > is not relevant.
> >
> > >
> > > It accesses registers in the MMIO range of the MP0 IP, and the
> > > "MP0"
> > > name correlates highly with MMIO accesses in PSP-handling code.
> > > Is "MP0" another name for PSP (and "MP1" for SMU) ?  The MP0
> > > version
> >
> > Yes.
> >
> > > reported at v11.0.3 by discovery seems to contradict the use of
> > > v12.0
> > > for RENOIR as set by soc15_set_ip_blocks(), or do I miss
> > > something ?
> >
> > Typo in the ip discovery table on renoir.
> >
> > >
> > > More generally (and mostly out of curiosity while we're at it),
> > > do we
> > > have a way to match IPs listed at discovery time with the ones
> > > used
> > > in the driver ?
> >
> > In general, barring typos, the code is shared at the major version
> > level.  The actual code may or may not need changes to handle minor
> > revision changes in an IP.  The driver maps the IP versions from
> > the
> > ip discovery table to the code contained in the driver.
> >
> > >
> > > ---
> > >
> > > As for the register names, maybe we could have a short
> > > explanation of
> > > how they are structured ?  Eg. mmMP0_SMN_C2PMSG_69: that seems to
> > > be
> > > a MMIO register named "C2PMSG_69" in the "MP0" IP, but I'm not
> > > sure
> > > of the "SMN" part -- that could refer to the "System Management
> > > Network",
> > > described in [0] as an internal bus.  Are we accessing this
> > > register
> > > through this SMN ?
> >
> > These registers are just mailboxes for the PSP firmware.  All of
> > the
> > C2PMSG registers functionality is defined by the PSP firmware.
> >  They
> > are basically scratch registers used to communicate between the
> > driver
> > and the PSP firmware.
> >
> > >
> > >
> > > >  On APUs, the PSP is shared with
> > > > the CPU and the rest of the platform.  The GPU driver just
> > > > interacts
> > > > with it for a few specific tasks:
> > > > 1. Loading Trusted Applications (e.g., trusted firmware
> > > > applications
> > > > that run on the PSP for specific functionality, e.g., HDCP and
> > > > content
> > > > protection, etc.)
> 

Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-12-15 Thread Alex Deucher
Thinking about this more, I think the problem might be related to CPU
access to "VRAM".  APUs don't have dedicated VRAM, they use a reserved
carve out region at the top of system memory.  For CPU access to this
memory, we kmap the physical address of the carve out region of system
memory.  You'll need to make sure that region is accessible to the
guest.
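
As a purely illustrative aid (not something the driver itself does), the
reserved ranges can be listed from the host to see where that top-of-memory
carve out sits; whether the stolen memory shows up as a plain "Reserved" e820
range is an assumption:

/* iomem-reserved.c: print reserved ranges from /proc/iomem (run as root,
 * otherwise the addresses read back as zeroes). */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/iomem", "r");

	if (!f) {
		perror("/proc/iomem");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		if (strstr(line, "eserved"))	/* matches "Reserved" and "reserved" */
			fputs(line, stdout);
	fclose(f);
	return 0;
}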

Alex

On Mon, Dec 13, 2021 at 3:29 PM Alex Deucher  wrote:
>
> On Sun, Dec 12, 2021 at 5:19 PM Yann Dirson  wrote:
> >
> > Alex wrote:
> > > On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson  wrote:
> > > >
> > > > Hi Alex,
> > > >
> > > > > We have not validated virtualization of our integrated GPUs.  I
> > > > > don't
> > > > > know that it will work at all.  We had done a bit of testing but
> > > > > ran
> > > > > into the same issues with the PSP, but never had a chance to
> > > > > debug
> > > > > further because this feature is not productized.
> > > > ...
> > > > > You need a functional PSP to get the GPU driver up and running.
> > > >
> > > > Ah, thanks for the hint :)
> > > >
> > > > I guess that if I want to have any chance to get the PSP working
> > > > I'm
> > > > going to need more details on it.  A quick search some time ago
> > > > mostly
> > > > brought reverse-engineering work, rather than official AMD doc.
> > > >  Are
> > > > there some AMD resources I missed ?
> > >
> > > The driver code is pretty much it.
> >
> > Let's try to shed some more light on how things work, taking as excuse
> > psp_v12_0_ring_create().
> >
> > First, register access through [RW]REG32_SOC15() is implemented in
> > terms of __[RW]REG32_SOC15_RLC__(), which is basically a [RW]REG32(),
> > except it has to be more complex in the SR-IOV case.
> > Has the RLC anything to do with SR-IOV ?
>
> When running the driver on a SR-IOV virtual function (VF), some
> registers are not available directly via the VF's MMIO aperture so
> they need to go through the RLC.  For bare metal or passthrough this
> is not relevant.
>
> >
> > It accesses registers in the MMIO range of the MP0 IP, and the "MP0"
> > name correlates highly with MMIO accesses in PSP-handling code.
> > Is "MP0" another name for PSP (and "MP1" for SMU) ?  The MP0 version
>
> Yes.
>
> > reported at v11.0.3 by discovery seems to contradict the use of v12.0
> > for RENOIR as set by soc15_set_ip_blocks(), or do I miss something ?
>
> Typo in the ip discovery table on renoir.
>
> >
> > More generally (and mostly out of curiosity while we're at it), do we
> > have a way to match IPs listed at discovery time with the ones used
> > in the driver ?
>
> In general, barring typos, the code is shared at the major version
> level.  The actual code may or may not need changes to handle minor
> revision changes in an IP.  The driver maps the IP versions from the
> ip discovery table to the code contained in the driver.
>
> >
> > ---
> >
> > As for the register names, maybe we could have a short explanation of
> > how they are structured ?  Eg. mmMP0_SMN_C2PMSG_69: that seems to be
> > a MMIO register named "C2PMSG_69" in the "MP0" IP, but I'm not sure
> > of the "SMN" part -- that could refer to the "System Management Network",
> > described in [0] as an internal bus.  Are we accessing this register
> > through this SMN ?
>
> These registers are just mailboxes for the PSP firmware.  All of the
> C2PMSG registers functionality is defined by the PSP firmware.  They
> are basically scratch registers used to communicate between the driver
> and the PSP firmware.
>
> >
> >
> > >  On APUs, the PSP is shared with
> > > the CPU and the rest of the platform.  The GPU driver just interacts
> > > with it for a few specific tasks:
> > > 1. Loading Trusted Applications (e.g., trusted firmware applications
> > > that run on the PSP for specific functionality, e.g., HDCP and
> > > content
> > > protection, etc.)
> > > 2. Validating and loading firmware for other engines on the SoC.
> > >  This
> > > is required to use those engines.
> >
> > Trying to understand in more details how we start the PSP up, I noticed
> > that psp_v12_0 has support for loading a sOS firmware, but never calls
> > init_sos_microcode() - and anyway there is no sos firmware for renoir
> > and green_sardine, which seem to be the only ASICs with this PSP version.
> > Is it something that's just not been completely wired up yet ?
>
> On APUs, the PSP is shared with the CPU so the PSP firmware is part of
> the sbios image.  The driver doesn't load it.  We only load it on
> dGPUs where the driver is responsible for the chip initialization.
>
> >
> > That also rings a bell, that we have nothing about Secure OS in the doc
> > yet (not even the acronym in the glossary).
> >
> >
> > > I'm not too familiar with the PSP's path to memory from the GPU
> > > perspective.  IIRC, most memory used by the PSP goes through carve
> > > out
> > > "vram" on APUs so it should work, but I would double check if there
> > > are any system memory allocations that used to interact with the PSP
> > > 

Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-12-13 Thread Alex Deucher
On Sun, Dec 12, 2021 at 5:19 PM Yann Dirson  wrote:
>
> Alex wrote:
> > On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson  wrote:
> > >
> > > Hi Alex,
> > >
> > > > We have not validated virtualization of our integrated GPUs.  I
> > > > don't
> > > > know that it will work at all.  We had done a bit of testing but
> > > > ran
> > > > into the same issues with the PSP, but never had a chance to
> > > > debug
> > > > further because this feature is not productized.
> > > ...
> > > > You need a functional PSP to get the GPU driver up and running.
> > >
> > > Ah, thanks for the hint :)
> > >
> > > I guess that if I want to have any chance to get the PSP working
> > > I'm
> > > going to need more details on it.  A quick search some time ago
> > > mostly
> > > brought reverse-engineering work, rather than official AMD doc.
> > >  Are
> > > there some AMD resources I missed ?
> >
> > The driver code is pretty much it.
>
> Let's try to shed some more light on how things work, taking as excuse
> psp_v12_0_ring_create().
>
> First, register access through [RW]REG32_SOC15() is implemented in
> terms of __[RW]REG32_SOC15_RLC__(), which is basically a [RW]REG32(),
> except it has to be more complex in the SR-IOV case.
> Has the RLC anything to do with SR-IOV ?

When running the driver on a SR-IOV virtual function (VF), some
registers are not available directly via the VF's MMIO aperture so
they need to go through the RLC.  For bare metal or passthrough this
is not relevant.
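
To illustrate the split (the struct and hook names below are invented
placeholders, not amdgpu's actual ones), the idea is roughly:

/* Hedged sketch of the direct-vs-RLC register write path described above. */
#include <linux/io.h>
#include <linux/types.h>

struct fake_adev {
	void __iomem *mmio;
	bool is_sriov_vf;
	void (*rlc_wreg)(struct fake_adev *adev, u32 reg, u32 val);
};

static void soc15_style_wreg(struct fake_adev *adev, u32 reg, u32 val)
{
	if (adev->is_sriov_vf && adev->rlc_wreg)
		adev->rlc_wreg(adev, reg, val);		/* indirect path for SR-IOV VFs */
	else
		writel(val, adev->mmio + reg * 4);	/* direct MMIO: bare metal / passthrough */
}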

>
> It accesses registers in the MMIO range of the MP0 IP, and the "MP0"
> name correlates highly with MMIO accesses in PSP-handling code.
> Is "MP0" another name for PSP (and "MP1" for SMU) ?  The MP0 version

Yes.

> reported at v11.0.3 by discovery seems to contradict the use of v12.0
> for RENOIR as set by soc15_set_ip_blocks(), or do I miss something ?

Typo in the ip discovery table on renoir.

>
> More generally (and mostly out of curiosity while we're at it), do we
> have a way to match IPs listed at discovery time with the ones used
> in the driver ?

In general, barring typos, the code is shared at the major version
level.  The actual code may or may not need changes to handle minor
revision changes in an IP.  The driver maps the IP versions from the
ip discovery table to the code contained in the driver.
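
As a toy illustration of that mapping (hypothetical names, covering only the
two versions discussed in this thread):

/* Hedged sketch: the major version from the discovery table picks the shared
 * implementation; minor/revision differences are handled inside it. */
struct ip_version { unsigned int major, minor, rev; };

static const char *pick_psp_code(struct ip_version v)
{
	switch (v.major) {
	case 11:
		return "psp_v11_x";	/* what the typo'd Renoir discovery entry would suggest */
	case 12:
		return "psp_v12_x";	/* what soc15_set_ip_blocks() actually uses for Renoir */
	default:
		return "unsupported";
	}
}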

>
> ---
>
> As for the register names, maybe we could have a short explanation of
> how they are structured ?  Eg. mmMP0_SMN_C2PMSG_69: that seems to be
> a MMIO register named "C2PMSG_69" in the "MP0" IP, but I'm not sure
> of the "SMN" part -- that could refer to the "System Management Network",
> described in [0] as an internal bus.  Are we accessing this register
> through this SMN ?

These registers are just mailboxes for the PSP firmware.  All of the
C2PMSG registers functionality is defined by the PSP firmware.  They
are basically scratch registers used to communicate between the driver
and the PSP firmware.
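
Very roughly, the handshake looks like the sketch below; the offsets and the
status bit are placeholders, not the real Renoir register map:

/* Hedged sketch of the scratch-register mailbox pattern: write a command,
 * then poll another C2PMSG register until the PSP firmware answers or we
 * time out (the symptom seen in this thread). */
#include <linux/io.h>
#include <linux/delay.h>
#include <linux/errno.h>
#include <linux/types.h>

#define FAKE_C2PMSG_CMD		0x69	/* placeholder dword offsets */
#define FAKE_C2PMSG_RESP	0x64
#define FAKE_RESP_READY		0x80000000u

static int psp_mailbox_cmd(void __iomem *mp0_mmio, u32 cmd)
{
	int i;

	writel(cmd, mp0_mmio + FAKE_C2PMSG_CMD * 4);
	for (i = 0; i < 1000; i++) {
		if (readl(mp0_mmio + FAKE_C2PMSG_RESP * 4) & FAKE_RESP_READY)
			return 0;		/* firmware acknowledged */
		udelay(10);
	}
	return -ETIMEDOUT;			/* PSP never answered */
}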

>
>
> >  On APUs, the PSP is shared with
> > the CPU and the rest of the platform.  The GPU driver just interacts
> > with it for a few specific tasks:
> > 1. Loading Trusted Applications (e.g., trusted firmware applications
> > that run on the PSP for specific functionality, e.g., HDCP and
> > content
> > protection, etc.)
> > 2. Validating and loading firmware for other engines on the SoC.
> >  This
> > is required to use those engines.
>
> Trying to understand in more details how we start the PSP up, I noticed
> that psp_v12_0 has support for loading a sOS firmware, but never calls
> init_sos_microcode() - and anyway there is no sos firmware for renoir
> and green_sardine, which seem to be the only ASICs with this PSP version.
> Is it something that's just not been completely wired up yet ?

On APUs, the PSP is shared with the CPU so the PSP firmware is part of
the sbios image.  The driver doesn't load it.  We only load it on
dGPUs where the driver is responsible for the chip initialization.
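
Schematically (invented names, not the actual amdgpu code path):

/* Hedged sketch of the APU/dGPU split described above: only a dGPU path
 * needs the driver to supply a sOS image; on an APU it came with the sbios. */
#include <stdbool.h>

struct fake_psp { bool on_apu; };

static int fake_init_sos_microcode(struct fake_psp *psp)
{
	(void)psp;
	return 0;	/* stub: fetch and validate the sOS image for dGPUs */
}

static int psp_maybe_init_sos(struct fake_psp *psp)
{
	if (psp->on_apu)
		return 0;	/* sOS is already running, shipped in the sbios image */
	return fake_init_sos_microcode(psp);
}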

>
> That also rings a bell, that we have nothing about Secure OS in the doc
> yet (not even the acronym in the glossary).
>
>
> > I'm not too familiar with the PSP's path to memory from the GPU
> > perspective.  IIRC, most memory used by the PSP goes through carve
> > out
> > "vram" on APUs so it should work, but I would double check if there
> > are any system memory allocations that used to interact with the PSP
> > and see if changing them to vram helps.  It does work with the IOMMU
> > enabled on bare metal, so it should work in passthrough as well in
> > theory.
>
> I can see a single case in the PSP code where GTT is used instead of
> vram: to create fw_pri_bo when SR-IOV is not used (and there has
> to be a reason, since the SR-IOV code path does use vram).
> Changing it to vram does not make a difference, but then the
> only bo that seems to be used at that point is the one for the psp ring,
> which is allocated in vram, so I'm not too much surprised.
>
> Maybe I should double-check bo_create calls to hunt for more ?

We looked 

Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-12-12 Thread Yann Dirson
Alex wrote:
> On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson  wrote:
> >
> > Hi Alex,
> >
> > > We have not validated virtualization of our integrated GPUs.  I
> > > don't
> > > know that it will work at all.  We had done a bit of testing but
> > > ran
> > > into the same issues with the PSP, but never had a chance to
> > > debug
> > > further because this feature is not productized.
> > ...
> > > You need a functional PSP to get the GPU driver up and running.
> >
> > Ah, thanks for the hint :)
> >
> > I guess that if I want to have any chance to get the PSP working
> > I'm
> > going to need more details on it.  A quick search some time ago
> > mostly
> > brought reverse-engineering work, rather than official AMD doc.
> >  Are
> > there some AMD resources I missed ?
> 
> The driver code is pretty much it.

Let's try to shed some more light on how things work, taking as excuse
psp_v12_0_ring_create().

First, register access through [RW]REG32_SOC15() is implemented in
terms of __[RW]REG32_SOC15_RLC__(), which is basically a [RW]REG32(),
except it has to be more complex in the SR-IOV case.
Has the RLC anything to do with SR-IOV ?

It accesses registers in the MMIO range of the MP0 IP, and the "MP0"
name correlates highly with MMIO accesses in PSP-handling code.
Is "MP0" another name for PSP (and "MP1" for SMU) ?  The MP0 version
reported at v11.0.3 by discovery seems to contradict the use of v12.0
for RENOIR as set by soc15_set_ip_blocks(), or do I miss something ?

More generally (and mostly out of curiosity while we're at it), do we
have a way to match IPs listed at discovery time with the ones used
in the driver ?

---

As for the register names, maybe we could have a short explanation of
how they are structured ?  Eg. mmMP0_SMN_C2PMSG_69: that seems to be
a MMIO register named "C2PMSG_69" in the "MP0" IP, but I'm not sure
of the "SMN" part -- that could refer to the "System Management Network",
described in [0] as an internal bus.  Are we accessing this register
through this SMN ?


>  On APUs, the PSP is shared with
> the CPU and the rest of the platform.  The GPU driver just interacts
> with it for a few specific tasks:
> 1. Loading Trusted Applications (e.g., trusted firmware applications
> that run on the PSP for specific functionality, e.g., HDCP and
> content
> protection, etc.)
> 2. Validating and loading firmware for other engines on the SoC.
>  This
> is required to use those engines.

Trying to understand in more details how we start the PSP up, I noticed
that psp_v12_0 has support for loading a sOS firmware, but never calls
init_sos_microcode() - and anyway there is no sos firmware for renoir
and green_sardine, which seem to be the only ASICs with this PSP version.
Is it something that's just not been completely wired up yet ?

That also rings a bell, that we have nothing about Secure OS in the doc
yet (not even the acronym in the glossary).


> I'm not too familiar with the PSP's path to memory from the GPU
> perspective.  IIRC, most memory used by the PSP goes through carve
> out
> "vram" on APUs so it should work, but I would double check if there
> are any system memory allocations that used to interact with the PSP
> and see if changing them to vram helps.  It does work with the IOMMU
> enabled on bare metal, so it should work in passthrough as well in
> theory.

I can see a single case in the PSP code where GTT is used instead of
vram: to create fw_pri_bo when SR-IOV is not used (and there has
to be a reason, since the SR-IOV code path does use vram).
Changing it to vram does not make a difference, but then the
only bo that seems to be used at that point is the one for the psp ring,
which is allocated in vram, so I'm not too much surprised.

Maybe I should double-check bo_create calls to hunt for more ?


[0] 
https://github.com/PSPReverse/psp-docs/blob/master/masterthesis-eichner-psp-2020.pdf


Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-12-09 Thread Alex Deucher
On Wed, Dec 8, 2021 at 5:50 PM Yann Dirson  wrote:
>
> Hi Alex,
>
> >
> > On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson  wrote:
> > >
> > > Hi Alex,
> > >
> > > > We have not validated virtualization of our integrated GPUs.  I
> > > > don't
> > > > know that it will work at all.  We had done a bit of testing but
> > > > ran
> > > > into the same issues with the PSP, but never had a chance to
> > > > debug
> > > > further because this feature is not productized.
> > > ...
> > > > You need a functional PSP to get the GPU driver up and running.
> > >
> > > Ah, thanks for the hint :)
> > >
> > > I guess that if I want to have any chance to get the PSP working
> > > I'm
> > > going to need more details on it.  A quick search some time ago
> > > mostly
> > > brought reverse-engineering work, rather than official AMD doc.
> > >  Are
> > > there some AMD resources I missed ?
> >
> > The driver code is pretty much it.  On APUs, the PSP is shared with
> > the CPU and the rest of the platform.  The GPU driver just interacts
> > with it for a few specific tasks:
> > 1. Loading Trusted Applications (e.g., trusted firmware applications
> > that run on the PSP for specific functionality, e.g., HDCP and
> > content
> > protection, etc.)
> > 2. Validating and loading firmware for other engines on the SoC.
> >  This
> > is required to use those engines.
>
> After some digging, if I understand correctly, the PSP is the 3rd IP
> getting its hw_init() called.  First comes soc15_common, then vega10_ih.
>
> - soc15_common_init_hw does some writes through nbio_v7.0 functions,
>   but does not query the hw to check before or after
> - vega10_init_hw does some register reads as part of its work, but once
>   it has written it does not check either
>
> So PSP is the first one to check that "soc15" (I'm still not sure what
> this one represents, really) is in fact alive and well.
>
> Can't we check earlier that the chip is really listening to us ?

Each SoC is made up of hardware blocks that provide various different
functionality.  They are mostly independent and mostly initialized
independently.  I'm not sure what you would want to check.  In your
case, I don't think it's an issue of the chip not being functional
overall, but rather a problem specific to the failing block somehow
related to being in a virtualized environment.
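
For what such a probe would be worth, a bare MMIO "ping" (write a pattern to a
scratch register and read it back) only proves basic register access; the
offset below is a placeholder, and a passing ping would say nothing about the
health of an individual block like the PSP, which is the point above:

/* Hedged sketch of a minimal "is the chip listening" check. */
#include <linux/io.h>
#include <linux/types.h>

#define FAKE_SCRATCH_REG	0x40	/* placeholder dword offset */

static bool mmio_ping(void __iomem *mmio)
{
	writel(0xDEADBEEF, mmio + FAKE_SCRATCH_REG * 4);
	return readl(mmio + FAKE_SCRATCH_REG * 4) == 0xDEADBEEF;
}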

Alex


>
> >
> > I'm not too familiar with the PSP's path to memory from the GPU
> > perspective.  IIRC, most memory used by the PSP goes through carve
> > out
> > "vram" on APUs so it should work, but I would double check if there
> > are any system memory allocations that used to interact with the PSP
> > and see if changing them to vram helps.  It does work with the IOMMU
> > enabled on bare metal, so it should work in passthrough as well in
> > theory.
> >
> > Alex
> >
> >
> > >
> > > Best regards,
> > > --
> > > Yann
> >


Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-12-08 Thread Yann Dirson
Hi Alex,

> 
> On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson  wrote:
> >
> > Hi Alex,
> >
> > > We have not validated virtualization of our integrated GPUs.  I
> > > don't
> > > know that it will work at all.  We had done a bit of testing but
> > > ran
> > > into the same issues with the PSP, but never had a chance to
> > > debug
> > > further because this feature is not productized.
> > ...
> > > You need a functional PSP to get the GPU driver up and running.
> >
> > Ah, thanks for the hint :)
> >
> > I guess that if I want to have any chance to get the PSP working
> > I'm
> > going to need more details on it.  A quick search some time ago
> > mostly
> > brought reverse-engineering work, rather than official AMD doc.
> >  Are
> > there some AMD resources I missed ?
> 
> The driver code is pretty much it.  On APUs, the PSP is shared with
> the CPU and the rest of the platform.  The GPU driver just interacts
> with it for a few specific tasks:
> 1. Loading Trusted Applications (e.g., trusted firmware applications
> that run on the PSP for specific functionality, e.g., HDCP and
> content
> protection, etc.)
> 2. Validating and loading firmware for other engines on the SoC.
>  This
> is required to use those engines.

After some digging, if I understand correctly, the PSP is the 3rd IP
getting its hw_init() called.  First comes soc15_common, then vega10_ih.

- soc15_common_init_hw does some writes through nbio_v7.0 functions,
  but does not query the hw to check before or after
- vega10_init_hw does some register reads as part of its work, but once
  it has written it does not check either

So PSP is the first one to check that "soc15" (I'm still not sure what
this one represents, really) is in fact alive and well.

Can't we check earlier that the chip is really listening to us ?

> 
> I'm not too familiar with the PSP's path to memory from the GPU
> perspective.  IIRC, most memory used by the PSP goes through carve
> out
> "vram" on APUs so it should work, but I would double check if there
> are any system memory allocations that used to interact with the PSP
> and see if changing them to vram helps.  It does work with the IOMMU
> enabled on bare metal, so it should work in passthrough as well in
> theory.
> 
> Alex
> 
> 
> >
> > Best regards,
> > --
> > Yann
> 


Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-12-06 Thread Alex Deucher
On Mon, Dec 6, 2021 at 4:36 PM Yann Dirson  wrote:
>
> Hi Alex,
>
> > We have not validated virtualization of our integrated GPUs.  I don't
> > know that it will work at all.  We had done a bit of testing but ran
> > into the same issues with the PSP, but never had a chance to debug
> > further because this feature is not productized.
> ...
> > You need a functional PSP to get the GPU driver up and running.
>
> Ah, thanks for the hint :)
>
> I guess that if I want to have any chance to get the PSP working I'm
> going to need more details on it.  A quick search some time ago mostly
> brought reverse-engineering work, rather than official AMD doc.  Are
> there some AMD resources I missed ?

The driver code is pretty much it.  On APUs, the PSP is shared with
the CPU and the rest of the platform.  The GPU driver just interacts
with it for a few specific tasks:
1. Loading Trusted Applications (e.g., trusted firmware applications
that run on the PSP for specific functionality, e.g., HDCP and content
protection, etc.)
2. Validating and loading firmware for other engines on the SoC.  This
is required to use those engines.

I'm not too familiar with the PSP's path to memory from the GPU
perspective.  IIRC, most memory used by the PSP goes through carve out
"vram" on APUs so it should work, but I would double check if there
are any system memory allocations that used to interact with the PSP
and see if changing them to vram helps.  It does work with the IOMMU
enabled on bare metal, so it should work in passthrough as well in
theory.

Alex


>
> Best regards,
> --
> Yann


Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-12-06 Thread Yann Dirson
Hi Alex,

> We have not validated virtualization of our integrated GPUs.  I don't
> know that it will work at all.  We had done a bit of testing but ran
> into the same issues with the PSP, but never had a chance to debug
> further because this feature is not productized.
...
> You need a functional PSP to get the GPU driver up and running.

Ah, thanks for the hint :)

I guess that if I want to have any chance to get the PSP working I'm
going to need more details on it.  A quick search some time ago mostly
brought reverse-engineering work, rather than official AMD doc.  Are
there some AMD resources I missed ?

Best regards,
-- 
Yann


Re: Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-12-06 Thread Alex Deucher
On Sat, Nov 27, 2021 at 11:28 AM  wrote:
>
> Hello,
>
> Xen passthrough of a boot GPU these days (at least in the small QubesOS world)
> is mostly tested/documented for Intel iGPUs (or I missed something).
> I've been trying to do that with a Renoir GPU (for context, the goal is
> to have a xen domU dedicated to the GUI [3]).  I won't go into all the details
> of my successive attempts in this email, various (relative) progress reports 
> are
> available at [0] (there are other things to be investigated listed there, but
> at least some of them can possibly wait).  And I have surely missed more than
> a couple of key points.

We have not validated virtualization of our integrated GPUs.  I don't
know that it will work at all.  We had done a bit of testing but ran
into the same issues with the PSP, but never had a chance to debug
further because this feature is not productized.

>
> Summary of the setup:
> - GPU protected from dom0 driver using pci-stub (gets access to the GPU via 
> efifb
>   until hopefully the GUI domain seizes it)
> - host is Xen 4.14, dom0 uses Linux 5.14 (Qubes' kernel-latest)
> - guest is a Xen HVM with its device model running in a stub domain, launched through 
> libvirt/libxl
> - hackish enablement of the IGD passthrough codepaths through
>   - libxl PCI VID hack: 
> https://github.com/ydirson/xen/commit/4c9d4cb5c3dc1282ba83f17d15072c197b60281c
>   - qemu BDF hack: 
> https://github.com/ydirson/qemu/commit/6a165467e25864f1ae17390a44a9c1425ba67aed
>
> The first problem encountered, i.e. that the guest amdgpu driver was not able
> to access the PCI expansion ROM, I have hacked around for now by letting the
> driver load as firmware a copy of the ROM [1] - this was a 5.14.15 
> kernel
> with the QubesOS patches (all reachable from this commit).
>
> Doing this seems to make the driver happy on this aspect, but several issues
> now become visible, and after some digging I feel some insights from people
> familiar with the code get really necessary :)
>
> The first problems are shown below as [T0], my interpretation being:
> 1. Xorg aborts (audit: type=1701) -- should find a way to get more details, 
> but
>that is surely not the root cause of what follows
> 2. a PSP command fails -- I cannot find any AMD documentation on how PSP 
> works,
>that could possibly help
> 3. the PSP fails to load some firmware as part of its own init -- here I'm 
> quite
>uncomfortable, I thought of the PSP as being distinct from the cpu cores 
> and
>gpu, but here it appears as a disting IP *within* the gpu.  I also failed 
> to
>find any detailed description of the whole stuff and their interactions.
> 4. following this failure the driver finishes (while initialization was still
>ongoing)

You need a functional PSP to get the GPU driver up and running.

> 5. then vcn_v2_0_sw_fini() triggers a bad memory access, which appeared to be
>while dereferencing adev->vcn.inst->fw_shared_cpu_addr.
>
> After adding traces on the individual IPs init/fini [2] showed that the vcn
> sw_init was indeed run, and likely initialized this pointer.  Any idea how
> it became invalid ?  One track I briefly followed was that some of the IP
> init appears to be asynchronous (the failure in PSP init occurs after later
> IPs get initialized), but that pointer seems to be initialized early and
> synchronously by VCN sw_init.
>
>
> Then, to work around the problem with the PSP not being able to initialize, I used
> fw_load_type=0 to use direct loading (and noted that fw_load_type=1, 
> advertised
> as loading firmware using SMU, just does not do anything in the code).

That will not work on modern GPUs.  The PSP is required for firmware
loading.  Without firmware the various engines on the GPU (GFX,
compute, VCN) won't work.

>
> The result, using 5.15.4 at this time, resulted in trace [T1].  The error 
> surfacing
> now is "ring kiq_2.1.0 test failed" with a timeout.  I had to dig the kernel 
> commit
> messages to discover that KIQ is a Kernel Interface Queue, and there are 
> various
> other acronyms around this (eg. "eop", whose introduction seems older than the
> landing of the driver in the kernel) which really make it hard to be 
> efficient at
> understanding the code.  Will gladly be enlightened :)
>
> And this also ends with the VCN sw_fini going fireworks, and a quick look at 
> the
> assembler seems to hint that although the code changed a bit, it is still the
> same statement crashing.
>
> Also noticed that ip_block_mask=0xfff7 to disable the PSP on this ASIC will 
> do slightly
> different things, but end up with the same errors.
>
>
> I will gladly take any suggestion, pointers to additional information, etc :)

PSP is fundamental to the operation of the GPU.

Alex


>
> Best regards,
> --
> Yann
>
>
> [0] https://forum.qubes-os.org/t/amd-igpu-passthrough-attempt/6766/
> [1] 
> https://github.com/ydirson/linux/commit/4ca50829aa44b29e8428328e913a0546568bf1c0
> [2] 
> 

Various problems trying to vga-passthrough a Renoir iGPU to a xen/qubes-os hvm

2021-11-27 Thread ydirson
Hello,

Xen passthrough of a boot GPU these days (at least in the small QubesOS world)
is mostly tested/documented for Intel iGPUs (or I missed something).
I've been trying to do that with a Renoir GPU (for context, the goal is
to have a xen domU dedicated to the GUI [3]).  I won't go into all the details
of my successive attempts in this email, various (relative) progress reports are
available at [0] (there are other things to be investigated listed there, but
at least some of them can possibly wait).  And I have surely missed more than
a couple of key points.

Summary of the setup:
- GPU protected from dom0 driver using pci-stub (gets access to the GPU via 
efifb
  until hopefully the GUI domain seizes it)
- host is Xen 4.14, dom0 uses Linux 5.14 (Qubes' kernel-latest)
- guest is a Xen HVM with its device model running in a stub domain, launched through 
libvirt/libxl
- hackish enablement of the IGD passthrough codepaths through
  - libxl PCI VID hack: 
https://github.com/ydirson/xen/commit/4c9d4cb5c3dc1282ba83f17d15072c197b60281c
  - qemu BDF hack: 
https://github.com/ydirson/qemu/commit/6a165467e25864f1ae17390a44a9c1425ba67aed

The first problem encountered, i.e. that the guest amdgpu driver was not able
to access the PCI expansion ROM, I have hacked around for now by letting the
driver load as firmware a copy of the ROM [1] - this was a 5.14.15 kernel
with the QubesOS patches (all reachable from this commit).

Doing this seems to make the driver happy on this aspect, but several issues
now become visible, and after some digging I feel some insights from people
familiar with the code get really necessary :)

The first problems are shown below as [T0], my interpretation being:
1. Xorg aborts (audit: type=1701) -- should find a way to get more details, but
   that is surely not the root cause of what follows
2. a PSP command fails -- I cannot find any AMD documentation on how PSP works,
   that could possibly help
3. the PSP fails to load some firmware as part of its own init -- here I'm quite
   uncomfortable, I thought of the PSP as being distinct from the cpu cores and
   gpu, but here it appears as a distinct IP *within* the gpu.  I also failed to
   find any detailed description of the whole stuff and their interactions.
4. following this failure the driver finishes (while initialization was still
   ongoing)
5. then vcn_v2_0_sw_fini() triggers a bad memory access, which appeared to be
   while dereferencing adev->vcn.inst->fw_shared_cpu_addr.

After adding traces on the individual IPs init/fini [2] showed that the vcn
sw_init was indeed run, and likely initialized this pointer.  Any idea how
it became invalid ?  One track I briefly followed was that some of the IP
init appears to be asynchronous (the failure in PSP init occurs after later
IPs get initialized), but that pointer seems to be initialized early and
synchronously by VCN sw_init.


Then, to work around the problem with the PSP not being able to initialize, I used
fw_load_type=0 to use direct loading (and noted that fw_load_type=1, advertised
as loading firmware using SMU, just does not do anything in the code).

The result, using 5.15.4 at this time, resulted in trace [T1].  The error 
surfacing
now is "ring kiq_2.1.0 test failed" with a timeout.  I had to dig the kernel 
commit
messages to discover that KIQ is a Kernel Interface Queue, and there are various
other acronyms around this (eg. "eop", whose introduction seems older than the
landing of the driver in the kernel) which really make it hard to be efficient 
at
understanding the code.  Will gladly be enlightened :)

And this also ends with the VCN sw_fini going fireworks, and a quick look at the
assembler seems to hint that although the code changed a bit, it is still the
same statement crashing.

Also noticed that ip_block_mask=0xfff7 to disable the PSP on this ASIC will do 
slightly
different things, but end up with the same errors.


I will gladly take any suggestion, pointers to additional information, etc :)

Best regards,
-- 
Yann


[0] https://forum.qubes-os.org/t/amd-igpu-passthrough-attempt/6766/
[1] 
https://github.com/ydirson/linux/commit/4ca50829aa44b29e8428328e913a0546568bf1c0
[2] 
https://github.com/ydirson/linux/commit/87004f9542b9a80b4fb838697312778cf47e4146
[3] 
https://www.qubes-os.org/news/2020/03/18/gui-domain/#gpu-passthrough-the-perfect-world-desktop-solution

[T0] 

[2021-11-23 21:05:52] [4.297684] amdgpu :00:05.0: amdgpu: Fetched VBIOS 
from firmware file
[2021-11-23 21:05:52] [4.297709] amdgpu: ATOM BIOS: 113-RENOIR-025
[2021-11-23 21:05:52] [4.302046] [drm] VCN decode is enabled in VM mode
[2021-11-23 21:05:52] [4.302066] [drm] VCN encode is enabled in VM mode
[2021-11-23 21:05:52] [4.302078] [drm] JPEG decode is enabled in VM mode
[2021-11-23 21:05:52] [4.302144] [drm] vm size is 262144 GB, 4 levels, 
block size is 9-bit, fragment size is 9-bit
[2021-11-23 21:05:52] [4.302181] amdgpu :00:05.0: amdgpu: VRAM: 512M 
0x00F4 -