-----Original Message-----
From: Alex Williamson [mailto:[email protected]] 
Sent: Wednesday, May 24, 2017 11:57 PM
To: Wang, Zhi A <[email protected]>
Cc: Dong, Chuanxiao <[email protected]>; Daniel Vetter 
<[email protected]>; Zhang, Xiong Y <[email protected]>; Joonas 
Lahtinen <[email protected]>; Chris Wilson 
<[email protected]>; Lv, Zhiyuan <[email protected]>; Zhenyu Wang 
<[email protected]>; Tian, Kevin <[email protected]>
Subject: Re: Deal with stolen memory in GVT-d (passthrough)

On Wed, 24 May 2017 13:24:55 +0800
Zhi Wang <[email protected]> wrote:

> On 05/24/17 10:33, Alex Williamson wrote:
> > On Wed, 24 May 2017 09:10:23 +0800
> > Zhi Wang <[email protected]> wrote:
> >  
> >> On 05/24/17 01:01, Alex Williamson wrote:  
> >>> On Tue, 23 May 2017 17:14:53 +0800 Zhi Wang <[email protected]> 
> >>> wrote:
> >>>     
> >>>> Hi All:
> >>>>        We did an investigation for the further directions. First, 
> >>>> Alex, do you wish us to support exposing stolen memory through 
> >>>> RMRR in QEMU IOMMU emulation? Suppose this is a nested 
> >>>> virtualization case: QEMU IOMMU emulation expose the stolen memory 
> >>>> region through RMRR to L2 guest?
> >>> Yes, if the guest has a vIOMMU then the stolen memory mapping to 
> >>> the device needs to be protected, via an RMRR if the vIOMMU is 
> >>> VT-d.  I don't see how nesting adds any additional complication 
> >>> beyond regular IOMMU support in the guest.
> >>>     
> >>>> For exposing stolen memory in L1 guest, Xiong and I found several opens:
> >>>>
> >>>> As we are going to implement RMRR as a special VFIO region, 
> >>>> suppose QEMU would obtain the information of stolen memory region via 
> >>>> VFIO ioctls.
> >>>> One problem is that the memory layout is currently initialized 
> >>>> earlier than vfio device realization. After the memory layout is 
> >>>> fixed in machine initialization, if the host stolen memory base 
> >>>> (H-GSM BASE) later falls in the guest RAM region, we get an 
> >>>> overlap problem.
> >>>>
> >>>>    From our point of view, what we can do are:
> >>>>
> >>>> choice a) Adjust the memory layout in vfio_realize(). But it 
> >>>> would be complicated and buggy as no one has this requirement before.
> >>> I don't think QEMU would be in favor of devices manipulating the 
> >>> machine layout.
> >>>     
> >>>> choice b) Query the vfio device stolen memory region in
> >>>> vfio_instance_post_init() and reserve the GSM BASE at this time.
> >>>> choice c) Add a new command line option to "vfio-pci" device, 
> >>>> user can specify the GSM BASE (from kernel VFIO driver) and QEMU 
> >>>> reserves it in vfio_instance_post_init().
> >>> c) seems like a constant source of user confusion.  Where is the 
> >>> user going to learn about the stolen memory base address in order 
> >>> to properly configure their VM?
> >>>
> >>> b) also doesn't seem particularly viable since we only understand 
> >>> the device we're working with after opening the device, which we 
> >>> cannot do without a lot of setup, which is not done by this point.
> >> For b) If we are going to walk this way, I suppose
> >>                * we will open the vfio device in 
> >> instance_post_init() then close it
> >>                * or we move some code from vfio_realize() into 
> >> instance_post_init().
> > I don't think this is going to work, it's abusing the entire QEMU 
> > device infrastructure to meddle with the machine layout for a device 
> > with ill conceived requirements.
> That's also my concern. :( I just put this option on the table. :P
> >>> Choice c) seems like the least bad option (this is why not being 
> >>> "just a PCI device" is so hard to deal with), but this should 
> >>> really be discussed on qemu-devel, maybe there are better ideas 
> >>> there.  Thanks,
> >> For c) my idea is vfio can expose some region info in sysfs. so 
> >> user could know how to fill the information of stolen memory, and 
> >> we can check that in the vfio_realize().
> > vfio_realize()?  If we could wait until then we wouldn't need sysfs.
> > vfio devices have no representation in sysfs nor am I particularly 
> > fond of creating one.  Thanks,
> Sorry for the confusion. I mean the user can pass the GSM base and size 
> through the QEMU command line, but we still need to check in 
> vfio_realize() that the configuration from the user matches the 
> configuration from the VFIO stolen memory region.
> 
> As for how the user could get the GSM base and size, it's just a rough 
> idea. There should be many possible ways, e.g. the user could read it 
> from the PCI device node in sysfs via a script. Let's see if we can find 
> a graceful way to do that.
> 
> Uh. Literally, QEMU just needs to check that the user passes the right 
> GSM base and size, right?  No matter how they get it. :)

[Alex] That's true, we can validate it against the device once we get to that 
point, though it becomes difficult to justify to users why they need to go 
figure it out themselves given that incongruity.  Are there any constraints on 
the host system for where stolen memory is placed?  IIRC it's a 1MB-aligned 
address; is it a 32-bit mapping or can it be 64-bit?

[Zhi] According to this pdf:
http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/desktop-6th-gen-core-family-datasheet-vol-2.pdf

        4.24 Base Data of Stolen Memory (BDSM)-Offset 5Ch

        This register contains the base address of graphics data stolen DRAM
        memory. BIOS determines the base of graphics data stolen memory by
        subtracting the graphics data stolen memory size (PCI Device 0 offset
        52 bits 7:4) from TOLUD (PCI Device 0 offset BC bits 31:20).

Stolen memory always sits below TOLUD (Top of Low Usable DRAM)[1], so its base 
must be a 32-bit physical address.
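
As a worked example of the datasheet arithmetic quoted above, here is a minimal 
Python sketch. The function name and the sample register values are made up for 
illustration. Decoding the stolen size field is not shown, so the size is 
passed in bytes, and the 1MB size granularity check is an assumption rather 
than something the datasheet text above states.

```python
MB = 1 << 20
TOLUD_ADDR_MASK = 0xFFF00000  # TOLUD address lives in bits 31:20


def stolen_base(tolud_reg: int, stolen_size: int) -> int:
    """Return the data stolen memory base: TOLUD minus the stolen size.

    tolud_reg is the raw 32-bit TOLUD register value; stolen_size is the
    already-decoded size in bytes (hypothetical input, decoding not shown).
    """
    tolud = tolud_reg & TOLUD_ADDR_MASK
    if stolen_size % MB:
        # Assumption: sizes come in whole megabytes, so the base stays
        # 1MB aligned.
        raise ValueError("stolen size is expected to be a multiple of 1MB")
    if stolen_size > tolud:
        raise ValueError("stolen size exceeds TOLUD")
    return tolud - stolen_size


# Made-up values: TOLUD at 2GB with a low bit set to show the masking,
# and 64MB of stolen memory.
base = stolen_base(0x80000001, 64 * MB)
print(hex(base))  # 0x7c000000
```

Because TOLUD itself is a 32-bit value and the size is subtracted from it, the 
result can never need more than 32 bits.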

[Alex] If there are any constraints that would make it relatively compatible 
with a QEMU VM already, we might have the option of simply marking the range 
reserved and ignoring that we're wasting that VM memory. Wasteful, but perhaps 
easier than changing the VM memory map. 

[Zhi] Yes, that's a great option. :) I put some detailed information in 
approach B below.

[Alex] We also have the option of using the IOMMU to map VM memory to the stolen 
memory IOVA such that the VM has their own stolen memory space.

[Zhi] That's a good point. But some function blocks in GEN, like the GuC, do 
not honor the IOMMU. So we can do that, but a naughty HW block might break our 
magic.

Yeah, we can leave the host stolen memory to those naughty function blocks in 
GEN and use the IOMMU to map VM memory at IOVA = HGSM BASE. Does that look 
better?

Function blocks that honor the IOMMU can still use the VM-dedicated stolen 
memory, as IOVA = GSM BASE has been mapped to the VM stolen memory by the 
IOMMU. Then we don't need to care about changing the VM memory layout. Function 
blocks that don't honor the IOMMU just access host stolen memory directly. :(

The guest still has a chance to sniff the information in host stolen memory, 
but only if it knows how to manipulate the HW function blocks which don't honor 
the IOMMU.

It looks like the isolation is still not perfect.

[Alex] Again, this is perhaps viewed as wasteful, but I can only presume that 
stolen memory is not cleared on IGD FLR, so there might be a security advantage 
to avoid granting the user access to the host stolen memory. Otherwise vfio 
would likely need to explicitly memset stolen memory when opening and releasing 
the device.

[Zhi] I totally agree stolen memory should be memset to zero. :P For example, 
if one guest uses part of the stolen memory as a framebuffer and suddenly shuts 
down, another guest that boots up afterwards can sniff the framebuffer, save it 
into a picture file, and learn the previous guest's screen content. One more 
concern: I remember some SW/HW relies on data (from BIOS) in the stolen memory 
for configuration. If that is the case, we can clear the stolen memory 
selectively. Would that be an option? :)

[Alex] Also be aware that reading more than the first 64bytes of config space 
on a device requires privileges, so if QEMU starts poking around in sysfs to 
learn the stolen memory size and location, that means that libvirt would need 
to grant QEMU sufficient privileges to do that. Thanks,

[Zhi] Yes. I thought about that before. If ordinary VFIO pass-through doesn't 
need higher privileges but IGD pass-through does, it's a deployment burden.

It looks like our steps are:

1) Allocate VM memory at GPA = HGSM BASE for guest stolen memory. This is mostly 
for a guest that populates its GPU page table based on the same IOVA = GPA 
mapping, since we have to keep host GSM BASE = guest GSM BASE. (But I think we 
can grant the smallest amount of stolen memory :P)

In hw/vfio/pci-quirks.c:

Approach a: 

Suppose the guest won't directly read/write much data in stolen memory, since 
it has been marked as E820_RESERVED. Then maybe we can allocate a new 
MemoryRegion with memory_region_init_io() and add it with 
memory_region_add_subregion_overlap() in the IGD quirk.

Approach b:

If we are going to avoid the trap above:
        - Case A: [HGSM BASE, HGSM BASE + HGSM SIZE) falls fully into guest RAM.
                We reserve that portion in E820.
        - Case B: [HGSM BASE, HGSM BASE + HGSM SIZE) falls partially into
                guest RAM.
                We allocate the missing amount of guest RAM and link it after
                the end of RAM below 4G.
        - Case C: [HGSM BASE, HGSM BASE + HGSM SIZE) falls into a non-guest-RAM
                range.
                We allocate a new portion of guest RAM.
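
To make the three cases concrete, here is a rough Python sketch of the 
classification. It assumes a single contiguous guest RAM range below 4G and 
non-zero sizes; the names (Overlap, classify, missing_bytes) and the sample 
addresses are invented for illustration and are not QEMU code.

```python
from enum import Enum

MB = 1 << 20


class Overlap(Enum):
    FULL = "A"     # stolen range lies entirely inside guest RAM
    PARTIAL = "B"  # stolen range only partly overlaps guest RAM
    NONE = "C"     # stolen range does not touch guest RAM


def classify(gsm_base, gsm_size, ram_base, ram_size):
    """Classify [gsm_base, gsm_base + gsm_size) against one guest RAM range."""
    gsm_end = gsm_base + gsm_size
    ram_end = ram_base + ram_size
    if gsm_end <= ram_base or gsm_base >= ram_end:
        return Overlap.NONE
    if gsm_base >= ram_base and gsm_end <= ram_end:
        return Overlap.FULL
    return Overlap.PARTIAL


def missing_bytes(gsm_base, gsm_size, ram_base, ram_size):
    """Bytes of the stolen range that guest RAM does not already back."""
    lo = max(gsm_base, ram_base)
    hi = min(gsm_base + gsm_size, ram_base + ram_size)
    return gsm_size - max(0, hi - lo)


# Made-up layout: 64MB of stolen memory at 0x7c000000.
GSM = (0x7C000000, 64 * MB)
print(classify(*GSM, 0, 0x80000000).name)        # FULL
print(classify(*GSM, 0, 0x7E000000).name)        # PARTIAL
print(missing_bytes(*GSM, 0, 0x7E000000) // MB)  # 32
print(classify(*GSM, 0, 0x40000000).name)        # NONE
```

In Case A only the E820 reservation is needed, while missing_bytes() gives the 
amount of extra guest RAM to allocate in Cases B and C.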

Or, no matter which case applies, we just allocate new RAM / a new MemoryRegion, 
add it into the system memory space, and give it a higher priority than system 
RAM. Then we use e820_add_entry() to reserve that guest RAM as stolen memory.

2) Identity-map IOVA = HGSM BASE in the IOMMU (backed by the VM-dedicated stolen 
memory above) for those functions that honor the IOMMU. We might copy some 
configuration from host stolen memory if necessary.

3) For those HW functions which don't honor IOMMU, we check if there is any 
security vulnerability.

4) Memset host stolen memory when opening/releasing the VFIO device (HW 
functions which don't honor the IOMMU might leak some information here. Granting 
the smallest amount of stolen memory also costs less time here.)
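
Step 4, combined with the selective clearing idea mentioned earlier, could look 
roughly like the Python sketch below. It only models the logic on a bytearray. 
The function name and the "keep" ranges are hypothetical, and which regions (if 
any) hold BIOS data that must survive is still an open question.

```python
def clear_stolen(mem, keep=()):
    """Zero mem in place, skipping the (offset, length) ranges in keep.

    mem is a bytearray standing in for the mapped stolen memory; keep lists
    hypothetical regions whose content must be preserved.
    """
    pos = 0
    for off, length in sorted(keep):
        if off < pos or off + length > len(mem):
            raise ValueError("keep ranges overlap or exceed the region")
        mem[pos:off] = bytes(off - pos)  # zero the gap before this range
        pos = off + length
    mem[pos:] = bytes(len(mem) - pos)    # zero the tail


# Made-up example: a 16-byte region where bytes 4..7 must be preserved.
buf = bytearray(b"\xff" * 16)
clear_stolen(buf, keep=[(4, 4)])
print(buf.hex())  # 00000000ffffffff0000000000000000
```

With an empty keep list this degenerates into a plain memset of the whole 
region, which is the behavior wanted on open and release.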

Feel free to let me know your ideas and concerns.

Thanks,
Zhi.

[1] Refer to Section 3.37 for introduction to TOLUD.
  