Hi Alex,

On 13/03/18 17:56, Alex Williamson wrote:
> [Cc +Eric]
>
> On Tue, 13 Mar 2018 15:53:19 +1100
> Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
>
>> On 7/3/18 1:17 pm, Alexey Kardashevskiy wrote:
>>> On 26/02/18 19:36, Alexey Kardashevskiy wrote:
>>>> On 19/02/18 13:46, Alexey Kardashevskiy wrote:
>>>>> On 16/02/18 16:28, David Gibson wrote:
>>>>>> On Wed, Feb 14, 2018 at 08:55:41AM -0700, Alex Williamson wrote:
>>>>>>> On Wed, 14 Feb 2018 19:09:16 +1100
>>>>>>> Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
>>>>>>>
>>>>>>>> On 14/02/18 12:33, David Gibson wrote:
>>>>>>>>> On Tue, Feb 13, 2018 at 07:20:56PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>>>
>>>>>>>>>> On 13/02/18 16:41, David Gibson wrote:
>>>>>>>>>>> On Tue, Feb 13, 2018 at 04:36:30PM +1100, David Gibson wrote:
>>>>>>>>>>>> On Tue, Feb 13, 2018 at 12:15:52PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>> On 13/02/18 03:06, Alex Williamson wrote:
>>>>>>>>>>>>>> On Mon, 12 Feb 2018 18:05:54 +1100
>>>>>>>>>>>>>> Alexey Kardashevskiy <a...@ozlabs.ru> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 12/02/18 16:19, David Gibson wrote:
>>>>>>>>>>>>>>>> On Fri, Feb 09, 2018 at 06:55:01PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>>>>>>>>>>> At the moment, if vfio_memory_listener is registered in the system memory address space, it maps/unmaps every RAM memory region for DMA. It expects system page size aligned memory sections so that vfio_dma_map does not fail, and so far this has been the case. A mapping failure would be fatal. A side effect of such behavior is that some MMIO pages would silently be left unmapped.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> However, we are going to change MSIX BAR handling so we will end up having non-aligned sections in vfio_memory_listener (more details are in the next patch) and vfio_dma_map will exit QEMU.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In order to avoid fatal failures on what previously was not a failure and was just silently ignored, this checks the section alignment against the smallest supported IOMMU page size and prints an error if it is not aligned; it also prints an error if vfio_dma_map failed despite the page size check. Neither error is fatal; only MMIO RAM regions (aka "RAM device" regions) are checked.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If the amount of errors printed is overwhelming, the MSIX relocation could be used to avoid excessive error output.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This is unlikely to cause any behavioral change.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> There are some relatively superficial problems noted below.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But more fundamentally, this feels like it's extending an existing hack past the point of usefulness.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The explicit check for is_ram_device() here has always bothered me - it's not like a real bus bridge magically knows whether a target address maps to RAM or not.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What I think is really going on is that even for systems without an IOMMU, it's not really true to say that the PCI address space maps directly onto address_space_memory. Instead, there's a large, but much less than 2^64 sized, "upstream window" at address 0 on the PCI bus, which is identity mapped to the system bus. Details will vary with the system, but in practice we expect nothing but RAM to be in that window. Addresses not within that window won't be mapped to the system bus but will just be broadcast on the PCI bus and might be picked up as a p2p transaction.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Currently this p2p works only via the IOMMU; direct p2p is not possible as the guest needs to know physical MMIO addresses to make p2p work, and it does not.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> /me points to the Direct Translated P2P section of the ACS spec, though it's as prone to spoofing by the device as ATS. In any case, p2p reflected from the IOMMU is still p2p and offloads the CPU even if bandwidth suffers vs bare metal, depending on whether the data doubles back over any links. Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sure, I was just saying that p2p via IOMMU won't be as simple as broadcast on the PCI bus; the IOMMU needs to be programmed in advance to make this work, and currently that broadcast won't work for the passed-through devices.
>>>>>>>>>>>>
>>>>>>>>>>>> Well, sure, p2p in a guest with passthrough devices clearly needs to be translated through the IOMMU (and p2p from a passthrough to an emulated device is essentially impossible).
>>>>>>>>>>>>
>>>>>>>>>>>> But... what does that have to do with this code? This is the memory area watcher, looking for memory regions being mapped directly into the PCI space. NOT IOMMU regions, since those are handled separately by wiring up the IOMMU notifier. This will only trigger if RAM-like, non-RAM regions are put into PCI space *not* behind an IOMMU.
>>>>>>>>>>>
>>>>>>>>>>> Duh, sorry, realised I was mixing up host and guest IOMMU. I guess the point here is that this will map RAM-like devices into the host IOMMU when there is no guest IOMMU, allowing p2p transactions between passthrough devices (though not from passthrough to emulated devices).
>>>>>>>>>>
>>>>>>>>>> Correct.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The conditions still seem kind of awkward to me, but I guess it makes sense.
>>>>>>>>>>
>>>>>>>>>> Is it time to split this listener into a RAM listener and a PCI bus listener?
>>>>>>>>>
>>>>>>>>> I'm not really sure what you mean by that.
>>>>>>>>>
>>>>>>>>>> On x86 it listens on the "memory" AS, on spapr - on the "pci@800000020000000" AS; this will just create more confusion over time...
>>>>>>>>>
>>>>>>>>> Hm, it's still logically the same AS in each case - the PCI address space. It's just that in the x86 case that happens to be the same as the memory AS (at least when there isn't a guest side IOMMU).
>>>>>>>>
>>>>>>>> Hm. Ok.
>>>>>>>>
>>>>>>>>> I do wonder if that's really right even for x86 without IOMMU. The PCI address space is identity mapped to the "main" memory address space there, but I'm not sure it's mapped th *all* of the main memory
>>>>>>>>>
>>>>>>>>
>>>>>>>> What should have been in the place of "th" above? :)
>>>>>>>>
>>>>>>>>> address space, probably just certain ranges of it. That's what I was suggesting earlier in the thread.
>>>>>>>>
>>>>>>>> afaict vfio is trying to mmap every RAM MR == KVM memory slot, no ranges or anything like that. I am trying to figure out what is not correct with the patch. Thanks,
>>>>>>>
>>>>>>> It is possible for x86 systems to have translation applied to MMIO vs RAM such that the processor view and device view of MMIO are different, putting them into separate address spaces, but this is not typical and not available on the class of chipsets that QEMU emulates for PC. Therefore I think it's correct that MMIO and RAM all live in one big happy, flat address space as they do now (assuming the guest IOMMU is not present or disabled). Thanks,
>>>>>>
>>>>>> Ah.. I think the thing I was missing is that on PC (at least with typical chipsets) *all* MMIO essentially comes from PCI space. Even the legacy devices are essentially ISA, which is treated as being on a bridge under the PCI space. On non-x86 there are often at least a handful of MMIO devices that aren't within the PCI space - say, the PCI host bridge itself at least. x86 avoids that by using the separate IO space, which is also a bit weird in that it's simultaneously directly attached to the cpu (and so doesn't need bridge configuration), but also identified with the legacy IO space treated as being under two bridges (host->PCI, PCI->ISA) for other purposes.
>>>>>>
>>>>>> Maybe it's just me, but I find it makes more sense if I think of it as the two ASes being equal because on PC the system address map (address_space_memory) is made equal to the PCI address map, rather than the PCI address map being made equal to the system one.
>>>>>
>>>>> It makes more sense to me too; we just do not exploit/expose this on SPAPR - the PCI address space only has 2xIOMMU and 1xMSI windows and that's it, no MMIO which is mapped to the system AS only (adding those to the PCI AS is a little tricky as mentioned elsewhere).
>>>>>
>>>>> Anyway, I still wonder - what needs to be addressed in the 2/4 patch in order to proceed?
>>>>
>>>> Ping?
>>>>
>>>
>>>
>>> Ping?
>>>
>>>
>>
>>
>> Could anyone please enlighten me what needs to be done with this patchset to proceed, or confirm that it is ok but not going anywhere as everybody is busy with other things? :) Thanks,
>
> Actually making sure it compiles would have been nice:
>
> qemu.git/hw/vfio/common.c: In function ‘vfio_listener_region_add’:
> qemu.git/hw/vfio/common.c:550:40: error: invalid operands to binary & (have ‘Int128 {aka struct Int128}’ and ‘hwaddr {aka long long unsigned int}’)
>      if ((iova & pgmask) || (llsize & pgmask)) {
>                                     ^
> qemu.git/hw/vfio/common.c: In function ‘vfio_listener_region_del’:
> qemu.git/hw/vfio/common.c:669:50: error: invalid operands to binary & (have ‘Int128 {aka struct Int128}’ and ‘hwaddr {aka long long unsigned int}’)
>          try_unmap = !((iova & pgmask) || (llsize & pgmask));
>                                                  ^
> Clearly llsize needs to be wrapped in int128_get64() here. Should I presume that testing of this patch is equally lacking? Eric, have you done any testing of this? I think this was fixing a mapping issue you reported. Thanks,

Not that version, unfortunately. I don't have access to the HW on which I had the issue anymore. Maybe I could get access again by the beginning of next week?
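In the meantime, just to check that I read the fix the same way: the sketch below is roughly what I understand is being asked for. It is only a sketch - the pgmask derivation and the error message are my assumptions, not the actual patch; the two masked checks and int128_get64() are taken from your report above.

    /* vfio_listener_region_add(): mask the 64-bit value of llsize,
     * never the Int128 itself; pgmask is assumed to come from the
     * smallest supported IOMMU page size, e.g. something like
     * (1ULL << ctz64(hostwin->iova_pgsizes)) - 1 */
    if ((iova & pgmask) || (int128_get64(llsize) & pgmask)) {
        /* non-fatal by design: report the unaligned section and skip it */
        error_report("vfio: unaligned RAM device section, not mapping it for DMA");
        return;
    }

    /* vfio_listener_region_del(): same int128_get64() wrapping */
    try_unmap = !((iova & pgmask) || (int128_get64(llsize) & pgmask));

That is, both operands of & end up as plain 64-bit integers.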
Thanks

Eric
>
> Alex
>