Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 03/09/2015 07:26 AM, Andrew Jones wrote: On Fri, Mar 06, 2015 at 01:08:29PM -0800, Mario Smarduch wrote: On 03/05/2015 09:43 AM, Paolo Bonzini wrote: On 05/03/2015 15:58, Catalin Marinas wrote: It would especially suck if the user has a cluster with different machines, some of them coherent and others non-coherent, and then has to debug why the same configuration works on some machines and not on others. That's a problem indeed, especially with guest migration. But I don't think we have any sane solution here for the bus master DMA. I do not oppose doing cache management in QEMU for bus master DMA (though if the solution you outlined below works it would be great). ARM can override them as well but only making them stricter. Otherwise, on a weakly ordered architecture, it's not always safe (let's say the guest thinks it accesses Strongly Ordered memory and avoids barriers for flag updates but the host upgrades it to Cacheable which breaks the memory order). The same can happen on x86 though, even if it's rarer. You still need a barrier between stores and loads. If we want the host to enforce guest memory mapping attributes via stage 2, we could do it the other way around: get the guests to always assume full cache coherency, generating Normal Cacheable mappings, but use the stage 2 attributes restriction in the host to make such mappings non-cacheable when needed (it works this way on ARM but not in the other direction to relax the attributes). That sounds like a plan for device assignment. But it still would not solve the problem of the MMIO framebuffer, right? The problem arises with MMIO areas that the guest can reasonably expect to be uncacheable, but that are optimized by the host so that they end up backed by cacheable RAM. It's perfectly reasonable that the same device needs cacheable mapping with one userspace, and works with uncacheable mapping with another userspace that doesn't optimize the MMIO area to RAM. Unless the guest allocates the framebuffer itself (e.g. dma_alloc_coherent), we can't control the cacheability via dma-coherent properties as it refers to bus master DMA. Okay, it's good to rule that out. One less thing to think about. :) Same for _DSD. So for MMIO with the buffer allocated by the host (Qemu), the only solution I see on ARM is for the host to ensure coherency, either via explicit cache maintenance (new KVM API) or by changing the memory attributes used by Qemu to access such virtual MMIO. Basically Qemu is acting as a bus master when reading the framebuffer it allocated but the guest considers it a slave access and we don't have a way to tell the guest that such accesses should be cacheable, nor can we upgrade them via architecture features. Yes, that's a way to put it. In practice, the VGA framebuffer has an optimization that uses dirty page tracking, so we could piggyback on the ioctls that return which pages are dirty. It turns out that piggybacking on those ioctls also should fix the case of migrating a guest while the MMU is disabled. Yes, Qemu would need to invalidate the cache before reading a dirty framebuffer page. As I said above, an API that allows non-cacheable mappings for the VGA framebuffer in Qemu would also solve the problem. I'm not sure what KVM provides here (or whether we can add such API). Nothing for now; other architectures simply do not have the issue. As long as it's just VGA, we can quirk it. There's just a couple vendor/device IDs to catch, and the guest can then use a cacheable mapping. 
For a more generic solution, the API would be madvise(MADV_DONTCACHE). It would be easy for QEMU to use it, but I am not too optimistic about convincing the mm folks about it. We can try. I forgot to list this one in my summary of approaches[*]. This is a nice, clean approach. Avoids getting cache maintenance into everything. However, besides the difficulty to get it past mm people, it reduces performance for any userspace-userspace uses/sharing of the memory. userspace-guest requires cache maintenance, but nothing else. Maybe that's not an important concern for the few emulated devices that need it though. Interested to see the outcome. I was thinking of a very basic memory driver that can provide an uncached memslot to QEMU - in mmap() file operation apply pgprot_uncached to allocated pages, lock them, flush TLB call remap_pfn_range(). I guess this is the same as the madvise approach, but with a driver. KVM could take this approach itself when memslots are added/updated with the INCOHERENT flag. Maybe worth some experimental patches to find out? I would work on this but I'm tied up for the next 3 weeks. If anyone is interested I can provide the base code; I used it for memory passthrough, although testing may be time consuming. I think the hurdle here is making sure the kernel doesn't remap these pages for any reason, like page migration; locking the pages should tell the kernel not to.
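A rough, untested sketch of the "uncached memslot" driver idea described above: a character device whose mmap() hands QEMU a non-cacheable mapping of kernel-allocated pages. The names are made up for illustration, pgprot_noncached() is the helper arm64 actually provides for the attribute change (the "pgprot_uncached" above is presumably shorthand for it), and error/release cleanup is omitted.

/*
 * Sketch only: char device mmap() that backs a QEMU memslot with
 * non-cacheable memory.  Function and fops names are hypothetical.
 */
#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/module.h>

static int uncached_memslot_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;
	struct page *pages;

	/*
	 * Back the memslot with contiguous pages.  They are not on the
	 * LRU, so they will not be reclaimed or migrated behind our back.
	 */
	pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, get_order(size));
	if (!pages)
		return -ENOMEM;

	/* Give userspace (QEMU) a non-cacheable view of the pages. */
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
	vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;

	/* Install the PFN range into the QEMU process page tables. */
	return remap_pfn_range(vma, vma->vm_start, page_to_pfn(pages),
			       size, vma->vm_page_prot);
}

static const struct file_operations uncached_memslot_fops = {
	.owner = THIS_MODULE,
	.mmap  = uncached_memslot_mmap,
};

QEMU would mmap() such a device and register the resulting address range as a memslot; KVM could achieve the same effect internally for memslots carrying the proposed INCOHERENT flag, as suggested above.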
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Fri, Mar 06, 2015 at 01:08:29PM -0800, Mario Smarduch wrote: On 03/05/2015 09:43 AM, Paolo Bonzini wrote: On 05/03/2015 15:58, Catalin Marinas wrote: It would especially suck if the user has a cluster with different machines, some of them coherent and others non-coherent, and then has to debug why the same configuration works on some machines and not on others. That's a problem indeed, especially with guest migration. But I don't think we have any sane solution here for the bus master DMA. I do not oppose doing cache management in QEMU for bus master DMA (though if the solution you outlined below works it would be great). ARM can override them as well but only making them stricter. Otherwise, on a weakly ordered architecture, it's not always safe (let's say the guest thinks it accesses Strongly Ordered memory and avoids barriers for flag updates but the host upgrades it to Cacheable which breaks the memory order). The same can happen on x86 though, even if it's rarer. You still need a barrier between stores and loads. If we want the host to enforce guest memory mapping attributes via stage 2, we could do it the other way around: get the guests to always assume full cache coherency, generating Normal Cacheable mappings, but use the stage 2 attributes restriction in the host to make such mappings non-cacheable when needed (it works this way on ARM but not in the other direction to relax the attributes). That sounds like a plan for device assignment. But it still would not solve the problem of the MMIO framebuffer, right? The problem arises with MMIO areas that the guest can reasonably expect to be uncacheable, but that are optimized by the host so that they end up backed by cacheable RAM. It's perfectly reasonable that the same device needs cacheable mapping with one userspace, and works with uncacheable mapping with another userspace that doesn't optimize the MMIO area to RAM. Unless the guest allocates the framebuffer itself (e.g. dma_alloc_coherent), we can't control the cacheability via dma-coherent properties as it refers to bus master DMA. Okay, it's good to rule that out. One less thing to think about. :) Same for _DSD. So for MMIO with the buffer allocated by the host (Qemu), the only solution I see on ARM is for the host to ensure coherency, either via explicit cache maintenance (new KVM API) or by changing the memory attributes used by Qemu to access such virtual MMIO. Basically Qemu is acting as a bus master when reading the framebuffer it allocated but the guest considers it a slave access and we don't have a way to tell the guest that such accesses should be cacheable, nor can we upgrade them via architecture features. Yes, that's a way to put it. In practice, the VGA framebuffer has an optimization that uses dirty page tracking, so we could piggyback on the ioctls that return which pages are dirty. It turns out that piggybacking on those ioctls also should fix the case of migrating a guest while the MMU is disabled. Yes, Qemu would need to invalidate the cache before reading a dirty framebuffer page. As I said above, an API that allows non-cacheable mappings for the VGA framebuffer in Qemu would also solve the problem. I'm not sure what KVM provides here (or whether we can add such API). Nothing for now; other architectures simply do not have the issue. As long as it's just VGA, we can quirk it. There's just a couple vendor/device IDs to catch, and the guest can then use a cacheable mapping. For a more generic solution, the API would be madvise(MADV_DONTCACHE). 
It would be easy for QEMU to use it, but I am not too optimistic about convincing the mm folks about it. We can try. I forgot to list this one in my summary of approaches[*]. This is a nice, clean approach. Avoids getting cache maintenance into everything. However, besides the difficulty to get it past mm people, it reduces performance for any userspace-userspace uses/sharing of the memory. userspace-guest requires cache maintenance, but nothing else. Maybe that's not an important concern for the few emulated devices that need it though. Interested to see the outcome. I was thinking of a very basic memory driver that can provide an uncached memslot to QEMU - in mmap() file operation apply pgprot_uncached to allocated pages, lock them, flush TLB call remap_pfn_range(). I guess this is the same as the madvise approach, but with a driver. KVM could take this approach itself when memslots are added/updated with the INCOHERENT flag. Maybe worth some experimental patches to find out? I'm still thinking about experimenting with the ARM private syscalls next though. drew [*] http://lists.gnu.org/archive/html/qemu-devel/2015-03/msg01254.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to
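For illustration, this is roughly how QEMU could use the madvise() interface proposed above. MADV_DONTCACHE does not exist in mainline, so the constant below is purely a placeholder, and the framebuffer size is arbitrary.

/*
 * Illustration only: MADV_DONTCACHE is the flag *proposed* in this
 * thread, not an existing ABI constant.
 */
#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_DONTCACHE
#define MADV_DONTCACHE	64	/* hypothetical value */
#endif

#define FB_SIZE		(16 * 1024 * 1024)

int main(void)
{
	void *fb = mmap(NULL, FB_SIZE, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (fb == MAP_FAILED)
		return 1;

	/* Ask the kernel to map the emulated framebuffer uncached. */
	if (madvise(fb, FB_SIZE, MADV_DONTCACHE))
		perror("madvise(MADV_DONTCACHE)");

	/* ... register fb as the memslot backing the guest framebuffer ... */
	return 0;
}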
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 03/04/2015 03:35 AM, Catalin Marinas wrote: (please try to avoid top-posting) On Mon, Mar 02, 2015 at 06:20:19PM -0800, Mario Smarduch wrote: On 03/02/2015 08:31 AM, Christoffer Dall wrote: However, my concerns with these patches are on two points: 1. It's not a fix-all. We still have the case where the guest expects the behavior of device memory (for strong ordering for example) on a RAM region, which we now break. Similarly this doesn't support the non-coherent DMA to RAM region case. 2. While the code is probably as nice as this kind of stuff gets, it is non-trivial and extremely difficult to debug. The counter-point here is that we may end up handling other stuff at EL2 for performance reasons in the future. Mainly because of point 1 above, I am leaning towards thinking userspace should do the invalidation when it knows it needs to, either through KVM via a memslot flag or through some other syscall mechanism. I expressed my concerns as well, I'm definitely against merging this series. I don't understand how the CPU can handle different cache attributes used by QEMU and the guest - won't you run into the B2.9 checklist? Wouldn't cache evictions or cleans wipe out guest updates to the same cache line(s)? Clean+invalidate is a safe operation even if the guest accesses the memory in a cacheable way. But if the guest can update the cache lines, Qemu should avoid cache maintenance from a performance perspective. The guest is either told that the DMA is coherent (via DT properties) or Qemu deals with (non-)coherency itself. The latter is fully in line with the B2.9 chapter in the ARM ARM, more precisely point 5: If the mismatched attributes for a memory location all assign the same shareability attribute to the location, any loss of uniprocessor semantics or coherency within a shareability domain can be avoided by use of software cache management. ... it continues with what kind of cache maintenance is required, together with: A clean and invalidate instruction can be used instead of a clean instruction, or instead of an invalidate instruction. Hi Catalin, sorry for the top posting. I'm struggling with QEMU cache maintenance for devices whose registers are not cache-line aligned and which may be multi-function; for lack of a better example I thought of the sp804, which supports two devices with registers covered by one cache line. Wouldn't QEMU cache maintenance on one device have the potential to corrupt the second device? These could be used by two guest threads in parallel. I understand bullets 2 and 3; I'm still working on the first one, it will take a while. Thanks, - Mario
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 03/05/2015 09:43 AM, Paolo Bonzini wrote: On 05/03/2015 15:58, Catalin Marinas wrote: It would especially suck if the user has a cluster with different machines, some of them coherent and others non-coherent, and then has to debug why the same configuration works on some machines and not on others. That's a problem indeed, especially with guest migration. But I don't think we have any sane solution here for the bus master DMA. I do not oppose doing cache management in QEMU for bus master DMA (though if the solution you outlined below works it would be great). ARM can override them as well but only making them stricter. Otherwise, on a weakly ordered architecture, it's not always safe (let's say the guest thinks it accesses Strongly Ordered memory and avoids barriers for flag updates but the host upgrades it to Cacheable which breaks the memory order). The same can happen on x86 though, even if it's rarer. You still need a barrier between stores and loads. If we want the host to enforce guest memory mapping attributes via stage 2, we could do it the other way around: get the guests to always assume full cache coherency, generating Normal Cacheable mappings, but use the stage 2 attributes restriction in the host to make such mappings non-cacheable when needed (it works this way on ARM but not in the other direction to relax the attributes). That sounds like a plan for device assignment. But it still would not solve the problem of the MMIO framebuffer, right? The problem arises with MMIO areas that the guest can reasonably expect to be uncacheable, but that are optimized by the host so that they end up backed by cacheable RAM. It's perfectly reasonable that the same device needs cacheable mapping with one userspace, and works with uncacheable mapping with another userspace that doesn't optimize the MMIO area to RAM. Unless the guest allocates the framebuffer itself (e.g. dma_alloc_coherent), we can't control the cacheability via dma-coherent properties as it refers to bus master DMA. Okay, it's good to rule that out. One less thing to think about. :) Same for _DSD. So for MMIO with the buffer allocated by the host (Qemu), the only solution I see on ARM is for the host to ensure coherency, either via explicit cache maintenance (new KVM API) or by changing the memory attributes used by Qemu to access such virtual MMIO. Basically Qemu is acting as a bus master when reading the framebuffer it allocated but the guest considers it a slave access and we don't have a way to tell the guest that such accesses should be cacheable, nor can we upgrade them via architecture features. Yes, that's a way to put it. In practice, the VGA framebuffer has an optimization that uses dirty page tracking, so we could piggyback on the ioctls that return which pages are dirty. It turns out that piggybacking on those ioctls also should fix the case of migrating a guest while the MMU is disabled. Yes, Qemu would need to invalidate the cache before reading a dirty framebuffer page. As I said above, an API that allows non-cacheable mappings for the VGA framebuffer in Qemu would also solve the problem. I'm not sure what KVM provides here (or whether we can add such API). Nothing for now; other architectures simply do not have the issue. As long as it's just VGA, we can quirk it. There's just a couple vendor/device IDs to catch, and the guest can then use a cacheable mapping. For a more generic solution, the API would be madvise(MADV_DONTCACHE). 
It would be easy for QEMU to use it, but I am not too optimistic about convincing the mm folks about it. We can try. Interested to see the outcome. I was thinking of a very basic memory driver that can provide an uncached memslot to QEMU - in mmap() file operation apply pgprot_uncached to allocated pages, lock them, flush TLB call remap_pfn_range(). Mario Paolo ___ kvmarm mailing list kvm...@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 05/03/2015 15:58, Catalin Marinas wrote: It would especially suck if the user has a cluster with different machines, some of them coherent and others non-coherent, and then has to debug why the same configuration works on some machines and not on others. That's a problem indeed, especially with guest migration. But I don't think we have any sane solution here for the bus master DMA. I do not oppose doing cache management in QEMU for bus master DMA (though if the solution you outlined below works it would be great). ARM can override them as well but only making them stricter. Otherwise, on a weakly ordered architecture, it's not always safe (let's say the guest thinks it accesses Strongly Ordered memory and avoids barriers for flag updates but the host upgrades it to Cacheable which breaks the memory order). The same can happen on x86 though, even if it's rarer. You still need a barrier between stores and loads. If we want the host to enforce guest memory mapping attributes via stage 2, we could do it the other way around: get the guests to always assume full cache coherency, generating Normal Cacheable mappings, but use the stage 2 attributes restriction in the host to make such mappings non-cacheable when needed (it works this way on ARM but not in the other direction to relax the attributes). That sounds like a plan for device assignment. But it still would not solve the problem of the MMIO framebuffer, right? The problem arises with MMIO areas that the guest can reasonably expect to be uncacheable, but that are optimized by the host so that they end up backed by cacheable RAM. It's perfectly reasonable that the same device needs cacheable mapping with one userspace, and works with uncacheable mapping with another userspace that doesn't optimize the MMIO area to RAM. Unless the guest allocates the framebuffer itself (e.g. dma_alloc_coherent), we can't control the cacheability via dma-coherent properties as it refers to bus master DMA. Okay, it's good to rule that out. One less thing to think about. :) Same for _DSD. So for MMIO with the buffer allocated by the host (Qemu), the only solution I see on ARM is for the host to ensure coherency, either via explicit cache maintenance (new KVM API) or by changing the memory attributes used by Qemu to access such virtual MMIO. Basically Qemu is acting as a bus master when reading the framebuffer it allocated but the guest considers it a slave access and we don't have a way to tell the guest that such accesses should be cacheable, nor can we upgrade them via architecture features. Yes, that's a way to put it. In practice, the VGA framebuffer has an optimization that uses dirty page tracking, so we could piggyback on the ioctls that return which pages are dirty. It turns out that piggybacking on those ioctls also should fix the case of migrating a guest while the MMU is disabled. Yes, Qemu would need to invalidate the cache before reading a dirty framebuffer page. As I said above, an API that allows non-cacheable mappings for the VGA framebuffer in Qemu would also solve the problem. I'm not sure what KVM provides here (or whether we can add such API). Nothing for now; other architectures simply do not have the issue. As long as it's just VGA, we can quirk it. There's just a couple vendor/device IDs to catch, and the guest can then use a cacheable mapping. For a more generic solution, the API would be madvise(MADV_DONTCACHE). It would be easy for QEMU to use it, but I am not too optimistic about convincing the mm folks about it. We can try. 
Paolo
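A sketch of the dirty-log piggybacking idea discussed above, not actual QEMU code: after KVM_GET_DIRTY_LOG reports which framebuffer pages the guest touched, QEMU would clean+invalidate those pages before reading them, so writes done through the guest's uncached mapping become visible to QEMU's cacheable mapping. The memslot plumbing and the dcache_clean_inval_range() helper are assumed here (one possible shape of that helper is sketched later in the thread, next to the B2.9 discussion).

/*
 * Sketch: sync the emulated VGA framebuffer using KVM dirty page
 * tracking plus explicit cache maintenance on the dirty pages.
 */
#include <linux/kvm.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical helper doing clean+invalidate by VA over a range. */
extern void dcache_clean_inval_range(void *addr, size_t len);

static void vga_sync_dirty(int vm_fd, int slot, uint8_t *fb_hva,
			   unsigned long npages, unsigned long page_size)
{
	uint64_t bitmap[(npages + 63) / 64];	/* VLA, for brevity only */
	struct kvm_dirty_log log = {
		.slot = slot,
		.dirty_bitmap = bitmap,
	};
	unsigned long i;

	memset(bitmap, 0, sizeof(bitmap));
	if (ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log) < 0)
		return;

	for (i = 0; i < npages; i++) {
		if (!(bitmap[i / 64] & (1ULL << (i % 64))))
			continue;
		/* Guest wrote this page uncached: drop our stale lines. */
		dcache_clean_inval_range(fb_hva + i * page_size, page_size);
		/* ... now safe to read/convert the page for display ... */
	}
}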
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 5 March 2015 at 15:58, Catalin Marinas catalin.mari...@arm.com wrote: On Thu, Mar 05, 2015 at 01:26:39PM +0100, Paolo Bonzini wrote: On 05/03/2015 13:03, Catalin Marinas wrote: I'd hate to have to do that. PCI should be entirely probeable given that we tell the guest where the host bridge is, that's one of its advantages. I didn't say a DT node per device, the DT doesn't know what PCI devices are available (otherwise it defeats the idea of probing). But we need to tell the OS where the host bridge is via DT. So the guest would be told about two host bridges: one for real devices and another for virtual devices. These can have different coherency properties. Yeah, and it would suck that the user needs to know the difference between the coherency proprties of the host bridges. The host needs to know about this, unless we assume full coherency on all the platforms. Arguably, Qemu needs to know as well if it is the one generating the DT for guest (or at least passing some snippets from the host DT). It would especially suck if the user has a cluster with different machines, some of them coherent and others non-coherent, and then has to debug why the same configuration works on some machines and not on others. That's a problem indeed, especially with guest migration. But I don't think we have any sane solution here for the bus master DMA. To avoid replying in two different places, which of the solutions look to me like something that half-works? Pretty much all of them, because in the end it is just a processor misfeature. For example, Intel virtualization extensions let the hypervisor override stage1 translation _if necessary_. AMD doesn't, but has some other quirky things that let you achieve the same effect.. ARM can override them as well but only making them stricter. Otherwise, on a weakly ordered architecture, it's not always safe (let's say the guest thinks it accesses Strongly Ordered memory and avoids barriers for flag updates but the host upgrades it to Cacheable which breaks the memory order). If we want the host to enforce guest memory mapping attributes via stage 2, we could do it the other way around: get the guests to always assume full cache coherency, generating Normal Cacheable mappings, but use the stage 2 attributes restriction in the host to make such mappings non-cacheable when needed (it works this way on ARM but not in the other direction to relax the attributes). This was precisely the idea of the MAIR mangling patch: promote all stage1 mappings to cacheable, and put the host in control by allowing it to supersede them with device mappings in stage 2. But it appears there are too many corner cases where this doesn't quite work out. In particular, I am not even sure that this is about bus coherency, because this problem does not happen when the device is doing bus master DMA. Working around coherency for bus master DMA would be easy. My previous emails on the dma-coherent property were only about bus master DMA (which would cause the correct selection of the DMA API ops in the guest). But even for bus master DMA, guest OS still needs to be aware of the (virtual) device DMA capabilities (cache coherent or not). You may be able to work around it in the host (stage 2, explicit cache flushing or SMMU attributes) if the guests assumes non-coherency but it's not really efficient (nor nice to implement). The problem arises with MMIO areas that the guest can reasonably expect to be uncacheable, but that are optimized by the host so that they end up backed by cacheable RAM. 
It's perfectly reasonable that the same device needs cacheable mapping with one userspace, and works with uncacheable mapping with another userspace that doesn't optimize the MMIO area to RAM. Unless the guest allocates the framebuffer itself (e.g. dma_alloc_coherent), we can't control the cacheability via dma-coherent properties as it refers to bus master DMA. So for MMIO with the buffer allocated by the host (Qemu), the only solution I see on ARM is for the host to ensure coherency, either via explicit cache maintenance (new KVM API) or by changing the memory attributes used by Qemu to access such virtual MMIO. Basically Qemu is acting as a bus master when reading the framebuffer it allocated but the guest considers it a slave access and we don't have a way to tell the guest that such accesses should be cacheable, nor can we upgrade them via architecture features. Currently the VGA framebuffer is the main case where this happen, and I don't expect many more. Because this is not bus master DMA, it's hard to find a QEMU API that can be hooked to invalidate the cache. QEMU is just reading from an array of chars. I now understand the problem better. I was under the impression that the guest allocates the framebuffer itself and tells Qemu where it is (like in amba-clcd.c for example). In practice, the VGA
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 5 March 2015 at 20:04, Catalin Marinas catalin.mari...@arm.com wrote: On Thu, Mar 05, 2015 at 11:12:22AM +0100, Paolo Bonzini wrote: On 04/03/2015 18:28, Catalin Marinas wrote: Can you add that property to the device tree for PCI devices too? Yes but not with mainline yet: http://thread.gmane.org/gmane.linux.kernel.iommu/8935 We can add the property at the PCI host bridge level (with the drawback that it covers all the PCI devices), like here: Even covering all PCI devices is not enough if we want to support device assignment of PCI host devices. Can we not have another PCI bridge node in the DT for the host device assignments? I'd hate to have to do that. PCI should be entirely probeable given that we tell the guest where the host bridge is, that's one of its advantages. -- PMM -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Thu, Mar 05, 2015 at 08:52:49PM +0900, Peter Maydell wrote: On 5 March 2015 at 20:04, Catalin Marinas catalin.mari...@arm.com wrote: On Thu, Mar 05, 2015 at 11:12:22AM +0100, Paolo Bonzini wrote: On 04/03/2015 18:28, Catalin Marinas wrote: Can you add that property to the device tree for PCI devices too? Yes but not with mainline yet: http://thread.gmane.org/gmane.linux.kernel.iommu/8935 We can add the property at the PCI host bridge level (with the drawback that it covers all the PCI devices), like here: Even covering all PCI devices is not enough if we want to support device assignment of PCI host devices. Can we not have another PCI bridge node in the DT for the host device assignments? I'd hate to have to do that. PCI should be entirely probeable given that we tell the guest where the host bridge is, that's one of its advantages. I didn't say a DT node per device, the DT doesn't know what PCI devices are available (otherwise it defeats the idea of probing). But we need to tell the OS where the host bridge is via DT. So the guest would be told about two host bridges: one for real devices and another for virtual devices. These can have different coherency properties. -- Catalin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 04/03/2015 18:28, Catalin Marinas wrote: Can you add that property to the device tree for PCI devices too? Yes but not with mainline yet: http://thread.gmane.org/gmane.linux.kernel.iommu/8935 We can add the property at the PCI host bridge level (with the drawback that it covers all the PCI devices), like here: Documentation/devicetree/bindings/pci/xgene-pci.txt Even covering all PCI devices is not enough if we want to support device assignment of PCI host devices. I'd rather not spend effort on a solution that we know will only half-work a few months down the road. Paolo
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Thu, Mar 05, 2015 at 11:12:22AM +0100, Paolo Bonzini wrote: On 04/03/2015 18:28, Catalin Marinas wrote: Can you add that property to the device tree for PCI devices too? Yes but not with mainline yet: http://thread.gmane.org/gmane.linux.kernel.iommu/8935 We can add the property at the PCI host bridge level (with the drawback that it covers all the PCI devices), like here: Even covering all PCI devices is not enough if we want to support device assignment of PCI host devices. Can we not have another PCI bridge node in the DT for the host device assignments? I'd rather not spend effort on a solution that we know will only half-work a few months down the road. Which of the solutions are you referring to? On the Applied Micro boards, for example, the PCIe is coherent. If you do device assignment, the guest must be aware of the coherency property of the physical device and behave accordingly, there isn't much the host can do about it. -- Catalin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 05/03/2015 13:03, Catalin Marinas wrote: I'd hate to have to do that. PCI should be entirely probeable given that we tell the guest where the host bridge is, that's one of its advantages. I didn't say a DT node per device, the DT doesn't know what PCI devices are available (otherwise it defeats the idea of probing). But we need to tell the OS where the host bridge is via DT. So the guest would be told about two host bridges: one for real devices and another for virtual devices. These can have different coherency properties. Yeah, and it would suck that the user needs to know the difference between the coherency proprties of the host bridges. It would especially suck if the user has a cluster with different machines, some of them coherent and others non-coherent, and then has to debug why the same configuration works on some machines and not on others. To avoid replying in two different places, which of the solutions look to me like something that half-works? Pretty much all of them, because in the end it is just a processor misfeature. For example, Intel virtualization extensions let the hypervisor override stage1 translation _if necessary_. AMD doesn't, but has some other quirky things that let you achieve the same effect.. In particular, I am not even sure that this is about bus coherency, because this problem does not happen when the device is doing bus master DMA. Working around coherency for bus master DMA would be easy. The problem arises with MMIO areas that the guest can reasonably expect to be uncacheable, but that are optimized by the host so that they end up backed by cacheable RAM. It's perfectly reasonable that the same device needs cacheable mapping with one userspace, and works with uncacheable mapping with another userspace that doesn't optimize the MMIO area to RAM. Currently the VGA framebuffer is the main case where this happen, and I don't expect many more. Because this is not bus master DMA, it's hard to find a QEMU API that can be hooked to invalidate the cache. QEMU is just reading from an array of chars. In practice, the VGA framebuffer has an optimization that uses dirty page tracking, so we could piggyback on the ioctls that return which pages are dirty. It turns out that piggybacking on those ioctls also should fix the case of migrating a guest while the MMU is disabled. But it's a hack, and it may not work for other devices. We could use _DSD to export the device tree property separately for each device, but that wouldn't work for hotplugged devices. Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Thu, Mar 05, 2015 at 01:26:39PM +0100, Paolo Bonzini wrote: On 05/03/2015 13:03, Catalin Marinas wrote: I'd hate to have to do that. PCI should be entirely probeable given that we tell the guest where the host bridge is, that's one of its advantages. I didn't say a DT node per device, the DT doesn't know what PCI devices are available (otherwise it defeats the idea of probing). But we need to tell the OS where the host bridge is via DT. So the guest would be told about two host bridges: one for real devices and another for virtual devices. These can have different coherency properties. Yeah, and it would suck that the user needs to know the difference between the coherency proprties of the host bridges. The host needs to know about this, unless we assume full coherency on all the platforms. Arguably, Qemu needs to know as well if it is the one generating the DT for guest (or at least passing some snippets from the host DT). It would especially suck if the user has a cluster with different machines, some of them coherent and others non-coherent, and then has to debug why the same configuration works on some machines and not on others. That's a problem indeed, especially with guest migration. But I don't think we have any sane solution here for the bus master DMA. To avoid replying in two different places, which of the solutions look to me like something that half-works? Pretty much all of them, because in the end it is just a processor misfeature. For example, Intel virtualization extensions let the hypervisor override stage1 translation _if necessary_. AMD doesn't, but has some other quirky things that let you achieve the same effect.. ARM can override them as well but only making them stricter. Otherwise, on a weakly ordered architecture, it's not always safe (let's say the guest thinks it accesses Strongly Ordered memory and avoids barriers for flag updates but the host upgrades it to Cacheable which breaks the memory order). If we want the host to enforce guest memory mapping attributes via stage 2, we could do it the other way around: get the guests to always assume full cache coherency, generating Normal Cacheable mappings, but use the stage 2 attributes restriction in the host to make such mappings non-cacheable when needed (it works this way on ARM but not in the other direction to relax the attributes). In particular, I am not even sure that this is about bus coherency, because this problem does not happen when the device is doing bus master DMA. Working around coherency for bus master DMA would be easy. My previous emails on the dma-coherent property were only about bus master DMA (which would cause the correct selection of the DMA API ops in the guest). But even for bus master DMA, guest OS still needs to be aware of the (virtual) device DMA capabilities (cache coherent or not). You may be able to work around it in the host (stage 2, explicit cache flushing or SMMU attributes) if the guests assumes non-coherency but it's not really efficient (nor nice to implement). The problem arises with MMIO areas that the guest can reasonably expect to be uncacheable, but that are optimized by the host so that they end up backed by cacheable RAM. It's perfectly reasonable that the same device needs cacheable mapping with one userspace, and works with uncacheable mapping with another userspace that doesn't optimize the MMIO area to RAM. Unless the guest allocates the framebuffer itself (e.g. 
dma_alloc_coherent), we can't control the cacheability via dma-coherent properties as it refers to bus master DMA. So for MMIO with the buffer allocated by the host (Qemu), the only solution I see on ARM is for the host to ensure coherency, either via explicit cache maintenance (new KVM API) or by changing the memory attributes used by Qemu to access such virtual MMIO. Basically Qemu is acting as a bus master when reading the framebuffer it allocated but the guest considers it a slave access and we don't have a way to tell the guest that such accesses should be cacheable, nor can we upgrade them via architecture features. Currently the VGA framebuffer is the main case where this happen, and I don't expect many more. Because this is not bus master DMA, it's hard to find a QEMU API that can be hooked to invalidate the cache. QEMU is just reading from an array of chars. I now understand the problem better. I was under the impression that the guest allocates the framebuffer itself and tells Qemu where it is (like in amba-clcd.c for example). In practice, the VGA framebuffer has an optimization that uses dirty page tracking, so we could piggyback on the ioctls that return which pages are dirty. It turns out that piggybacking on those ioctls also should fix the case of migrating a guest while the MMU is disabled. Yes, Qemu would need to invalidate the cache before reading a dirty framebuffer page. As I said above, an API that allows
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Wed, Mar 04, 2015 at 06:03:11PM +0100, Paolo Bonzini wrote: On 04/03/2015 15:29, Catalin Marinas wrote: I disagree it is 100% a host-side issue. It is a host-side issue _if_ the host tells the guest that the (virtual) device is non-coherent (or, more precisely, it does not explicitly tell the guest that the device is coherent). If the guest thinks the (virtual) device is non-coherent because of information passed by the host, I fully agree that the host needs to manage the cache coherency. However, the host could also pass a dma-coherent property in the DT given to the guest and avoid any form of cache maintenance. If the guest does not honour such coherency property, it's a guest problem and it needs fixing in the guest. This isn't any different from a real physical device behaviour. Can you add that property to the device tree for PCI devices too? Yes but not with mainline yet: http://thread.gmane.org/gmane.linux.kernel.iommu/8935 We can add the property at the PCI host bridge level (with the drawback that it covers all the PCI devices), like here: Documentation/devicetree/bindings/pci/xgene-pci.txt -- Catalin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 04/03/2015 15:29, Catalin Marinas wrote: I disagree it is 100% a host-side issue. It is a host-side issue _if_ the host tells the guest that the (virtual) device is non-coherent (or, more precisely, it does not explicitly tell the guest that the device is coherent). If the guest thinks the (virtual) device is non-coherent because of information passed by the host, I fully agree that the host needs to manage the cache coherency. However, the host could also pass a dma-coherent property in the DT given to the guest and avoid any form of cache maintenance. If the guest does not honour such coherency property, it's a guest problem and it needs fixing in the guest. This isn't any different from a real physical device behaviour. Can you add that property to the device tree for PCI devices too? Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
(please try to avoid top-posting) On Mon, Mar 02, 2015 at 06:20:19PM -0800, Mario Smarduch wrote: On 03/02/2015 08:31 AM, Christoffer Dall wrote: However, my concern with these patches are on two points: 1. It's not a fix-all. We still have the case where the guest expects the behavior of device memory (for strong ordering for example) on a RAM region, which we now break. Similiarly this doesn't support the non-coherent DMA to RAM region case. 2. While the code is probably as nice as this kind of stuff gets, it is non-trivial and extremely difficult to debug. The counter-point here is that we may end up handling other stuff at EL2 for performanc reasons in the future. Mainly because of point 1 above, I am leaning to thinking userspace should do the invalidation when it knows it needs to, either through KVM via a memslot flag or through some other syscall mechanism. I expressed my concerns as well, I'm definitely against merging this series. I don't understand how can the CPU handle different cache attributes used by QEMU and Guest won't you run into B2.9 checklist? Wouldn't cache evictions or cleans wipe out guest updates to same cache line(s)? Clean+invalidate is a safe operation even if the guest accesses the memory in a cacheable way. But if the guest can update the cache lines, Qemu should avoid cache maintenance from a performance perspective. The guest is either told that the DMA is coherent (via DT properties) or Qemu deals with (non-)coherency itself. The latter is fully in line with the B2.9 chapter in the ARM ARM, more precisely point 5: If the mismatched attributes for a memory location all assign the same shareability attribute to the location, any loss of uniprocessor semantics or coherency within a shareability domain can be avoided by use of software cache management. ... it continues with what kind of cache maintenance is required, together with: A clean and invalidate instruction can be used instead of a clean instruction, or instead of an invalidate instruction. -- Catalin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
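One way the "software cache management" referred to in B2.9 point 5 could look from userspace, as a clean+invalidate-by-VA loop. This assumes the host kernel permits EL0 cache maintenance (SCTLR_EL1.UCI set, as Linux does on arm64); it is illustrative, not a drop-in QEMU patch, and it is the kind of helper assumed in the dirty-log sketch earlier in the thread.

/*
 * Clean+invalidate a VA range to PoC from userspace, line by line.
 */
#include <stddef.h>
#include <stdint.h>

void dcache_clean_inval_range(void *addr, size_t len)
{
	uintptr_t p, end = (uintptr_t)addr + len;
	uint64_t ctr;
	size_t line;

	/* CTR_EL0.DminLine = log2(words) of the smallest D-cache line. */
	asm volatile("mrs %0, ctr_el0" : "=r"(ctr));
	line = 4UL << ((ctr >> 16) & 0xf);

	for (p = (uintptr_t)addr & ~(line - 1); p < end; p += line)
		/*
		 * Clean and invalidate, per the ARM ARM guidance that a
		 * clean+invalidate may be used instead of either operation.
		 */
		asm volatile("dc civac, %0" : : "r"(p) : "memory");

	asm volatile("dsb sy" : : : "memory");
}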
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 4 March 2015 at 12:35, Catalin Marinas catalin.mari...@arm.com wrote: (please try to avoid top-posting) On Mon, Mar 02, 2015 at 06:20:19PM -0800, Mario Smarduch wrote: On 03/02/2015 08:31 AM, Christoffer Dall wrote: However, my concern with these patches are on two points: 1. It's not a fix-all. We still have the case where the guest expects the behavior of device memory (for strong ordering for example) on a RAM region, which we now break. Similiarly this doesn't support the non-coherent DMA to RAM region case. 2. While the code is probably as nice as this kind of stuff gets, it is non-trivial and extremely difficult to debug. The counter-point here is that we may end up handling other stuff at EL2 for performanc reasons in the future. Mainly because of point 1 above, I am leaning to thinking userspace should do the invalidation when it knows it needs to, either through KVM via a memslot flag or through some other syscall mechanism. I expressed my concerns as well, I'm definitely against merging this series. Don't worry, that was never the intention, at least not as-is :-) I think we have established that the performance hit is not the problem but the correctness is. I do have a remaining question, though: my original [non-working] approach was to replace uncached mappings with write-through read-allocate write-allocate, which I expected would keep the caches in sync with main memory, but apparently I am misunderstanding something here. (This is the reason for s/0xbb/0xff/ in patch #2 to get it to work: it replaces WT/RA/WA with WB/RA/WA) Is there no way to use write-through caching here? I don't understand how can the CPU handle different cache attributes used by QEMU and Guest won't you run into B2.9 checklist? Wouldn't cache evictions or cleans wipe out guest updates to same cache line(s)? Clean+invalidate is a safe operation even if the guest accesses the memory in a cacheable way. But if the guest can update the cache lines, Qemu should avoid cache maintenance from a performance perspective. The guest is either told that the DMA is coherent (via DT properties) or Qemu deals with (non-)coherency itself. The latter is fully in line with the B2.9 chapter in the ARM ARM, more precisely point 5: If the mismatched attributes for a memory location all assign the same shareability attribute to the location, any loss of uniprocessor semantics or coherency within a shareability domain can be avoided by use of software cache management. ... it continues with what kind of cache maintenance is required, together with: A clean and invalidate instruction can be used instead of a clean instruction, or instead of an invalidate instruction. -- Catalin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
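For reference, the attribute bytes swapped by the s/0xbb/0xff/ change in patch #2 decode as follows (the macro names below are made up; the encodings are from the ARMv8 ARM, upper nibble = outer, lower nibble = inner):

/* 0xbb = Normal, Write-Through non-transient, Read-Allocate,
 *        Write-Allocate, inner and outer (the original attempt). */
#define MAIR_ATTR_NORMAL_WT_RA_WA	0xbbUL
/* 0xff = Normal, Write-Back non-transient, Read-Allocate,
 *        Write-Allocate, inner and outer (what actually worked). */
#define MAIR_ATTR_NORMAL_WB_RA_WA	0xffUL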
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Wed, Mar 04, 2015 at 12:50:57PM +0100, Ard Biesheuvel wrote: On 4 March 2015 at 12:35, Catalin Marinas catalin.mari...@arm.com wrote: On Mon, Mar 02, 2015 at 06:20:19PM -0800, Mario Smarduch wrote: On 03/02/2015 08:31 AM, Christoffer Dall wrote: However, my concern with these patches are on two points: 1. It's not a fix-all. We still have the case where the guest expects the behavior of device memory (for strong ordering for example) on a RAM region, which we now break. Similiarly this doesn't support the non-coherent DMA to RAM region case. 2. While the code is probably as nice as this kind of stuff gets, it is non-trivial and extremely difficult to debug. The counter-point here is that we may end up handling other stuff at EL2 for performanc reasons in the future. Mainly because of point 1 above, I am leaning to thinking userspace should do the invalidation when it knows it needs to, either through KVM via a memslot flag or through some other syscall mechanism. I expressed my concerns as well, I'm definitely against merging this series. Don't worry, that was never the intention, at least not as-is :-) I wasn't worried, just wanted to make my position clearer ;). I think we have established that the performance hit is not the problem but the correctness is. I haven't looked at the performance figures but has anyone assessed the hit caused by doing cache maintenance in Qemu vs cacheable guest accesses (and no maintenance)? I do have a remaining question, though: my original [non-working] approach was to replace uncached mappings with write-through read-allocate write-allocate, Does it make sense to have write-through and write-allocate at the same time? The write-allocate hint would probably be ignored as write-through writes do not generate linefills. which I expected would keep the caches in sync with main memory, but apparently I am misunderstanding something here. (This is the reason for s/0xbb/0xff/ in patch #2 to get it to work: it replaces WT/RA/WA with WB/RA/WA) Is there no way to use write-through caching here? Write-through is considered non-cacheable from a write perspective when it does not hit in the cache. AFAIK, it should still be able to hit existing cache lines and evict. The ARM ARM states that cache cleaning to _PoU_ is not required for coherency when the writes are to write-through memory but I have to dig further into the PoC because that's what we care about here. What platform did you test it on? I can't tell what the behaviour of system caches is. I know they intercept explicit cache maintenance by VA but not sure what happens to write-through writes when they hit in the system cache (are they evicted to RAM or not?). If such write-through writes are only evicted to the point-of-unification, they won't work since non-cacheable accesses go all the way to PoC. I need to do more reading through the ARM ARM, it should be hidden somewhere ;). -- Catalin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 4 March 2015 at 23:29, Catalin Marinas catalin.mari...@arm.com wrote: I disagree it is 100% a host-side issue. It is a host-side issue _if_ the host tells the guest that the (virtual) device is non-coherent (or, more precisely, it does not explicitly tell the guest that the device is coherent). If the guest thinks the (virtual) device is non-coherent because of information passed by the host, I fully agree that the host needs to manage the cache coherency. However, the host could also pass a dma-coherent property in the DT given to the guest and avoid any form of cache maintenance. If the guest does not honour such coherency property, it's a guest problem and it needs fixing in the guest. This isn't any different from a real physical device behaviour. Right, and we should do that for things like virtio, because we want the performance. But we also have devices (like vga framebuffers) which shouldn't be handled as cacheable, so we need to be able to deal with both situations. -- PMM -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Wed, Mar 04, 2015 at 03:12:12PM +0100, Andrew Jones wrote: On Wed, Mar 04, 2015 at 01:43:02PM +0100, Ard Biesheuvel wrote: On 4 March 2015 at 13:29, Catalin Marinas catalin.mari...@arm.com wrote: On Wed, Mar 04, 2015 at 12:50:57PM +0100, Ard Biesheuvel wrote: I think we have established that the performance hit is not the problem but the correctness is. I haven't looked at the performance figures but has anyone assessed the hit caused by doing cache maintenance in Qemu vs cacheable guest accesses (and no maintenance)? I'm working on a PoC of a QEMU/KVM cache maintenance approach now. Hopefully I'll send it out this evening. Tomorrow at the latest. Getting numbers of that approach vs. a guest's use of cached memory for devices would take a decent amount of additional work, so won't be part of that post. OK. I'm actually not sure we should care about the numbers for a guest using normal mem attributes for device memory - other than out of curiosity. For correctness this issue really needs to be solved 100% host-side. We can't rely on a guest to do different/weird things, just because it's a guest. Ideally guests don't even know that they're guests. (Even if we describe the memory as cache-able to the guest, I don't think we can rely on the guest believing us.) I disagree it is 100% a host-side issue. It is a host-side issue _if_ the host tells the guest that the (virtual) device is non-coherent (or, more precisely, it does not explicitly tell the guest that the device is coherent). If the guest thinks the (virtual) device is non-coherent because of information passed by the host, I fully agree that the host needs to manage the cache coherency. However, the host could also pass a dma-coherent property in the DT given to the guest and avoid any form of cache maintenance. If the guest does not honour such coherency property, it's a guest problem and it needs fixing in the guest. This isn't any different from a real physical device behaviour. (there are counter arguments for the latter as well like emulating old platforms that never had coherency but from a performance/production perspective, I strongly recommend that guests are passed the dma-coherent property for such virtual devices) -- Catalin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 4 March 2015 at 13:29, Catalin Marinas catalin.mari...@arm.com wrote: On Wed, Mar 04, 2015 at 12:50:57PM +0100, Ard Biesheuvel wrote: On 4 March 2015 at 12:35, Catalin Marinas catalin.mari...@arm.com wrote: On Mon, Mar 02, 2015 at 06:20:19PM -0800, Mario Smarduch wrote: On 03/02/2015 08:31 AM, Christoffer Dall wrote: However, my concern with these patches are on two points: 1. It's not a fix-all. We still have the case where the guest expects the behavior of device memory (for strong ordering for example) on a RAM region, which we now break. Similiarly this doesn't support the non-coherent DMA to RAM region case. 2. While the code is probably as nice as this kind of stuff gets, it is non-trivial and extremely difficult to debug. The counter-point here is that we may end up handling other stuff at EL2 for performanc reasons in the future. Mainly because of point 1 above, I am leaning to thinking userspace should do the invalidation when it knows it needs to, either through KVM via a memslot flag or through some other syscall mechanism. I expressed my concerns as well, I'm definitely against merging this series. Don't worry, that was never the intention, at least not as-is :-) I wasn't worried, just wanted to make my position clearer ;). I think we have established that the performance hit is not the problem but the correctness is. I haven't looked at the performance figures but has anyone assessed the hit caused by doing cache maintenance in Qemu vs cacheable guest accesses (and no maintenance)? No, I don't think so. The performance hit I am referring to is the performance hit caused by leaving the trapping of VM system register writes enabled all the time, so that writes to MAIR_EL1 are always caught. This is why patch #1 implements some of the sysreg write handling in EL2 I do have a remaining question, though: my original [non-working] approach was to replace uncached mappings with write-through read-allocate write-allocate, Does it make sense to have write-through and write-allocate at the same time? The write-allocate hint would probably be ignored as write-through writes do not generate linefills. OK, that answers my question then. The availability of a write-allocate setting on write-through attributes suggested to me that writes would go to both the cache and main memory, so that the write-back cached attribute the host is using for the same memory would not result in it reading stale data. which I expected would keep the caches in sync with main memory, but apparently I am misunderstanding something here. (This is the reason for s/0xbb/0xff/ in patch #2 to get it to work: it replaces WT/RA/WA with WB/RA/WA) Is there no way to use write-through caching here? Write-through is considered non-cacheable from a write perspective when it does not hit in the cache. AFAIK, it should still be able to hit existing cache lines and evict. The ARM ARM states that cache cleaning to _PoU_ is not required for coherency when the writes are to write-through memory but I have to dig further into the PoC because that's what we care about here. What platform did you test it on? I can't tell what the behaviour of system caches is. I know they intercept explicit cache maintenance by VA but not sure what happens to write-through writes when they hit in the system cache (are they evicted to RAM or not?). If such write-through writes are only evicted to the point-of-unification, they won't work since non-cacheable accesses go all the way to PoC. 
This was tested on APM by Drew and Laszlo (thanks guys). I have recently received a Seattle myself, but I haven't had time yet to test these patches. I need to do more reading through the ARM ARM, it should be hidden somewhere ;). If you say so :-) -- Ard.
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Wed, Mar 04, 2015 at 01:43:02PM +0100, Ard Biesheuvel wrote: On 4 March 2015 at 13:29, Catalin Marinas catalin.mari...@arm.com wrote: On Wed, Mar 04, 2015 at 12:50:57PM +0100, Ard Biesheuvel wrote: On 4 March 2015 at 12:35, Catalin Marinas catalin.mari...@arm.com wrote: On Mon, Mar 02, 2015 at 06:20:19PM -0800, Mario Smarduch wrote: On 03/02/2015 08:31 AM, Christoffer Dall wrote: However, my concern with these patches are on two points: 1. It's not a fix-all. We still have the case where the guest expects the behavior of device memory (for strong ordering for example) on a RAM region, which we now break. Similiarly this doesn't support the non-coherent DMA to RAM region case. 2. While the code is probably as nice as this kind of stuff gets, it is non-trivial and extremely difficult to debug. The counter-point here is that we may end up handling other stuff at EL2 for performanc reasons in the future. Mainly because of point 1 above, I am leaning to thinking userspace should do the invalidation when it knows it needs to, either through KVM via a memslot flag or through some other syscall mechanism. I expressed my concerns as well, I'm definitely against merging this series. Don't worry, that was never the intention, at least not as-is :-) I wasn't worried, just wanted to make my position clearer ;). I think we have established that the performance hit is not the problem but the correctness is. I haven't looked at the performance figures but has anyone assessed the hit caused by doing cache maintenance in Qemu vs cacheable guest accesses (and no maintenance)? I'm working on a PoC of a QEMU/KVM cache maintenance approach now. Hopefully I'll send it out this evening. Tomorrow at the latest. Getting numbers of that approach vs. a guest's use of cached memory for devices would take a decent amount of additional work, so won't be part of that post. I'm actually not sure we should care about the numbers for a guest using normal mem attributes for device memory - other than out of curiosity. For correctness this issue really needs to be solved 100% host-side. We can't rely on a guest to do different/weird things, just because it's a guest. Ideally guests don't even know that they're guests. (Even if we describe the memory as cache-able to the guest, I don't think we can rely on the guest believing us.) drew -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Thu, Feb 19, 2015 at 10:54:43AM +, Ard Biesheuvel wrote: This is a 0th order approximation of how we could potentially force the guest to avoid uncached mappings, at least from the moment the MMU is on. (Before that, all of memory is implicitly classified as Device-nGnRnE) That's just for data accesses. IIRC instructions are cacheable on ARMv8 (though I think without allocation in the unified caches). The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings with cached ones. This way, there is no need to mangle any guest page tables. There is another big downside to this - breaking the guest assumptions about the (non-)cacheability of its mappings. It also only works for guests that use MAIR_EL1 (LPAE). We have two main cases where the guest and host cacheability do not match: 1. During boot, as you said, when the MMU is off. What we have done in the guest kernel is to invalidate the data ranges that it writes with the MMU off in case there were any speculatively loaded cache lines via the cacheable mappings (in the host). We don't have any nice solution in the host here and MAIR_EL1 tweaking does not work 2. Guest explicitly creating a non-cacheable mapping (MMU enabled). Here we have two sub-cases: a) guest-only accesses to such mapping. The guest would need to perform cache maintenance as required if it ever accesses such memory via cacheable mappings (we do this already, see the streaming DMA API) b) memory shared with the host: e.g Qemu emulating DMA (frame buffer etc.) This 2.b case is not any different than the OS dealing with a (non-)coherent DMA-capable device. If the device is coherent, the DMA buffer in the guest must be coherent as well, otherwise non-coherent. Imagine a real VGA device that always snoops CPU caches. You would not create a non-cacheable frame buffer mapping since the device cannot see the updates and only read stale cache entries. We don't (can't) have a safe set of DMA ops that would work in both cases. So if Qemu cannot use a non-cacheable mapping or cannot perform cache maintenance, the only solution is to tell the guest that such virtual device is cache _coherent_. This also gives you better performance overall anyway. -- Catalin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
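To make the "explicit cache maintenance" option concrete, here is a minimal sketch of a clean+invalidate by VA that Qemu could run over a framebuffer range, assuming the kernel has enabled EL0 data cache maintenance (SCTLR_EL1.UCI) and assuming a fixed 64-byte line instead of reading CTR_EL0; there is no such KVM or Qemu interface today:

#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE	64

/* Clean+invalidate [buf, buf+len) to the PoC by VA from userspace. */
static void flush_dcache_range(void *buf, size_t len)
{
	uintptr_t p = (uintptr_t)buf & ~((uintptr_t)CACHE_LINE - 1);
	uintptr_t end = (uintptr_t)buf + len;

	for (; p < end; p += CACHE_LINE)
		asm volatile("dc civac, %0" : : "r" (p) : "memory");
	asm volatile("dsb sy" : : : "memory");
}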
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 03/03/15 18:34, Alexander Graf wrote: On 02/19/2015 11:54 AM, Ard Biesheuvel wrote: This is a 0th order approximation of how we could potentially force the guest to avoid uncached mappings, at least from the moment the MMU is on. (Before that, all of memory is implicitly classified as Device-nGnRnE) The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings with cached ones. This way, there is no need to mangle any guest page tables. The downside is that, to do this correctly, we need to always trap writes to the VM sysreg group, which includes registers that the guest may write to very often. To reduce the associated performance hit, patch #1 introduces a fast path for EL2 to perform trivial sysreg writes on behalf of the guest, without the need for a full world switch to the host and back. The main purpose of these patches is to quantify the performance hit, and verify whether the MAIR_EL1 handling works correctly. I gave this a quick spin on a VM running with QEMU. * VGA output is still distorted, I get random junk black lines in the output in between * When I add -device nec-usb-xhci -device usb-kbd the VM doesn't even boot up Do you also have the dirty page tracking patches in your host kernel? I needed both (and got them via Drew's backport, thanks) and then both VGA and USB started working fine. Without the MAIR patches, I got cache-line size random corruptions in the VGA display (16 pixel wide small segments). Without dirty page tracking, big chunks (sometimes even almost the entire screen) was blank. Regarding USB, unless you have both of the patchsets in the host kernel, the guest will indeed crash early during boot. Gerd confirmed for me that usb controller (all uhci/ehci/xhci) pci regions see both read (status bits) and write (control bits) access. So if there's any corruption in there, on read, that looks like a malfunctioning piece of hw for the guest kernel, and in this case it happens to crash. With TCG, both bits work fine. Yep. Thanks Laszlo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Tue, Mar 03, 2015 at 07:13:48PM +0100, Laszlo Ersek wrote: On 03/03/15 18:34, Alexander Graf wrote: On 02/19/2015 11:54 AM, Ard Biesheuvel wrote: This is a 0th order approximation of how we could potentially force the guest to avoid uncached mappings, at least from the moment the MMU is on. (Before that, all of memory is implicitly classified as Device-nGnRnE) The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings with cached ones. This way, there is no need to mangle any guest page tables. The downside is that, to do this correctly, we need to always trap writes to the VM sysreg group, which includes registers that the guest may write to very often. To reduce the associated performance hit, patch #1 introduces a fast path for EL2 to perform trivial sysreg writes on behalf of the guest, without the need for a full world switch to the host and back. The main purpose of these patches is to quantify the performance hit, and verify whether the MAIR_EL1 handling works correctly. I gave this a quick spin on a VM running with QEMU. * VGA output is still distorted, I get random junk black lines in the output in between * When I add -device nec-usb-xhci -device usb-kbd the VM doesn't even boot up Do you also have the dirty page tracking patches in your host kernel? I needed both (and got them via Drew's backport, thanks) and then both VGA and USB started working fine. Assuming you have the dirty page tracking already, then you're probably missing the fixup to patch 2/3, s/0xbb/0xff/ Without the MAIR patches, I got cache-line size random corruptions in the VGA display (16 pixel wide small segments). Without dirty page tracking, big chunks (sometimes even almost the entire screen) was blank. Regarding USB, unless you have both of the patchsets in the host kernel, the guest will indeed crash early during boot. Gerd confirmed for me that usb controller (all uhci/ehci/xhci) pci regions see both read (status bits) and write (control bits) access. So if there's any corruption in there, on read, that looks like a malfunctioning piece of hw for the guest kernel, and in this case it happens to crash. With TCG, both bits work fine. Yep. Thanks Laszlo ___ kvmarm mailing list kvm...@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 02/19/2015 11:54 AM, Ard Biesheuvel wrote: This is a 0th order approximation of how we could potentially force the guest to avoid uncached mappings, at least from the moment the MMU is on. (Before that, all of memory is implicitly classified as Device-nGnRnE) The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings with cached ones. This way, there is no need to mangle any guest page tables. The downside is that, to do this correctly, we need to always trap writes to the VM sysreg group, which includes registers that the guest may write to very often. To reduce the associated performance hit, patch #1 introduces a fast path for EL2 to perform trivial sysreg writes on behalf of the guest, without the need for a full world switch to the host and back. The main purpose of these patches is to quantify the performance hit, and verify whether the MAIR_EL1 handling works correctly. I gave this a quick spin on a VM running with QEMU. * VGA output is still distorted, I get random junk black lines in the output in between * When I add -device nec-usb-xhci -device usb-kbd the VM doesn't even boot up With TCG, both bits work fine. Alex Ard Biesheuvel (3): arm64: KVM: handle some sysreg writes in EL2 arm64: KVM: mangle MAIR register to prevent uncached guest mappings arm64: KVM: keep trapping of VM sysreg writes enabled arch/arm/kvm/mmu.c | 2 +- arch/arm64/include/asm/kvm_arm.h | 2 +- arch/arm64/kvm/hyp.S | 101 +++ arch/arm64/kvm/sys_regs.c| 63 4 files changed, 156 insertions(+), 12 deletions(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
Hi Christoffer, I don't understand how can the CPU handle different cache attributes used by QEMU and Guest won't you run into B2.9 checklist? Wouldn't cache evictions or cleans wipe out guest updates to same cache line(s)? - Mario On 03/02/2015 08:31 AM, Christoffer Dall wrote: On Tue, Feb 24, 2015 at 05:47:19PM +, Ard Biesheuvel wrote: On 24 February 2015 at 14:55, Andrew Jones drjo...@redhat.com wrote: On Fri, Feb 20, 2015 at 04:36:26PM +0100, Andrew Jones wrote: On Fri, Feb 20, 2015 at 02:37:25PM +, Ard Biesheuvel wrote: On 20 February 2015 at 14:29, Andrew Jones drjo...@redhat.com wrote: So looks like the 3 orders of magnitude greater number of traps (only to el2) don't impact kernel compiles. OK, good! That was what I was hoping for, obviously. Then I thought I'd be able to quick measure the number of cycles a trap to el2 takes with this kvm-unit-tests test int main(void) { unsigned long start, end; unsigned int sctlr; asm volatile( mrs %0, sctlr_el1\n msr pmcr_el0, %1\n : =r (sctlr) : r (5)); asm volatile( mrs %0, pmccntr_el0\n msr sctlr_el1, %2\n mrs %1, pmccntr_el0\n : =r (start), =r (end) : r (sctlr)); printf(%llx\n, end - start); return 0; } after applying this patch to kvm diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S index bb91b6fc63861..5de39d740aa58 100644 --- a/arch/arm64/kvm/hyp.S +++ b/arch/arm64/kvm/hyp.S @@ -770,7 +770,7 @@ mrs x2, mdcr_el2 and x2, x2, #MDCR_EL2_HPMN_MASK - orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) +// orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) orr x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA) // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap But I get zero for the cycle count. Not sure what I'm missing. No clue tbh. Does the counter work as expected in the host? Guess not. I dropped the test into a module_init and inserted it on the host. Always get zero for pmccntr_el0 reads. Or, if I set it to something non-zero with a write, then I always get that back - no increments. pmcr_el0 looks OK... I had forgotten to set bit 31 of pmcntenset_el0, but doing that still doesn't help. Anyway, I assume the problem is me. I'll keep looking to see what I'm missing. I returned to this and see that the problem was indeed me. I needed yet another enable bit set (the filter register needed to be instructed to count cycles while in el2). I've attached the code for the curious. The numbers are mean=6999, std_dev=242. Run on the host, or in a guest running on a host without this patch series (after TVM traps have been disabled), I get a pretty consistent 40. I checked how many vm-sysreg traps we do during the kernel compile benchmark. It's 124924. So it's a bit strange that we don't see the benchmark taking 10 to 20 seconds longer on average. I should probably double check my runs. In any case, while I like the approach of this series, the overhead is looking non-negligible. Thanks a lot for producing these numbers. 125k x 7k == 1 billion cycles == 1 second on a 1 GHz machine, I think? Or am I missing something? How long does the actual compile take? I ran a sequence of benchmarks that I occasionally run (pbzip, kernbench, and hackbench) and I also saw 1% performance degradation, so I think we can trust that somewhat. (I can post the raw numbers when I have ssh access to my Linux desktop - sending this from Somewhere Over The Atlantic). However, my concern with these patches are on two points: 1. It's not a fix-all. 
We still have the case where the guest expects the behavior of device memory (for strong ordering for example) on a RAM region, which we now break. Similarly, this doesn't support the non-coherent DMA to RAM region case. 2. While the code is probably as nice as this kind of stuff gets, it is non-trivial and extremely difficult to debug. The counter-point here is that we may end up handling other stuff at EL2 for performance reasons in the future. Mainly because of point 1 above, I am leaning to thinking userspace should do the invalidation when it knows it needs to, either through KVM via a memslot flag or through some other syscall mechanism. Thanks, -Christoffer
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Tue, Feb 24, 2015 at 05:47:19PM +, Ard Biesheuvel wrote: On 24 February 2015 at 14:55, Andrew Jones drjo...@redhat.com wrote: On Fri, Feb 20, 2015 at 04:36:26PM +0100, Andrew Jones wrote: On Fri, Feb 20, 2015 at 02:37:25PM +, Ard Biesheuvel wrote: On 20 February 2015 at 14:29, Andrew Jones drjo...@redhat.com wrote: So looks like the 3 orders of magnitude greater number of traps (only to el2) don't impact kernel compiles. OK, good! That was what I was hoping for, obviously. Then I thought I'd be able to quick measure the number of cycles a trap to el2 takes with this kvm-unit-tests test int main(void) { unsigned long start, end; unsigned int sctlr; asm volatile( mrs %0, sctlr_el1\n msr pmcr_el0, %1\n : =r (sctlr) : r (5)); asm volatile( mrs %0, pmccntr_el0\n msr sctlr_el1, %2\n mrs %1, pmccntr_el0\n : =r (start), =r (end) : r (sctlr)); printf(%llx\n, end - start); return 0; } after applying this patch to kvm diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S index bb91b6fc63861..5de39d740aa58 100644 --- a/arch/arm64/kvm/hyp.S +++ b/arch/arm64/kvm/hyp.S @@ -770,7 +770,7 @@ mrs x2, mdcr_el2 and x2, x2, #MDCR_EL2_HPMN_MASK - orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) +// orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) orr x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA) // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap But I get zero for the cycle count. Not sure what I'm missing. No clue tbh. Does the counter work as expected in the host? Guess not. I dropped the test into a module_init and inserted it on the host. Always get zero for pmccntr_el0 reads. Or, if I set it to something non-zero with a write, then I always get that back - no increments. pmcr_el0 looks OK... I had forgotten to set bit 31 of pmcntenset_el0, but doing that still doesn't help. Anyway, I assume the problem is me. I'll keep looking to see what I'm missing. I returned to this and see that the problem was indeed me. I needed yet another enable bit set (the filter register needed to be instructed to count cycles while in el2). I've attached the code for the curious. The numbers are mean=6999, std_dev=242. Run on the host, or in a guest running on a host without this patch series (after TVM traps have been disabled), I get a pretty consistent 40. I checked how many vm-sysreg traps we do during the kernel compile benchmark. It's 124924. So it's a bit strange that we don't see the benchmark taking 10 to 20 seconds longer on average. I should probably double check my runs. In any case, while I like the approach of this series, the overhead is looking non-negligible. Thanks a lot for producing these numbers. 125k x 7k == 1 billion cycles == 1 second on a 1 GHz machine, I think? Or am I missing something? How long does the actual compile take? I ran a sequence of benchmarks that I occasionally run (pbzip, kernbench, and hackbench) and I also saw 1% performance degradation, so I think we can trust that somewhat. (I can post the raw numbers when I have ssh access to my Linux desktop - sending this from Somewhere Over The Atlantic). However, my concern with these patches are on two points: 1. It's not a fix-all. We still have the case where the guest expects the behavior of device memory (for strong ordering for example) on a RAM region, which we now break. Similiarly this doesn't support the non-coherent DMA to RAM region case. 2. While the code is probably as nice as this kind of stuff gets, it is non-trivial and extremely difficult to debug. 
The counter-point here is that we may end up handling other stuff at EL2 for performance reasons in the future. Mainly because of point 1 above, I am leaning to thinking userspace should do the invalidation when it knows it needs to, either through KVM via a memslot flag or through some other syscall mechanism. Thanks, -Christoffer
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 03/02/15 17:47, Paolo Bonzini wrote: Also, we may want to invalidate the cache for dirty pages before returning the dirty bitmap, and probably should do that directly in KVM_GET_DIRTY_LOG. I agree. If KVM_GET_DIRTY_LOG is supposed to be atomic fetch and clear (from userspace's aspect), then the cache invalidation should be an atomic part of it too (from the same aspect). (Sorry if I just said something incredibly stupid.) Laszlo
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Mon, Mar 02, 2015 at 05:55:44PM +0100, Laszlo Ersek wrote: On 03/02/15 17:47, Paolo Bonzini wrote: Also, we may want to invalidate the cache for dirty pages before returning the dirty bitmap, and probably should do that directly in KVM_GET_DIRTY_LOG. I agree. If KVM_GET_DIRTY_LOG is supposed to be atomic fetch and clear (from userspace's aspect), then the cache invalidation should be an atomic part of it too (from the same aspect). (Sorry if I just said something incredibly stupid.) With the path I'm headed down, all cache maintenance operations will be done before exiting to userspace (and after returning). I was actually already letting a feature creep into this PoC by setting KVM_MEM_LOG_DIRTY_PAGES when we see KVM_MEM_INCOHERENT has been set, and the region isn't readonly. The dirty log would then be used by KVM internally to know exactly which pages need to be invalidated before the exit. drew
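To illustrate the memslot-flag direction, this is roughly what registering an "incoherent" framebuffer slot from userspace could look like. KVM_MEM_INCOHERENT is only being prototyped in this thread, so its value below is made up; the structure and KVM_SET_USER_MEMORY_REGION are the existing interfaces:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#ifndef KVM_MEM_INCOHERENT
#define KVM_MEM_INCOHERENT	(1UL << 2)	/* hypothetical value */
#endif

static int register_incoherent_slot(int vm_fd, uint32_t slot, uint64_t gpa,
				    uint64_t size, void *hva)
{
	struct kvm_userspace_memory_region region;

	memset(&region, 0, sizeof(region));
	region.slot = slot;
	region.flags = KVM_MEM_INCOHERENT | KVM_MEM_LOG_DIRTY_PAGES;
	region.guest_phys_addr = gpa;
	region.memory_size = size;
	region.userspace_addr = (uintptr_t)hva;

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}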
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 02/03/2015 17:31, Christoffer Dall wrote: 2. While the code is probably as nice as this kind of stuff gets, it is non-trivial and extremely difficult to debug. The counter-point here is that we may end up handling other stuff at EL2 for performance reasons in the future. Mainly because of point 1 above, I am leaning to thinking userspace should do the invalidation when it knows it needs to, either through KVM via a memslot flag or through some other syscall mechanism. I'm okay with adding a KVM capability and ioctl that flushes the dcache for a given gpa range. However: 1) I'd like to have an implementation for QEMU and/or kvmtool before accepting that ioctl. 2) I think the ioctl should work whatever the stage1 mapping is (e.g. with and without Ard's patches, with and without Laszlo's OVMF patch, etc.). Also, we may want to invalidate the cache for dirty pages before returning the dirty bitmap, and probably should do that directly in KVM_GET_DIRTY_LOG. Paolo
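A sketch of the dirty-bitmap-driven variant from the userspace side: fetch the dirty log for the framebuffer slot and invalidate only the pages that changed before reading them. flush_dcache_range() is the hypothetical helper sketched earlier in the thread; if KVM ends up doing the invalidation inside KVM_GET_DIRTY_LOG itself, the flush loop goes away:

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

void flush_dcache_range(void *buf, size_t len);	/* from the earlier sketch */

static int sync_dirty_fb(int vm_fd, uint32_t slot, void *fb_hva,
			 unsigned long npages, unsigned long page_size)
{
	unsigned long nlongs = (npages + 63) / 64;
	uint64_t bitmap[nlongs];	/* fine for a small slot; heap-allocate otherwise */
	struct kvm_dirty_log log;
	unsigned long i;

	memset(bitmap, 0, nlongs * sizeof(*bitmap));
	memset(&log, 0, sizeof(log));
	log.slot = slot;
	log.dirty_bitmap = bitmap;

	if (ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log) < 0)
		return -1;

	for (i = 0; i < npages; i++)
		if (bitmap[i / 64] & (1ULL << (i % 64)))
			flush_dcache_range((char *)fb_hva + i * page_size,
					   page_size);
	return 0;
}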
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Mon, Mar 02, 2015 at 08:31:46AM -0800, Christoffer Dall wrote: On Tue, Feb 24, 2015 at 05:47:19PM +, Ard Biesheuvel wrote: On 24 February 2015 at 14:55, Andrew Jones drjo...@redhat.com wrote: On Fri, Feb 20, 2015 at 04:36:26PM +0100, Andrew Jones wrote: On Fri, Feb 20, 2015 at 02:37:25PM +, Ard Biesheuvel wrote: On 20 February 2015 at 14:29, Andrew Jones drjo...@redhat.com wrote: So looks like the 3 orders of magnitude greater number of traps (only to el2) don't impact kernel compiles. OK, good! That was what I was hoping for, obviously. Then I thought I'd be able to quick measure the number of cycles a trap to el2 takes with this kvm-unit-tests test int main(void) { unsigned long start, end; unsigned int sctlr; asm volatile( mrs %0, sctlr_el1\n msr pmcr_el0, %1\n : =r (sctlr) : r (5)); asm volatile( mrs %0, pmccntr_el0\n msr sctlr_el1, %2\n mrs %1, pmccntr_el0\n : =r (start), =r (end) : r (sctlr)); printf(%llx\n, end - start); return 0; } after applying this patch to kvm diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S index bb91b6fc63861..5de39d740aa58 100644 --- a/arch/arm64/kvm/hyp.S +++ b/arch/arm64/kvm/hyp.S @@ -770,7 +770,7 @@ mrs x2, mdcr_el2 and x2, x2, #MDCR_EL2_HPMN_MASK - orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) +// orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) orr x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA) // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap But I get zero for the cycle count. Not sure what I'm missing. No clue tbh. Does the counter work as expected in the host? Guess not. I dropped the test into a module_init and inserted it on the host. Always get zero for pmccntr_el0 reads. Or, if I set it to something non-zero with a write, then I always get that back - no increments. pmcr_el0 looks OK... I had forgotten to set bit 31 of pmcntenset_el0, but doing that still doesn't help. Anyway, I assume the problem is me. I'll keep looking to see what I'm missing. I returned to this and see that the problem was indeed me. I needed yet another enable bit set (the filter register needed to be instructed to count cycles while in el2). I've attached the code for the curious. The numbers are mean=6999, std_dev=242. Run on the host, or in a guest running on a host without this patch series (after TVM traps have been disabled), I get a pretty consistent 40. I checked how many vm-sysreg traps we do during the kernel compile benchmark. It's 124924. So it's a bit strange that we don't see the benchmark taking 10 to 20 seconds longer on average. I should probably double check my runs. In any case, while I like the approach of this series, the overhead is looking non-negligible. Thanks a lot for producing these numbers. 125k x 7k == 1 billion cycles == 1 second on a 1 GHz machine, I think? Or am I missing something? How long does the actual compile take? I ran a sequence of benchmarks that I occasionally run (pbzip, kernbench, and hackbench) and I also saw 1% performance degradation, so I think we can trust that somewhat. (I can post the raw numbers when I have ssh access to my Linux desktop - sending this from Somewhere Over The Atlantic). However, my concern with these patches are on two points: 1. It's not a fix-all. We still have the case where the guest expects the behavior of device memory (for strong ordering for example) on a RAM region, which we now break. Similiarly this doesn't support the non-coherent DMA to RAM region case. 2. While the code is probably as nice as this kind of stuff gets, it is non-trivial and extremely difficult to debug. 
The counter-point here is that we may end up handling other stuff at EL2 for performance reasons in the future. Mainly because of point 1 above, I am leaning to thinking userspace should do the invalidation when it knows it needs to, either through KVM via a memslot flag or through some other syscall mechanism. I've started down the memslot flag road by promoting KVM_MEMSLOT_INCOHERENT to uapi/KVM_MEM_INCOHERENT, replacing the readonly memslot heuristic. With a couple more changes it should work for all memory regions with the 'incoherent' property. I'll make some changes to QEMU to test it all out as well. Progress was slow last week due to too many higher priority tasks, but I plan to return to it this week. Thanks, drew Thanks, -Christoffer
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Fri, Feb 20, 2015 at 04:36:26PM +0100, Andrew Jones wrote: On Fri, Feb 20, 2015 at 02:37:25PM +, Ard Biesheuvel wrote: On 20 February 2015 at 14:29, Andrew Jones drjo...@redhat.com wrote: So looks like the 3 orders of magnitude greater number of traps (only to el2) don't impact kernel compiles. OK, good! That was what I was hoping for, obviously. Then I thought I'd be able to quick measure the number of cycles a trap to el2 takes with this kvm-unit-tests test int main(void) { unsigned long start, end; unsigned int sctlr; asm volatile( mrs %0, sctlr_el1\n msr pmcr_el0, %1\n : =r (sctlr) : r (5)); asm volatile( mrs %0, pmccntr_el0\n msr sctlr_el1, %2\n mrs %1, pmccntr_el0\n : =r (start), =r (end) : r (sctlr)); printf(%llx\n, end - start); return 0; } after applying this patch to kvm diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S index bb91b6fc63861..5de39d740aa58 100644 --- a/arch/arm64/kvm/hyp.S +++ b/arch/arm64/kvm/hyp.S @@ -770,7 +770,7 @@ mrs x2, mdcr_el2 and x2, x2, #MDCR_EL2_HPMN_MASK - orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) +// orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) orr x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA) // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap But I get zero for the cycle count. Not sure what I'm missing. No clue tbh. Does the counter work as expected in the host? Guess not. I dropped the test into a module_init and inserted it on the host. Always get zero for pmccntr_el0 reads. Or, if I set it to something non-zero with a write, then I always get that back - no increments. pmcr_el0 looks OK... I had forgotten to set bit 31 of pmcntenset_el0, but doing that still doesn't help. Anyway, I assume the problem is me. I'll keep looking to see what I'm missing. I returned to this and see that the problem was indeed me. I needed yet another enable bit set (the filter register needed to be instructed to count cycles while in el2). I've attached the code for the curious. The numbers are mean=6999, std_dev=242. Run on the host, or in a guest running on a host without this patch series (after TVM traps have been disabled), I get a pretty consistent 40. I checked how many vm-sysreg traps we do during the kernel compile benchmark. It's 124924. So it's a bit strange that we don't see the benchmark taking 10 to 20 seconds longer on average. I should probably double check my runs. In any case, while I like the approach of this series, the overhead is looking non-negligible. drew #include libcflat.h static void prep_cc(void) { asm volatile( msr pmovsclr_el0, %0\n msr pmccfiltr_el0, %1\n msr pmcntenset_el0, %2\n msr pmcr_el0, %3\n isb\n : : r (1 31), r (1 27), r (1 31), r (1 6 | 1 2 | 1 0)); } int main(void) { unsigned long start, end; unsigned int sctlr; int i, zeros = 0; asm volatile(mrs %0, sctlr_el1 : =r (sctlr)); prep_cc(); for (i = 0; i 10; ++i) { asm volatile( mrs %0, pmccntr_el0\n msr sctlr_el1, %2\n mrs %1, pmccntr_el0\n isb\n : =r (start), =r (end) : r (sctlr)); if ((i % 10) == 0) printf(\n); printf( %d, end - start); if ((end - start) == 0) { ++zeros; prep_cc(); } } printf(\nnum zero counts = %d\n, zeros); return 0; }
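For readability, here is the attached test again with the string quotes and shift operators that the list archive stripped put back; the loop bound is reproduced as it appears above (the comparison operator was lost, so it may have been larger in the original), and the bits are the ones described: pmccfiltr_el0 bit 27 to count cycles in EL2, pmcntenset_el0 bit 31 for the cycle counter.

#include <libcflat.h>

static void prep_cc(void)
{
	asm volatile(
	"	msr	pmovsclr_el0, %0\n"
	"	msr	pmccfiltr_el0, %1\n"
	"	msr	pmcntenset_el0, %2\n"
	"	msr	pmcr_el0, %3\n"
	"	isb\n"
	: : "r" (1UL << 31), "r" (1UL << 27), "r" (1UL << 31),
	    "r" (1UL << 6 | 1UL << 2 | 1UL << 0));
}

int main(void)
{
	unsigned long start, end;
	unsigned int sctlr;
	int i, zeros = 0;

	asm volatile("mrs %0, sctlr_el1" : "=r" (sctlr));
	prep_cc();

	for (i = 0; i < 10; ++i) {
		asm volatile(
		"	mrs	%0, pmccntr_el0\n"
		"	msr	sctlr_el1, %2\n"
		"	mrs	%1, pmccntr_el0\n"
		"	isb\n"
		: "=r" (start), "=r" (end) : "r" (sctlr));
		if ((i % 10) == 0)
			printf("\n");
		printf(" %d", (int)(end - start));
		if ((end - start) == 0) {
			++zeros;
			prep_cc();
		}
	}
	printf("\nnum zero counts = %d\n", zeros);
	return 0;
}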
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 24 February 2015 at 14:55, Andrew Jones drjo...@redhat.com wrote: On Fri, Feb 20, 2015 at 04:36:26PM +0100, Andrew Jones wrote: On Fri, Feb 20, 2015 at 02:37:25PM +, Ard Biesheuvel wrote: On 20 February 2015 at 14:29, Andrew Jones drjo...@redhat.com wrote: So looks like the 3 orders of magnitude greater number of traps (only to el2) don't impact kernel compiles. OK, good! That was what I was hoping for, obviously. Then I thought I'd be able to quick measure the number of cycles a trap to el2 takes with this kvm-unit-tests test int main(void) { unsigned long start, end; unsigned int sctlr; asm volatile( mrs %0, sctlr_el1\n msr pmcr_el0, %1\n : =r (sctlr) : r (5)); asm volatile( mrs %0, pmccntr_el0\n msr sctlr_el1, %2\n mrs %1, pmccntr_el0\n : =r (start), =r (end) : r (sctlr)); printf(%llx\n, end - start); return 0; } after applying this patch to kvm diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S index bb91b6fc63861..5de39d740aa58 100644 --- a/arch/arm64/kvm/hyp.S +++ b/arch/arm64/kvm/hyp.S @@ -770,7 +770,7 @@ mrs x2, mdcr_el2 and x2, x2, #MDCR_EL2_HPMN_MASK - orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) +// orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) orr x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA) // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap But I get zero for the cycle count. Not sure what I'm missing. No clue tbh. Does the counter work as expected in the host? Guess not. I dropped the test into a module_init and inserted it on the host. Always get zero for pmccntr_el0 reads. Or, if I set it to something non-zero with a write, then I always get that back - no increments. pmcr_el0 looks OK... I had forgotten to set bit 31 of pmcntenset_el0, but doing that still doesn't help. Anyway, I assume the problem is me. I'll keep looking to see what I'm missing. I returned to this and see that the problem was indeed me. I needed yet another enable bit set (the filter register needed to be instructed to count cycles while in el2). I've attached the code for the curious. The numbers are mean=6999, std_dev=242. Run on the host, or in a guest running on a host without this patch series (after TVM traps have been disabled), I get a pretty consistent 40. I checked how many vm-sysreg traps we do during the kernel compile benchmark. It's 124924. So it's a bit strange that we don't see the benchmark taking 10 to 20 seconds longer on average. I should probably double check my runs. In any case, while I like the approach of this series, the overhead is looking non-negligible. Thanks a lot for producing these numbers. 125k x 7k == 1 billion cycles == 1 second on a 1 GHz machine, I think? Or am I missing something? How long does the actual compile take? -- Ard. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Tue, Feb 24, 2015 at 05:47:19PM +, Ard Biesheuvel wrote: On 24 February 2015 at 14:55, Andrew Jones drjo...@redhat.com wrote: On Fri, Feb 20, 2015 at 04:36:26PM +0100, Andrew Jones wrote: On Fri, Feb 20, 2015 at 02:37:25PM +, Ard Biesheuvel wrote: On 20 February 2015 at 14:29, Andrew Jones drjo...@redhat.com wrote: So looks like the 3 orders of magnitude greater number of traps (only to el2) don't impact kernel compiles. OK, good! That was what I was hoping for, obviously. Then I thought I'd be able to quick measure the number of cycles a trap to el2 takes with this kvm-unit-tests test int main(void) { unsigned long start, end; unsigned int sctlr; asm volatile( mrs %0, sctlr_el1\n msr pmcr_el0, %1\n : =r (sctlr) : r (5)); asm volatile( mrs %0, pmccntr_el0\n msr sctlr_el1, %2\n mrs %1, pmccntr_el0\n : =r (start), =r (end) : r (sctlr)); printf(%llx\n, end - start); return 0; } after applying this patch to kvm diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S index bb91b6fc63861..5de39d740aa58 100644 --- a/arch/arm64/kvm/hyp.S +++ b/arch/arm64/kvm/hyp.S @@ -770,7 +770,7 @@ mrs x2, mdcr_el2 and x2, x2, #MDCR_EL2_HPMN_MASK - orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) +// orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) orr x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA) // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap But I get zero for the cycle count. Not sure what I'm missing. No clue tbh. Does the counter work as expected in the host? Guess not. I dropped the test into a module_init and inserted it on the host. Always get zero for pmccntr_el0 reads. Or, if I set it to something non-zero with a write, then I always get that back - no increments. pmcr_el0 looks OK... I had forgotten to set bit 31 of pmcntenset_el0, but doing that still doesn't help. Anyway, I assume the problem is me. I'll keep looking to see what I'm missing. I returned to this and see that the problem was indeed me. I needed yet another enable bit set (the filter register needed to be instructed to count cycles while in el2). I've attached the code for the curious. The numbers are mean=6999, std_dev=242. Run on the host, or in a guest running on a host without this patch series (after TVM traps have been disabled), I get a pretty consistent 40. I checked how many vm-sysreg traps we do during the kernel compile benchmark. It's 124924. So it's a bit strange that we don't see the benchmark taking 10 to 20 seconds longer on average. I should probably double check my runs. In any case, while I like the approach of this series, the overhead is looking non-negligible. Thanks a lot for producing these numbers. 125k x 7k == 1 billion cycles == 1 second on a 1 GHz machine, I think? Or am I missing something? How long does the actual compile take? Wait, my fault. I dropped a pretty big divisor in my calculation. Don't ask... I'll just go home and study one of my daughter's math books now... So, I even have a 2.4 GHz machine, which explains why the benchmark times are the same with and without this series (those times are provided earlier in this thread, they're roughly 03:10). I'm glad you straighted me out. I was second guessing my benchmark results, and considering redoing them. Anyway, this series, at least wrt to overhead, is looking good again. Thanks, drew -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Thu, Feb 19, 2015 at 06:57:24PM +0100, Paolo Bonzini wrote: On 19/02/2015 18:55, Andrew Jones wrote: (I don't have an exact number for how many times it went to EL1 because access_mair() doesn't have a trace point.) (I got the 62873 number by testing a 3rd kernel build that only had patch 3/3 applied to the base, and counting kvm_toggle_cache events.) (The number 50 is the number of kvm_toggle_cache events *without* 3/3 applied.) I consider this bad news because, even considering it only goes to EL2, it goes a ton more than it used to. I realize patch 3/3 isn't the final plan for enabling traps though. If a full guest boots, can you try timing a kernel compile? Guests boot. I used an 8 vcpu, 14G memory guest; compiled the kernel 4 times inside the guest for each host kernel; base and mair. I dropped the time from the first run of each set, and captured the other 3. Command line used below. Time is from the Elapsed (wall clock) time (h:mm:ss or m:ss): output of /usr/bin/time - the host's wall clock. /usr/bin/time --verbose ssh $VM 'cd kernel make -s clean make -s -j8' Results: base: 3:06.11 3:07.00 3:10.93 mair: 3:08.47 3:06.75 3:04.76 So looks like the 3 orders of magnitude greater number of traps (only to el2) don't impact kernel compiles. Then I thought I'd be able to quick measure the number of cycles a trap to el2 takes with this kvm-unit-tests test int main(void) { unsigned long start, end; unsigned int sctlr; asm volatile( mrs %0, sctlr_el1\n msr pmcr_el0, %1\n : =r (sctlr) : r (5)); asm volatile( mrs %0, pmccntr_el0\n msr sctlr_el1, %2\n mrs %1, pmccntr_el0\n : =r (start), =r (end) : r (sctlr)); printf(%llx\n, end - start); return 0; } after applying this patch to kvm diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S index bb91b6fc63861..5de39d740aa58 100644 --- a/arch/arm64/kvm/hyp.S +++ b/arch/arm64/kvm/hyp.S @@ -770,7 +770,7 @@ mrs x2, mdcr_el2 and x2, x2, #MDCR_EL2_HPMN_MASK - orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) +// orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) orr x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA) // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap But I get zero for the cycle count. Not sure what I'm missing. drew -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 20 February 2015 at 14:29, Andrew Jones drjo...@redhat.com wrote: On Thu, Feb 19, 2015 at 06:57:24PM +0100, Paolo Bonzini wrote: On 19/02/2015 18:55, Andrew Jones wrote: (I don't have an exact number for how many times it went to EL1 because access_mair() doesn't have a trace point.) (I got the 62873 number by testing a 3rd kernel build that only had patch 3/3 applied to the base, and counting kvm_toggle_cache events.) (The number 50 is the number of kvm_toggle_cache events *without* 3/3 applied.) I consider this bad news because, even considering it only goes to EL2, it goes a ton more than it used to. I realize patch 3/3 isn't the final plan for enabling traps though. If a full guest boots, can you try timing a kernel compile? Guests boot. I used an 8 vcpu, 14G memory guest; compiled the kernel 4 times inside the guest for each host kernel; base and mair. I dropped the time from the first run of each set, and captured the other 3. Command line used below. Time is from the Elapsed (wall clock) time (h:mm:ss or m:ss): output of /usr/bin/time - the host's wall clock. /usr/bin/time --verbose ssh $VM 'cd kernel make -s clean make -s -j8' Results: base: 3:06.11 3:07.00 3:10.93 mair: 3:08.47 3:06.75 3:04.76 So looks like the 3 orders of magnitude greater number of traps (only to el2) don't impact kernel compiles. OK, good! That was what I was hoping for, obviously. Then I thought I'd be able to quick measure the number of cycles a trap to el2 takes with this kvm-unit-tests test int main(void) { unsigned long start, end; unsigned int sctlr; asm volatile( mrs %0, sctlr_el1\n msr pmcr_el0, %1\n : =r (sctlr) : r (5)); asm volatile( mrs %0, pmccntr_el0\n msr sctlr_el1, %2\n mrs %1, pmccntr_el0\n : =r (start), =r (end) : r (sctlr)); printf(%llx\n, end - start); return 0; } after applying this patch to kvm diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S index bb91b6fc63861..5de39d740aa58 100644 --- a/arch/arm64/kvm/hyp.S +++ b/arch/arm64/kvm/hyp.S @@ -770,7 +770,7 @@ mrs x2, mdcr_el2 and x2, x2, #MDCR_EL2_HPMN_MASK - orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) +// orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) orr x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA) // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap But I get zero for the cycle count. Not sure what I'm missing. No clue tbh. Does the counter work as expected in the host? -- Ard. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Fri, Feb 20, 2015 at 02:37:25PM +, Ard Biesheuvel wrote: On 20 February 2015 at 14:29, Andrew Jones drjo...@redhat.com wrote: So looks like the 3 orders of magnitude greater number of traps (only to el2) don't impact kernel compiles. OK, good! That was what I was hoping for, obviously. Then I thought I'd be able to quick measure the number of cycles a trap to el2 takes with this kvm-unit-tests test int main(void) { unsigned long start, end; unsigned int sctlr; asm volatile( mrs %0, sctlr_el1\n msr pmcr_el0, %1\n : =r (sctlr) : r (5)); asm volatile( mrs %0, pmccntr_el0\n msr sctlr_el1, %2\n mrs %1, pmccntr_el0\n : =r (start), =r (end) : r (sctlr)); printf(%llx\n, end - start); return 0; } after applying this patch to kvm diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S index bb91b6fc63861..5de39d740aa58 100644 --- a/arch/arm64/kvm/hyp.S +++ b/arch/arm64/kvm/hyp.S @@ -770,7 +770,7 @@ mrs x2, mdcr_el2 and x2, x2, #MDCR_EL2_HPMN_MASK - orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) +// orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) orr x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA) // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap But I get zero for the cycle count. Not sure what I'm missing. No clue tbh. Does the counter work as expected in the host? Guess not. I dropped the test into a module_init and inserted it on the host. Always get zero for pmccntr_el0 reads. Or, if I set it to something non-zero with a write, then I always get that back - no increments. pmcr_el0 looks OK... I had forgotten to set bit 31 of pmcntenset_el0, but doing that still doesn't help. Anyway, I assume the problem is me. I'll keep looking to see what I'm missing. drew -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
This is a 0th order approximation of how we could potentially force the guest to avoid uncached mappings, at least from the moment the MMU is on. (Before that, all of memory is implicitly classified as Device-nGnRnE) The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings with cached ones. This way, there is no need to mangle any guest page tables. The downside is that, to do this correctly, we need to always trap writes to the VM sysreg group, which includes registers that the guest may write to very often. To reduce the associated performance hit, patch #1 introduces a fast path for EL2 to perform trivial sysreg writes on behalf of the guest, without the need for a full world switch to the host and back. The main purpose of these patches is to quantify the performance hit, and verify whether the MAIR_EL1 handling works correctly. Ard Biesheuvel (3): arm64: KVM: handle some sysreg writes in EL2 arm64: KVM: mangle MAIR register to prevent uncached guest mappings arm64: KVM: keep trapping of VM sysreg writes enabled arch/arm/kvm/mmu.c | 2 +- arch/arm64/include/asm/kvm_arm.h | 2 +- arch/arm64/kvm/hyp.S | 101 +++ arch/arm64/kvm/sys_regs.c| 63 4 files changed, 156 insertions(+), 12 deletions(-) -- 1.8.3.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
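For background, the "trap writes to the VM sysreg group" mechanism is the architectural HCR_EL2.TVM bit: while it is set, guest writes to the virtual memory control registers (SCTLR_EL1, MAIR_EL1, TTBR0/1_EL1, ...) trap to EL2. A minimal sketch of turning it on, not the code from the series:

#define HCR_TVM		(1UL << 26)	/* trap guest writes to VM control regs */

static inline void enable_vm_sysreg_trapping(void)
{
	unsigned long hcr;

	asm volatile("mrs %0, hcr_el2" : "=r" (hcr));
	hcr |= HCR_TVM;
	asm volatile("msr hcr_el2, %0\n\tisb" : : "r" (hcr) : "memory");
}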
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 19 February 2015 at 15:27, Alexander Graf ag...@suse.de wrote: On 19.02.15 15:56, Ard Biesheuvel wrote: On 19 February 2015 at 14:50, Alexander Graf ag...@suse.de wrote: On 19.02.15 11:54, Ard Biesheuvel wrote: This is a 0th order approximation of how we could potentially force the guest to avoid uncached mappings, at least from the moment the MMU is on. (Before that, all of memory is implicitly classified as Device-nGnRnE) The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings with cached ones. This way, there is no need to mangle any guest page tables. Would you mind to give a brief explanation on what this does? What happens to actually assigned devices that need to be mapped as uncached? What happens to DMA from such devices when the guest assumes that it's accessing RAM uncached and then triggers DMA? On ARM, stage 2 mappings that are more strict will supersede stage 1 mappings, so the idea is to use cached mappings exclusively for stage 1 so that the host is fully in control of the actual memory attributes by setting the attributes at stage 2. This also makes sense because the host will ultimately know better whether some range that the guest thinks is a device is actually a device or just emulated (no stage 2 mapping), backed by host memory (such as the NOR flash read case) or backed by a passthrough device. Ok, so that means if the guest maps RAM as uncached, it will actually end up as cached memory. Now if the guest triggers a DMA request to a passed through device to that RAM, it will conflict with the cache. I don't know whether it's a big deal, but it's the scenario that came up with the approach above before when I talked to people about it. Well, I am using write-through read+write allocate, which hopefully means that the actual RAM is kept in sync with the cache, but I must confess I am a bit out of my depth here with the fine print in the ARM ARM. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
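To spell out the combining rule being relied on here: when stage 1 and stage 2 both assign a memory type, the result is the more restrictive of the two (Device, then Normal Non-cacheable, then Write-Through, then Write-Back), so stage 2 can downgrade a cacheable stage 1 mapping but never upgrade a non-cacheable one. A conceptual model, with made-up names rather than architectural encodings:

/* Conceptual only: ordering from most to least restrictive. */
enum mem_attr { ATTR_DEVICE, ATTR_NORMAL_NC, ATTR_NORMAL_WT, ATTR_NORMAL_WB };

static enum mem_attr combine_s1_s2(enum mem_attr s1, enum mem_attr s2)
{
	/* the stricter of the two attributes wins */
	return s1 < s2 ? s1 : s2;
}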
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 19.02.15 15:56, Ard Biesheuvel wrote: On 19 February 2015 at 14:50, Alexander Graf ag...@suse.de wrote: On 19.02.15 11:54, Ard Biesheuvel wrote: This is a 0th order approximation of how we could potentially force the guest to avoid uncached mappings, at least from the moment the MMU is on. (Before that, all of memory is implicitly classified as Device-nGnRnE) The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings with cached ones. This way, there is no need to mangle any guest page tables. Would you mind to give a brief explanation on what this does? What happens to actually assigned devices that need to be mapped as uncached? What happens to DMA from such devices when the guest assumes that it's accessing RAM uncached and then triggers DMA? On ARM, stage 2 mappings that are more strict will supersede stage 1 mappings, so the idea is to use cached mappings exclusively for stage 1 so that the host is fully in control of the actual memory attributes by setting the attributes at stage 2. This also makes sense because the host will ultimately know better whether some range that the guest thinks is a device is actually a device or just emulated (no stage 2 mapping), backed by host memory (such as the NOR flash read case) or backed by a passthrough device. Ok, so that means if the guest maps RAM as uncached, it will actually end up as cached memory. Now if the guest triggers a DMA request to a passed through device to that RAM, it will conflict with the cache. I don't know whether it's a big deal, but it's the scenario that came up with the approach above before when I talked to people about it. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Thu, Feb 19, 2015 at 05:19:35PM +, Ard Biesheuvel wrote: On 19 February 2015 at 16:57, Andrew Jones drjo...@redhat.com wrote: On Thu, Feb 19, 2015 at 10:54:43AM +, Ard Biesheuvel wrote: This is a 0th order approximation of how we could potentially force the guest to avoid uncached mappings, at least from the moment the MMU is on. (Before that, all of memory is implicitly classified as Device-nGnRnE) The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings with cached ones. This way, there is no need to mangle any guest page tables. The downside is that, to do this correctly, we need to always trap writes to the VM sysreg group, which includes registers that the guest may write to very often. To reduce the associated performance hit, patch #1 introduces a fast path for EL2 to perform trivial sysreg writes on behalf of the guest, without the need for a full world switch to the host and back. The main purpose of these patches is to quantify the performance hit, and verify whether the MAIR_EL1 handling works correctly. Ard Biesheuvel (3): arm64: KVM: handle some sysreg writes in EL2 arm64: KVM: mangle MAIR register to prevent uncached guest mappings arm64: KVM: keep trapping of VM sysreg writes enabled Hi Ard, I took this series for test drive. Unfortunately I have bad news and worse news. First, a description of the test; simply boot a guest, once at login, login, and then shutdown with 'poweroff'. The guest boots through AAVMF using a build from Laszlo that enables PCI, but does *not* have the 'map pci mmio as cached' kludge. This test allows us to check for corrupt vram on the graphical console, plus it completes a boot/shutdown cycle allowing us to count sysreg traps of the boot/shutdown cycle. Thanks a lot for giving this a spin right away! So, the bad news Before this series we trapped 50 times on sysreg writes with the test described above. With this series we trap 62873 times. But, less than 20 required going to EL1. OK, this is very useful information. We still don't know what the penalty is of all those traps, but that's quite a big number indeed. (I don't have an exact number for how many times it went to EL1 because access_mair() doesn't have a trace point.) (I got the 62873 number by testing a 3rd kernel build that only had patch 3/3 applied to the base, and counting kvm_toggle_cache events.) (The number 50 is the number of kvm_toggle_cache events *without* 3/3 applied.) I consider this bad news because, even considering it only goes to EL2, it goes a ton more than it used to. I realize patch 3/3 isn't the final plan for enabling traps though. And, now the worse news The vram corruption persists with this patch series. OK, so the primary difference is that I am not substituting for write back mappings, as Laszlo is doing in his patch. If you have energy left, would you mind having another go but use 0xff (not 0xbb) for the MAIR values in patch #2? Yup, a bit energy left, and, yup, 0xff fixes it. Thanks, drew arch/arm/kvm/mmu.c | 2 +- arch/arm64/include/asm/kvm_arm.h | 2 +- arch/arm64/kvm/hyp.S | 101 +++ arch/arm64/kvm/sys_regs.c| 63 4 files changed, 156 insertions(+), 12 deletions(-) -- 1.8.3.2 ___ kvmarm mailing list kvm...@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 19 feb. 2015, at 17:55, Andrew Jones drjo...@redhat.com wrote: On Thu, Feb 19, 2015 at 05:19:35PM +, Ard Biesheuvel wrote: On 19 February 2015 at 16:57, Andrew Jones drjo...@redhat.com wrote: On Thu, Feb 19, 2015 at 10:54:43AM +, Ard Biesheuvel wrote: This is a 0th order approximation of how we could potentially force the guest to avoid uncached mappings, at least from the moment the MMU is on. (Before that, all of memory is implicitly classified as Device-nGnRnE) The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings with cached ones. This way, there is no need to mangle any guest page tables. The downside is that, to do this correctly, we need to always trap writes to the VM sysreg group, which includes registers that the guest may write to very often. To reduce the associated performance hit, patch #1 introduces a fast path for EL2 to perform trivial sysreg writes on behalf of the guest, without the need for a full world switch to the host and back. The main purpose of these patches is to quantify the performance hit, and verify whether the MAIR_EL1 handling works correctly. Ard Biesheuvel (3): arm64: KVM: handle some sysreg writes in EL2 arm64: KVM: mangle MAIR register to prevent uncached guest mappings arm64: KVM: keep trapping of VM sysreg writes enabled Hi Ard, I took this series for test drive. Unfortunately I have bad news and worse news. First, a description of the test; simply boot a guest, once at login, login, and then shutdown with 'poweroff'. The guest boots through AAVMF using a build from Laszlo that enables PCI, but does *not* have the 'map pci mmio as cached' kludge. This test allows us to check for corrupt vram on the graphical console, plus it completes a boot/shutdown cycle allowing us to count sysreg traps of the boot/shutdown cycle. Thanks a lot for giving this a spin right away! So, the bad news Before this series we trapped 50 times on sysreg writes with the test described above. With this series we trap 62873 times. But, less than 20 required going to EL1. OK, this is very useful information. We still don't know what the penalty is of all those traps, but that's quite a big number indeed. (I don't have an exact number for how many times it went to EL1 because access_mair() doesn't have a trace point.) (I got the 62873 number by testing a 3rd kernel build that only had patch 3/3 applied to the base, and counting kvm_toggle_cache events.) (The number 50 is the number of kvm_toggle_cache events *without* 3/3 applied.) I consider this bad news because, even considering it only goes to EL2, it goes a ton more than it used to. I realize patch 3/3 isn't the final plan for enabling traps though. And, now the worse news The vram corruption persists with this patch series. OK, so the primary difference is that I am not substituting for write back mappings, as Laszlo is doing in his patch. If you have energy left, would you mind having another go but use 0xff (not 0xbb) for the MAIR values in patch #2? Yup, a bit energy left, and, yup, 0xff fixes it ok so that means we'd need to map as writeback cacheable by default, and restrict it as necessary at stage 2. 
thanks
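A minimal sketch of what that split could look like, assuming the guest's stage 1 attributes are steered to Normal Write-Back and the host picks the stage 2 attribute per region. The region types, helper name, and constant names are made up for illustration and are not from these patches:

#include <stdint.h>

/* Stage 2 MemAttr[3:0] encodings (ARMv8); names are illustrative. */
#define S2_MEMATTR_DEVICE_nGnRE  0x1   /* MemAttr[3:2]=00 (Device), MemAttr[1:0]=01 (nGnRE) */
#define S2_MEMATTR_NORMAL_WB     0xf   /* MemAttr[3:2]=11, MemAttr[1:0]=11 (Normal write-back) */

enum region_backing {
	REGION_GUEST_RAM,        /* ordinary RAM, kept coherent with the host */
	REGION_PASSTHROUGH_DEV,  /* real device BAR assigned to the guest */
};

/* Pick the MemAttr field for a stage 2 block/page descriptor. */
static uint8_t stage2_memattr(enum region_backing type)
{
	switch (type) {
	case REGION_PASSTHROUGH_DEV:
		/* Restrict to Device semantics regardless of the guest's stage 1. */
		return S2_MEMATTR_DEVICE_nGnRE;
	case REGION_GUEST_RAM:
	default:
		/* Leave the guest's Normal Write-Back stage 1 attribute in effect. */
		return S2_MEMATTR_NORMAL_WB;
	}
}

Emulated MMIO would not appear here at all: with no stage 2 mapping in place, every access traps to the host, so the attribute question does not arise.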
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 19 February 2015 at 16:57, Andrew Jones drjo...@redhat.com wrote:

Hi Ard,

I took this series for a test drive. Unfortunately I have bad news and worse news. First, a description of the test: simply boot a guest; once at login, log in, and then shut down with 'poweroff'. The guest boots through AAVMF using a build from Laszlo that enables PCI, but does *not* have the 'map pci mmio as cached' kludge. This test allows us to check for corrupt vram on the graphical console, plus it completes a boot/shutdown cycle, allowing us to count the sysreg traps of that cycle.

Thanks a lot for giving this a spin right away!

So, the bad news: before this series we trapped 50 times on sysreg writes with the test described above. With this series we trap 62873 times, but fewer than 20 required going to EL1.

OK, this is very useful information. We still don't know what the penalty of all those traps is, but that's quite a big number indeed.

(I don't have an exact number for how many times it went to EL1 because access_mair() doesn't have a trace point.) (I got the 62873 number by testing a 3rd kernel build that only had patch 3/3 applied to the base, and counting kvm_toggle_cache events.) (The number 50 is the number of kvm_toggle_cache events *without* 3/3 applied.)

I consider this bad news because, even though it only goes to EL2, it traps a ton more than it used to. I realize patch 3/3 isn't the final plan for enabling traps, though.

And now the worse news: the vram corruption persists with this patch series.

OK, so the primary difference is that I am not substituting write-back mappings, as Laszlo does in his patch. If you have energy left, would you mind having another go, but with 0xff (not 0xbb) for the MAIR values in patch #2?
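To make the mechanism of patch #2 concrete, here is a minimal sketch, under the assumption that the substituted value is the write-back encoding (0xff) discussed above, of what rewriting a trapped MAIR_EL1 value could look like. The function name and structure are assumptions for illustration, not the code from sys_regs.c:

#include <stdint.h>

#define MAIR_ATTR_NORMAL_WB	0xffULL		/* Normal, inner/outer write-back */

/* Rewrite any uncacheable attribute bytes in a MAIR_EL1 value being written. */
static uint64_t mangle_mair(uint64_t val)
{
	int i;

	for (i = 0; i < 8; i++) {
		uint64_t attr = (val >> (i * 8)) & 0xff;

		/*
		 * Device types have the high nibble clear; 0x44 is Normal
		 * Non-cacheable. Both defeat coherency with the host's
		 * cacheable mapping, so upgrade them to Normal Write-Back.
		 */
		if ((attr & 0xf0) == 0 || attr == 0x44) {
			val &= ~(0xffULL << (i * 8));
			val |= MAIR_ATTR_NORMAL_WB << (i * 8);
		}
	}
	return val;
}

In the series as posted, MAIR_EL1 writes that are not handled in the EL2 fast path land in access_mair() at EL1, which is why the exact EL1 count above could not be measured without a trace point.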
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 19.02.15 11:54, Ard Biesheuvel wrote:

This is a 0th order approximation of how we could potentially force the guest to avoid uncached mappings, at least from the moment the MMU is on. (Before that, all of memory is implicitly classified as Device-nGnRnE.) The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings with cached ones. This way, there is no need to mangle any guest page tables.

Would you mind giving a brief explanation of what this does? What happens to actually assigned devices that need to be mapped as uncached? What happens to DMA from such devices when the guest assumes that it's accessing RAM uncached and then triggers DMA?

Alex
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 19 February 2015 at 14:50, Alexander Graf ag...@suse.de wrote:

On 19.02.15 11:54, Ard Biesheuvel wrote:

This is a 0th order approximation of how we could potentially force the guest to avoid uncached mappings, at least from the moment the MMU is on. (Before that, all of memory is implicitly classified as Device-nGnRnE.) The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings with cached ones. This way, there is no need to mangle any guest page tables.

Would you mind giving a brief explanation of what this does? What happens to actually assigned devices that need to be mapped as uncached? What happens to DMA from such devices when the guest assumes that it's accessing RAM uncached and then triggers DMA?

On ARM, stage 2 mappings that are more strict will supersede stage 1 mappings, so the idea is to use cached mappings exclusively for stage 1, so that the host is fully in control of the actual memory attributes by setting the attributes at stage 2. This also makes sense because the host will ultimately know better whether some range that the guest thinks is a device is actually a device, or just emulated (no stage 2 mapping), backed by host memory (such as the NOR flash read case), or backed by a passthrough device.

--
Ard.
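The "more strict supersedes" rule can be summarized as: for any given access, the effective memory type is the weaker (less cacheable) of the stage 1 and stage 2 attributes. A toy illustration of that combining rule, not kernel code:

/* Smaller value = stricter: Device < Non-cacheable < Write-Through < Write-Back. */
enum mem_type { MT_DEVICE, MT_NORMAL_NC, MT_NORMAL_WT, MT_NORMAL_WB };

/* Effective type seen by the access is the stricter of the two stages. */
static enum mem_type effective_type(enum mem_type stage1, enum mem_type stage2)
{
	return stage1 < stage2 ? stage1 : stage2;
}

So with stage 1 pinned to Normal Write-Back, the host keeps the final say through stage 2, while the reverse direction, relaxing a stricter guest attribute from stage 2, is not possible.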
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Thu, Feb 19, 2015 at 10:54:43AM +, Ard Biesheuvel wrote:

This is a 0th order approximation of how we could potentially force the guest to avoid uncached mappings, at least from the moment the MMU is on. (Before that, all of memory is implicitly classified as Device-nGnRnE.)

The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings with cached ones. This way, there is no need to mangle any guest page tables. The downside is that, to do this correctly, we need to always trap writes to the VM sysreg group, which includes registers that the guest may write to very often. To reduce the associated performance hit, patch #1 introduces a fast path for EL2 to perform trivial sysreg writes on behalf of the guest, without the need for a full world switch to the host and back.

The main purpose of these patches is to quantify the performance hit, and verify whether the MAIR_EL1 handling works correctly.

Ard Biesheuvel (3):
  arm64: KVM: handle some sysreg writes in EL2
  arm64: KVM: mangle MAIR register to prevent uncached guest mappings
  arm64: KVM: keep trapping of VM sysreg writes enabled

Hi Ard,

I took this series for a test drive. Unfortunately I have bad news and worse news. First, a description of the test: simply boot a guest; once at login, log in, and then shut down with 'poweroff'. The guest boots through AAVMF using a build from Laszlo that enables PCI, but does *not* have the 'map pci mmio as cached' kludge. This test allows us to check for corrupt vram on the graphical console, plus it completes a boot/shutdown cycle, allowing us to count the sysreg traps of that cycle.

So, the bad news: before this series we trapped 50 times on sysreg writes with the test described above. With this series we trap 62873 times, but fewer than 20 required going to EL1. (I don't have an exact number for how many times it went to EL1 because access_mair() doesn't have a trace point.) (I got the 62873 number by testing a 3rd kernel build that only had patch 3/3 applied to the base, and counting kvm_toggle_cache events.) (The number 50 is the number of kvm_toggle_cache events *without* 3/3 applied.)

I consider this bad news because, even though it only goes to EL2, it traps a ton more than it used to. I realize patch 3/3 isn't the final plan for enabling traps, though.

And now the worse news: the vram corruption persists with this patch series.

drew

 arch/arm/kvm/mmu.c               |   2 +-
 arch/arm64/include/asm/kvm_arm.h |   2 +-
 arch/arm64/kvm/hyp.S             | 101 +++
 arch/arm64/kvm/sys_regs.c        |  63 ++
 4 files changed, 156 insertions(+), 12 deletions(-)
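Since the missing access_mair() trace point comes up repeatedly in this thread, here is a rough sketch of the kind of event that could be added to the KVM trace header so the EL1-handled MAIR writes could be counted with the usual trace tooling. The event name and fields are made up, and the TRACE_SYSTEM boilerplate of that header is omitted:

#include <linux/tracepoint.h>

TRACE_EVENT(kvm_mair_write,
	TP_PROTO(unsigned long vcpu_pc, unsigned long long mair),
	TP_ARGS(vcpu_pc, mair),

	TP_STRUCT__entry(
		__field(unsigned long, vcpu_pc)
		__field(unsigned long long, mair)
	),

	TP_fast_assign(
		__entry->vcpu_pc = vcpu_pc;
		__entry->mair = mair;
	),

	TP_printk("PC: 0x%08lx MAIR_EL1: 0x%016llx",
		  __entry->vcpu_pc, __entry->mair)
);

access_mair() would then call the generated trace_kvm_mair_write() with the guest PC and the value being written, and the event count would give the exact number of MAIR writes that took the full exit to EL1.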
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 19/02/2015 18:55, Andrew Jones wrote:

(I don't have an exact number for how many times it went to EL1 because access_mair() doesn't have a trace point.) (I got the 62873 number by testing a 3rd kernel build that only had patch 3/3 applied to the base, and counting kvm_toggle_cache events.) (The number 50 is the number of kvm_toggle_cache events *without* 3/3 applied.)

I consider this bad news because, even though it only goes to EL2, it traps a ton more than it used to. I realize patch 3/3 isn't the final plan for enabling traps, though.

If a full guest boots, can you try timing a kernel compile?

Paolo