Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 03/09/2015 07:26 AM, Andrew Jones wrote: On Fri, Mar 06, 2015 at 01:08:29PM -0800, Mario Smarduch wrote: On 03/05/2015 09:43 AM, Paolo Bonzini wrote: On 05/03/2015 15:58, Catalin Marinas wrote: It would especially suck if the user has a cluster with different machines, some of them coherent and others non-coherent, and then has to debug why the same configuration works on some machines and not on others. That's a problem indeed, especially with guest migration. But I don't think we have any sane solution here for the bus master DMA. I do not oppose doing cache management in QEMU for bus master DMA (though if the solution you outlined below works it would be great). ARM can override them as well but only making them stricter. Otherwise, on a weakly ordered architecture, it's not always safe (let's say the guest thinks it accesses Strongly Ordered memory and avoids barriers for flag updates but the host upgrades it to Cacheable which breaks the memory order). The same can happen on x86 though, even if it's rarer. You still need a barrier between stores and loads. If we want the host to enforce guest memory mapping attributes via stage 2, we could do it the other way around: get the guests to always assume full cache coherency, generating Normal Cacheable mappings, but use the stage 2 attributes restriction in the host to make such mappings non-cacheable when needed (it works this way on ARM but not in the other direction to relax the attributes). That sounds like a plan for device assignment. But it still would not solve the problem of the MMIO framebuffer, right? The problem arises with MMIO areas that the guest can reasonably expect to be uncacheable, but that are optimized by the host so that they end up backed by cacheable RAM. It's perfectly reasonable that the same device needs cacheable mapping with one userspace, and works with uncacheable mapping with another userspace that doesn't optimize the MMIO area to RAM. Unless the guest allocates the framebuffer itself (e.g. dma_alloc_coherent), we can't control the cacheability via dma-coherent properties as it refers to bus master DMA. Okay, it's good to rule that out. One less thing to think about. :) Same for _DSD. So for MMIO with the buffer allocated by the host (Qemu), the only solution I see on ARM is for the host to ensure coherency, either via explicit cache maintenance (new KVM API) or by changing the memory attributes used by Qemu to access such virtual MMIO. Basically Qemu is acting as a bus master when reading the framebuffer it allocated but the guest considers it a slave access and we don't have a way to tell the guest that such accesses should be cacheable, nor can we upgrade them via architecture features. Yes, that's a way to put it. In practice, the VGA framebuffer has an optimization that uses dirty page tracking, so we could piggyback on the ioctls that return which pages are dirty. It turns out that piggybacking on those ioctls also should fix the case of migrating a guest while the MMU is disabled. Yes, Qemu would need to invalidate the cache before reading a dirty framebuffer page. As I said above, an API that allows non-cacheable mappings for the VGA framebuffer in Qemu would also solve the problem. I'm not sure what KVM provides here (or whether we can add such API). Nothing for now; other architectures simply do not have the issue. As long as it's just VGA, we can quirk it. There's just a couple vendor/device IDs to catch, and the guest can then use a cacheable mapping. 
For a more generic solution, the API would be madvise(MADV_DONTCACHE). It would be easy for QEMU to use it, but I am not too optimistic about convincing the mm folks about it. We can try. I forgot to list this one in my summary of approaches[*]. This is a nice, clean approach. Avoids getting cache maintenance into everything. However, besides the difficulty to get it past mm people, it reduces performance for any userspace-userspace uses/sharing of the memory. userspace-guest requires cache maintenance, but nothing else. Maybe that's not an important concern for the few emulated devices that need it though. Interested to see the outcome. I was thinking of a very basic memory driver that can provide an uncached memslot to QEMU - in mmap() file operation apply pgprot_uncached to allocated pages, lock them, flush TLB call remap_pfn_range(). I guess this is the same as the madvise approach, but with a driver. KVM could take this approach itself when memslots are added/updated with the INCOHERENT flag. Maybe worth some experimental patches to find out? I would work on this but I'm tied up for the next 3 weeks. If anyone is interested I can provide the base code; I used it for memory passthrough, although testing may be time consuming. I think the hurdle here is making sure the kernel doesn't remap these pages for any reason, like page migration; locking the pages should tell the kernel not to.
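A rough, untested sketch of the "uncached memslot" driver idea described above: a character device whose mmap() hands QEMU a non-cacheable mapping of kernel-allocated pages. The names are made up for illustration, pgprot_noncached() is the helper arm64 actually provides for the attribute change (the "pgprot_uncached" above is presumably shorthand for it), and error/release cleanup is omitted.

/*
 * Sketch only: char device mmap() that backs a QEMU memslot with
 * non-cacheable memory.  Function and fops names are hypothetical.
 */
#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/module.h>

static int uncached_memslot_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;
	struct page *pages;

	/*
	 * Back the memslot with contiguous pages.  They are not on the
	 * LRU, so they will not be reclaimed or migrated behind our back.
	 */
	pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, get_order(size));
	if (!pages)
		return -ENOMEM;

	/* Give userspace (QEMU) a non-cacheable view of the pages. */
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
	vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;

	/* Install the PFN range into the QEMU process page tables. */
	return remap_pfn_range(vma, vma->vm_start, page_to_pfn(pages),
			       size, vma->vm_page_prot);
}

static const struct file_operations uncached_memslot_fops = {
	.owner = THIS_MODULE,
	.mmap  = uncached_memslot_mmap,
};

QEMU would mmap() such a device and register the resulting address range as a memslot; KVM could achieve the same effect internally for memslots carrying the proposed INCOHERENT flag, as suggested above.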
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Fri, Mar 06, 2015 at 01:08:29PM -0800, Mario Smarduch wrote: On 03/05/2015 09:43 AM, Paolo Bonzini wrote: On 05/03/2015 15:58, Catalin Marinas wrote: It would especially suck if the user has a cluster with different machines, some of them coherent and others non-coherent, and then has to debug why the same configuration works on some machines and not on others. That's a problem indeed, especially with guest migration. But I don't think we have any sane solution here for the bus master DMA. I do not oppose doing cache management in QEMU for bus master DMA (though if the solution you outlined below works it would be great). ARM can override them as well but only making them stricter. Otherwise, on a weakly ordered architecture, it's not always safe (let's say the guest thinks it accesses Strongly Ordered memory and avoids barriers for flag updates but the host upgrades it to Cacheable which breaks the memory order). The same can happen on x86 though, even if it's rarer. You still need a barrier between stores and loads. If we want the host to enforce guest memory mapping attributes via stage 2, we could do it the other way around: get the guests to always assume full cache coherency, generating Normal Cacheable mappings, but use the stage 2 attributes restriction in the host to make such mappings non-cacheable when needed (it works this way on ARM but not in the other direction to relax the attributes). That sounds like a plan for device assignment. But it still would not solve the problem of the MMIO framebuffer, right? The problem arises with MMIO areas that the guest can reasonably expect to be uncacheable, but that are optimized by the host so that they end up backed by cacheable RAM. It's perfectly reasonable that the same device needs cacheable mapping with one userspace, and works with uncacheable mapping with another userspace that doesn't optimize the MMIO area to RAM. Unless the guest allocates the framebuffer itself (e.g. dma_alloc_coherent), we can't control the cacheability via dma-coherent properties as it refers to bus master DMA. Okay, it's good to rule that out. One less thing to think about. :) Same for _DSD. So for MMIO with the buffer allocated by the host (Qemu), the only solution I see on ARM is for the host to ensure coherency, either via explicit cache maintenance (new KVM API) or by changing the memory attributes used by Qemu to access such virtual MMIO. Basically Qemu is acting as a bus master when reading the framebuffer it allocated but the guest considers it a slave access and we don't have a way to tell the guest that such accesses should be cacheable, nor can we upgrade them via architecture features. Yes, that's a way to put it. In practice, the VGA framebuffer has an optimization that uses dirty page tracking, so we could piggyback on the ioctls that return which pages are dirty. It turns out that piggybacking on those ioctls also should fix the case of migrating a guest while the MMU is disabled. Yes, Qemu would need to invalidate the cache before reading a dirty framebuffer page. As I said above, an API that allows non-cacheable mappings for the VGA framebuffer in Qemu would also solve the problem. I'm not sure what KVM provides here (or whether we can add such API). Nothing for now; other architectures simply do not have the issue. As long as it's just VGA, we can quirk it. There's just a couple vendor/device IDs to catch, and the guest can then use a cacheable mapping. For a more generic solution, the API would be madvise(MADV_DONTCACHE). 
It would be easy for QEMU to use it, but I am not too optimistic about convincing the mm folks about it. We can try. I forgot to list this one in my summary of approaches[*]. This is a nice, clean approach. Avoids getting cache maintenance into everything. However, besides the difficulty to get it past mm people, it reduces performance for any userspace-userspace uses/sharing of the memory. userspace-guest requires cache maintenance, but nothing else. Maybe that's not an important concern for the few emulated devices that need it though. Interested to see the outcome. I was thinking of a very basic memory driver that can provide an uncached memslot to QEMU - in mmap() file operation apply pgprot_uncached to allocated pages, lock them, flush TLB call remap_pfn_range(). I guess this is the same as the madvise approach, but with a driver. KVM could take this approach itself when memslots are added/updated with the INCOHERENT flag. Maybe worth some experimental patches to find out? I'm still thinking about experimenting with the ARM private syscalls next though. drew [*] http://lists.gnu.org/archive/html/qemu-devel/2015-03/msg01254.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to
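For illustration, this is roughly how QEMU could use the madvise() interface proposed above. MADV_DONTCACHE does not exist in mainline, so the constant below is purely a placeholder, and the framebuffer size is arbitrary.

/*
 * Illustration only: MADV_DONTCACHE is the flag *proposed* in this
 * thread, not an existing ABI constant.
 */
#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_DONTCACHE
#define MADV_DONTCACHE	64	/* hypothetical value */
#endif

#define FB_SIZE		(16 * 1024 * 1024)

int main(void)
{
	void *fb = mmap(NULL, FB_SIZE, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (fb == MAP_FAILED)
		return 1;

	/* Ask the kernel to map the emulated framebuffer uncached. */
	if (madvise(fb, FB_SIZE, MADV_DONTCACHE))
		perror("madvise(MADV_DONTCACHE)");

	/* ... register fb as the memslot backing the guest framebuffer ... */
	return 0;
}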
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 03/04/2015 03:35 AM, Catalin Marinas wrote: (please try to avoid top-posting) On Mon, Mar 02, 2015 at 06:20:19PM -0800, Mario Smarduch wrote: On 03/02/2015 08:31 AM, Christoffer Dall wrote: However, my concerns with these patches are on two points: 1. It's not a fix-all. We still have the case where the guest expects the behavior of device memory (for strong ordering for example) on a RAM region, which we now break. Similarly this doesn't support the non-coherent DMA to RAM region case. 2. While the code is probably as nice as this kind of stuff gets, it is non-trivial and extremely difficult to debug. The counter-point here is that we may end up handling other stuff at EL2 for performance reasons in the future. Mainly because of point 1 above, I am leaning towards thinking userspace should do the invalidation when it knows it needs to, either through KVM via a memslot flag or through some other syscall mechanism. I expressed my concerns as well, I'm definitely against merging this series. I don't understand how the CPU can handle different cache attributes used by QEMU and the guest - won't you run into the B2.9 checklist? Wouldn't cache evictions or cleans wipe out guest updates to the same cache line(s)? Clean+invalidate is a safe operation even if the guest accesses the memory in a cacheable way. But if the guest can update the cache lines, Qemu should avoid cache maintenance from a performance perspective. The guest is either told that the DMA is coherent (via DT properties) or Qemu deals with (non-)coherency itself. The latter is fully in line with the B2.9 chapter in the ARM ARM, more precisely point 5: If the mismatched attributes for a memory location all assign the same shareability attribute to the location, any loss of uniprocessor semantics or coherency within a shareability domain can be avoided by use of software cache management. ... it continues with what kind of cache maintenance is required, together with: A clean and invalidate instruction can be used instead of a clean instruction, or instead of an invalidate instruction. Hi Catalin, sorry for the top posting. I'm struggling with QEMU cache maintenance for devices whose registers are not cache-line aligned and which may be multi-function; for lack of a better example I thought of the sp804, which supports two devices with registers covered by one cache line. Wouldn't QEMU cache maintenance on one device have the potential to corrupt the second device? These could be used by two guest threads in parallel. I understand bullets 2 and 3; I'm still working on the first one, it will take a while. Thanks, - Mario
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 03/05/2015 09:43 AM, Paolo Bonzini wrote: On 05/03/2015 15:58, Catalin Marinas wrote: It would especially suck if the user has a cluster with different machines, some of them coherent and others non-coherent, and then has to debug why the same configuration works on some machines and not on others. That's a problem indeed, especially with guest migration. But I don't think we have any sane solution here for the bus master DMA. I do not oppose doing cache management in QEMU for bus master DMA (though if the solution you outlined below works it would be great). ARM can override them as well but only making them stricter. Otherwise, on a weakly ordered architecture, it's not always safe (let's say the guest thinks it accesses Strongly Ordered memory and avoids barriers for flag updates but the host upgrades it to Cacheable which breaks the memory order). The same can happen on x86 though, even if it's rarer. You still need a barrier between stores and loads. If we want the host to enforce guest memory mapping attributes via stage 2, we could do it the other way around: get the guests to always assume full cache coherency, generating Normal Cacheable mappings, but use the stage 2 attributes restriction in the host to make such mappings non-cacheable when needed (it works this way on ARM but not in the other direction to relax the attributes). That sounds like a plan for device assignment. But it still would not solve the problem of the MMIO framebuffer, right? The problem arises with MMIO areas that the guest can reasonably expect to be uncacheable, but that are optimized by the host so that they end up backed by cacheable RAM. It's perfectly reasonable that the same device needs cacheable mapping with one userspace, and works with uncacheable mapping with another userspace that doesn't optimize the MMIO area to RAM. Unless the guest allocates the framebuffer itself (e.g. dma_alloc_coherent), we can't control the cacheability via dma-coherent properties as it refers to bus master DMA. Okay, it's good to rule that out. One less thing to think about. :) Same for _DSD. So for MMIO with the buffer allocated by the host (Qemu), the only solution I see on ARM is for the host to ensure coherency, either via explicit cache maintenance (new KVM API) or by changing the memory attributes used by Qemu to access such virtual MMIO. Basically Qemu is acting as a bus master when reading the framebuffer it allocated but the guest considers it a slave access and we don't have a way to tell the guest that such accesses should be cacheable, nor can we upgrade them via architecture features. Yes, that's a way to put it. In practice, the VGA framebuffer has an optimization that uses dirty page tracking, so we could piggyback on the ioctls that return which pages are dirty. It turns out that piggybacking on those ioctls also should fix the case of migrating a guest while the MMU is disabled. Yes, Qemu would need to invalidate the cache before reading a dirty framebuffer page. As I said above, an API that allows non-cacheable mappings for the VGA framebuffer in Qemu would also solve the problem. I'm not sure what KVM provides here (or whether we can add such API). Nothing for now; other architectures simply do not have the issue. As long as it's just VGA, we can quirk it. There's just a couple vendor/device IDs to catch, and the guest can then use a cacheable mapping. For a more generic solution, the API would be madvise(MADV_DONTCACHE). 
It would be easy for QEMU to use it, but I am not too optimistic about convincing the mm folks about it. We can try. Interested to see the outcome. I was thinking of a very basic memory driver that can provide an uncached memslot to QEMU - in mmap() file operation apply pgprot_uncached to allocated pages, lock them, flush TLB call remap_pfn_range(). Mario Paolo ___ kvmarm mailing list kvm...@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 05/03/2015 15:58, Catalin Marinas wrote: It would especially suck if the user has a cluster with different machines, some of them coherent and others non-coherent, and then has to debug why the same configuration works on some machines and not on others. That's a problem indeed, especially with guest migration. But I don't think we have any sane solution here for the bus master DMA. I do not oppose doing cache management in QEMU for bus master DMA (though if the solution you outlined below works it would be great). ARM can override them as well but only making them stricter. Otherwise, on a weakly ordered architecture, it's not always safe (let's say the guest thinks it accesses Strongly Ordered memory and avoids barriers for flag updates but the host upgrades it to Cacheable which breaks the memory order). The same can happen on x86 though, even if it's rarer. You still need a barrier between stores and loads. If we want the host to enforce guest memory mapping attributes via stage 2, we could do it the other way around: get the guests to always assume full cache coherency, generating Normal Cacheable mappings, but use the stage 2 attributes restriction in the host to make such mappings non-cacheable when needed (it works this way on ARM but not in the other direction to relax the attributes). That sounds like a plan for device assignment. But it still would not solve the problem of the MMIO framebuffer, right? The problem arises with MMIO areas that the guest can reasonably expect to be uncacheable, but that are optimized by the host so that they end up backed by cacheable RAM. It's perfectly reasonable that the same device needs cacheable mapping with one userspace, and works with uncacheable mapping with another userspace that doesn't optimize the MMIO area to RAM. Unless the guest allocates the framebuffer itself (e.g. dma_alloc_coherent), we can't control the cacheability via dma-coherent properties as it refers to bus master DMA. Okay, it's good to rule that out. One less thing to think about. :) Same for _DSD. So for MMIO with the buffer allocated by the host (Qemu), the only solution I see on ARM is for the host to ensure coherency, either via explicit cache maintenance (new KVM API) or by changing the memory attributes used by Qemu to access such virtual MMIO. Basically Qemu is acting as a bus master when reading the framebuffer it allocated but the guest considers it a slave access and we don't have a way to tell the guest that such accesses should be cacheable, nor can we upgrade them via architecture features. Yes, that's a way to put it. In practice, the VGA framebuffer has an optimization that uses dirty page tracking, so we could piggyback on the ioctls that return which pages are dirty. It turns out that piggybacking on those ioctls also should fix the case of migrating a guest while the MMU is disabled. Yes, Qemu would need to invalidate the cache before reading a dirty framebuffer page. As I said above, an API that allows non-cacheable mappings for the VGA framebuffer in Qemu would also solve the problem. I'm not sure what KVM provides here (or whether we can add such API). Nothing for now; other architectures simply do not have the issue. As long as it's just VGA, we can quirk it. There's just a couple vendor/device IDs to catch, and the guest can then use a cacheable mapping. For a more generic solution, the API would be madvise(MADV_DONTCACHE). It would be easy for QEMU to use it, but I am not too optimistic about convincing the mm folks about it. We can try. 
Paolo
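A sketch of the dirty-log piggybacking idea discussed above, not actual QEMU code: after KVM_GET_DIRTY_LOG reports which framebuffer pages the guest touched, QEMU would clean+invalidate those pages before reading them, so writes done through the guest's uncached mapping become visible to QEMU's cacheable mapping. The memslot plumbing and the dcache_clean_inval_range() helper are assumed here (one possible shape of that helper is sketched later in the thread, next to the B2.9 discussion).

/*
 * Sketch: sync the emulated VGA framebuffer using KVM dirty page
 * tracking plus explicit cache maintenance on the dirty pages.
 */
#include <linux/kvm.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Hypothetical helper doing clean+invalidate by VA over a range. */
extern void dcache_clean_inval_range(void *addr, size_t len);

static void vga_sync_dirty(int vm_fd, int slot, uint8_t *fb_hva,
			   unsigned long npages, unsigned long page_size)
{
	uint64_t bitmap[(npages + 63) / 64];	/* VLA, for brevity only */
	struct kvm_dirty_log log = {
		.slot = slot,
		.dirty_bitmap = bitmap,
	};
	unsigned long i;

	memset(bitmap, 0, sizeof(bitmap));
	if (ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log) < 0)
		return;

	for (i = 0; i < npages; i++) {
		if (!(bitmap[i / 64] & (1ULL << (i % 64))))
			continue;
		/* Guest wrote this page uncached: drop our stale lines. */
		dcache_clean_inval_range(fb_hva + i * page_size, page_size);
		/* ... now safe to read/convert the page for display ... */
	}
}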
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 5 March 2015 at 15:58, Catalin Marinas catalin.mari...@arm.com wrote: On Thu, Mar 05, 2015 at 01:26:39PM +0100, Paolo Bonzini wrote: On 05/03/2015 13:03, Catalin Marinas wrote: I'd hate to have to do that. PCI should be entirely probeable given that we tell the guest where the host bridge is, that's one of its advantages. I didn't say a DT node per device, the DT doesn't know what PCI devices are available (otherwise it defeats the idea of probing). But we need to tell the OS where the host bridge is via DT. So the guest would be told about two host bridges: one for real devices and another for virtual devices. These can have different coherency properties. Yeah, and it would suck that the user needs to know the difference between the coherency proprties of the host bridges. The host needs to know about this, unless we assume full coherency on all the platforms. Arguably, Qemu needs to know as well if it is the one generating the DT for guest (or at least passing some snippets from the host DT). It would especially suck if the user has a cluster with different machines, some of them coherent and others non-coherent, and then has to debug why the same configuration works on some machines and not on others. That's a problem indeed, especially with guest migration. But I don't think we have any sane solution here for the bus master DMA. To avoid replying in two different places, which of the solutions look to me like something that half-works? Pretty much all of them, because in the end it is just a processor misfeature. For example, Intel virtualization extensions let the hypervisor override stage1 translation _if necessary_. AMD doesn't, but has some other quirky things that let you achieve the same effect.. ARM can override them as well but only making them stricter. Otherwise, on a weakly ordered architecture, it's not always safe (let's say the guest thinks it accesses Strongly Ordered memory and avoids barriers for flag updates but the host upgrades it to Cacheable which breaks the memory order). If we want the host to enforce guest memory mapping attributes via stage 2, we could do it the other way around: get the guests to always assume full cache coherency, generating Normal Cacheable mappings, but use the stage 2 attributes restriction in the host to make such mappings non-cacheable when needed (it works this way on ARM but not in the other direction to relax the attributes). This was precisely the idea of the MAIR mangling patch: promote all stage1 mappings to cacheable, and put the host in control by allowing it to supersede them with device mappings in stage 2. But it appears there are too many corner cases where this doesn't quite work out. In particular, I am not even sure that this is about bus coherency, because this problem does not happen when the device is doing bus master DMA. Working around coherency for bus master DMA would be easy. My previous emails on the dma-coherent property were only about bus master DMA (which would cause the correct selection of the DMA API ops in the guest). But even for bus master DMA, guest OS still needs to be aware of the (virtual) device DMA capabilities (cache coherent or not). You may be able to work around it in the host (stage 2, explicit cache flushing or SMMU attributes) if the guests assumes non-coherency but it's not really efficient (nor nice to implement). The problem arises with MMIO areas that the guest can reasonably expect to be uncacheable, but that are optimized by the host so that they end up backed by cacheable RAM. 
It's perfectly reasonable that the same device needs cacheable mapping with one userspace, and works with uncacheable mapping with another userspace that doesn't optimize the MMIO area to RAM. Unless the guest allocates the framebuffer itself (e.g. dma_alloc_coherent), we can't control the cacheability via dma-coherent properties as it refers to bus master DMA. So for MMIO with the buffer allocated by the host (Qemu), the only solution I see on ARM is for the host to ensure coherency, either via explicit cache maintenance (new KVM API) or by changing the memory attributes used by Qemu to access such virtual MMIO. Basically Qemu is acting as a bus master when reading the framebuffer it allocated but the guest considers it a slave access and we don't have a way to tell the guest that such accesses should be cacheable, nor can we upgrade them via architecture features. Currently the VGA framebuffer is the main case where this happen, and I don't expect many more. Because this is not bus master DMA, it's hard to find a QEMU API that can be hooked to invalidate the cache. QEMU is just reading from an array of chars. I now understand the problem better. I was under the impression that the guest allocates the framebuffer itself and tells Qemu where it is (like in amba-clcd.c for example). In practice, the VGA
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 5 March 2015 at 20:04, Catalin Marinas catalin.mari...@arm.com wrote: On Thu, Mar 05, 2015 at 11:12:22AM +0100, Paolo Bonzini wrote: On 04/03/2015 18:28, Catalin Marinas wrote: Can you add that property to the device tree for PCI devices too? Yes but not with mainline yet: http://thread.gmane.org/gmane.linux.kernel.iommu/8935 We can add the property at the PCI host bridge level (with the drawback that it covers all the PCI devices), like here: Even covering all PCI devices is not enough if we want to support device assignment of PCI host devices. Can we not have another PCI bridge node in the DT for the host device assignments? I'd hate to have to do that. PCI should be entirely probeable given that we tell the guest where the host bridge is, that's one of its advantages. -- PMM -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Thu, Mar 05, 2015 at 08:52:49PM +0900, Peter Maydell wrote: On 5 March 2015 at 20:04, Catalin Marinas catalin.mari...@arm.com wrote: On Thu, Mar 05, 2015 at 11:12:22AM +0100, Paolo Bonzini wrote: On 04/03/2015 18:28, Catalin Marinas wrote: Can you add that property to the device tree for PCI devices too? Yes but not with mainline yet: http://thread.gmane.org/gmane.linux.kernel.iommu/8935 We can add the property at the PCI host bridge level (with the drawback that it covers all the PCI devices), like here: Even covering all PCI devices is not enough if we want to support device assignment of PCI host devices. Can we not have another PCI bridge node in the DT for the host device assignments? I'd hate to have to do that. PCI should be entirely probeable given that we tell the guest where the host bridge is, that's one of its advantages. I didn't say a DT node per device, the DT doesn't know what PCI devices are available (otherwise it defeats the idea of probing). But we need to tell the OS where the host bridge is via DT. So the guest would be told about two host bridges: one for real devices and another for virtual devices. These can have different coherency properties. -- Catalin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 04/03/2015 18:28, Catalin Marinas wrote: Can you add that property to the device tree for PCI devices too? Yes but not with mainline yet: http://thread.gmane.org/gmane.linux.kernel.iommu/8935 We can add the property at the PCI host bridge level (with the drawback that it covers all the PCI devices), like here: Documentation/devicetree/bindings/pci/xgene-pci.txt Even covering all PCI devices is not enough if we want to support device assignment of PCI host devices. I'd rather not spend effort on a solution that we know will only half-work a few months down the road. Paolo
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Thu, Mar 05, 2015 at 11:12:22AM +0100, Paolo Bonzini wrote: On 04/03/2015 18:28, Catalin Marinas wrote: Can you add that property to the device tree for PCI devices too? Yes but not with mainline yet: http://thread.gmane.org/gmane.linux.kernel.iommu/8935 We can add the property at the PCI host bridge level (with the drawback that it covers all the PCI devices), like here: Even covering all PCI devices is not enough if we want to support device assignment of PCI host devices. Can we not have another PCI bridge node in the DT for the host device assignments? I'd rather not spend effort on a solution that we know will only half-work a few months down the road. Which of the solutions are you referring to? On the Applied Micro boards, for example, the PCIe is coherent. If you do device assignment, the guest must be aware of the coherency property of the physical device and behave accordingly, there isn't much the host can do about it. -- Catalin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 05/03/2015 13:03, Catalin Marinas wrote: I'd hate to have to do that. PCI should be entirely probeable given that we tell the guest where the host bridge is, that's one of its advantages. I didn't say a DT node per device, the DT doesn't know what PCI devices are available (otherwise it defeats the idea of probing). But we need to tell the OS where the host bridge is via DT. So the guest would be told about two host bridges: one for real devices and another for virtual devices. These can have different coherency properties. Yeah, and it would suck that the user needs to know the difference between the coherency proprties of the host bridges. It would especially suck if the user has a cluster with different machines, some of them coherent and others non-coherent, and then has to debug why the same configuration works on some machines and not on others. To avoid replying in two different places, which of the solutions look to me like something that half-works? Pretty much all of them, because in the end it is just a processor misfeature. For example, Intel virtualization extensions let the hypervisor override stage1 translation _if necessary_. AMD doesn't, but has some other quirky things that let you achieve the same effect.. In particular, I am not even sure that this is about bus coherency, because this problem does not happen when the device is doing bus master DMA. Working around coherency for bus master DMA would be easy. The problem arises with MMIO areas that the guest can reasonably expect to be uncacheable, but that are optimized by the host so that they end up backed by cacheable RAM. It's perfectly reasonable that the same device needs cacheable mapping with one userspace, and works with uncacheable mapping with another userspace that doesn't optimize the MMIO area to RAM. Currently the VGA framebuffer is the main case where this happen, and I don't expect many more. Because this is not bus master DMA, it's hard to find a QEMU API that can be hooked to invalidate the cache. QEMU is just reading from an array of chars. In practice, the VGA framebuffer has an optimization that uses dirty page tracking, so we could piggyback on the ioctls that return which pages are dirty. It turns out that piggybacking on those ioctls also should fix the case of migrating a guest while the MMU is disabled. But it's a hack, and it may not work for other devices. We could use _DSD to export the device tree property separately for each device, but that wouldn't work for hotplugged devices. Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Thu, Mar 05, 2015 at 01:26:39PM +0100, Paolo Bonzini wrote: On 05/03/2015 13:03, Catalin Marinas wrote: I'd hate to have to do that. PCI should be entirely probeable given that we tell the guest where the host bridge is, that's one of its advantages. I didn't say a DT node per device, the DT doesn't know what PCI devices are available (otherwise it defeats the idea of probing). But we need to tell the OS where the host bridge is via DT. So the guest would be told about two host bridges: one for real devices and another for virtual devices. These can have different coherency properties. Yeah, and it would suck that the user needs to know the difference between the coherency proprties of the host bridges. The host needs to know about this, unless we assume full coherency on all the platforms. Arguably, Qemu needs to know as well if it is the one generating the DT for guest (or at least passing some snippets from the host DT). It would especially suck if the user has a cluster with different machines, some of them coherent and others non-coherent, and then has to debug why the same configuration works on some machines and not on others. That's a problem indeed, especially with guest migration. But I don't think we have any sane solution here for the bus master DMA. To avoid replying in two different places, which of the solutions look to me like something that half-works? Pretty much all of them, because in the end it is just a processor misfeature. For example, Intel virtualization extensions let the hypervisor override stage1 translation _if necessary_. AMD doesn't, but has some other quirky things that let you achieve the same effect.. ARM can override them as well but only making them stricter. Otherwise, on a weakly ordered architecture, it's not always safe (let's say the guest thinks it accesses Strongly Ordered memory and avoids barriers for flag updates but the host upgrades it to Cacheable which breaks the memory order). If we want the host to enforce guest memory mapping attributes via stage 2, we could do it the other way around: get the guests to always assume full cache coherency, generating Normal Cacheable mappings, but use the stage 2 attributes restriction in the host to make such mappings non-cacheable when needed (it works this way on ARM but not in the other direction to relax the attributes). In particular, I am not even sure that this is about bus coherency, because this problem does not happen when the device is doing bus master DMA. Working around coherency for bus master DMA would be easy. My previous emails on the dma-coherent property were only about bus master DMA (which would cause the correct selection of the DMA API ops in the guest). But even for bus master DMA, guest OS still needs to be aware of the (virtual) device DMA capabilities (cache coherent or not). You may be able to work around it in the host (stage 2, explicit cache flushing or SMMU attributes) if the guests assumes non-coherency but it's not really efficient (nor nice to implement). The problem arises with MMIO areas that the guest can reasonably expect to be uncacheable, but that are optimized by the host so that they end up backed by cacheable RAM. It's perfectly reasonable that the same device needs cacheable mapping with one userspace, and works with uncacheable mapping with another userspace that doesn't optimize the MMIO area to RAM. Unless the guest allocates the framebuffer itself (e.g. 
dma_alloc_coherent), we can't control the cacheability via dma-coherent properties as it refers to bus master DMA. So for MMIO with the buffer allocated by the host (Qemu), the only solution I see on ARM is for the host to ensure coherency, either via explicit cache maintenance (new KVM API) or by changing the memory attributes used by Qemu to access such virtual MMIO. Basically Qemu is acting as a bus master when reading the framebuffer it allocated but the guest considers it a slave access and we don't have a way to tell the guest that such accesses should be cacheable, nor can we upgrade them via architecture features. Currently the VGA framebuffer is the main case where this happen, and I don't expect many more. Because this is not bus master DMA, it's hard to find a QEMU API that can be hooked to invalidate the cache. QEMU is just reading from an array of chars. I now understand the problem better. I was under the impression that the guest allocates the framebuffer itself and tells Qemu where it is (like in amba-clcd.c for example). In practice, the VGA framebuffer has an optimization that uses dirty page tracking, so we could piggyback on the ioctls that return which pages are dirty. It turns out that piggybacking on those ioctls also should fix the case of migrating a guest while the MMU is disabled. Yes, Qemu would need to invalidate the cache before reading a dirty framebuffer page. As I said above, an API that allows
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Wed, Mar 04, 2015 at 06:03:11PM +0100, Paolo Bonzini wrote: On 04/03/2015 15:29, Catalin Marinas wrote: I disagree it is 100% a host-side issue. It is a host-side issue _if_ the host tells the guest that the (virtual) device is non-coherent (or, more precisely, it does not explicitly tell the guest that the device is coherent). If the guest thinks the (virtual) device is non-coherent because of information passed by the host, I fully agree that the host needs to manage the cache coherency. However, the host could also pass a dma-coherent property in the DT given to the guest and avoid any form of cache maintenance. If the guest does not honour such coherency property, it's a guest problem and it needs fixing in the guest. This isn't any different from a real physical device behaviour. Can you add that property to the device tree for PCI devices too? Yes but not with mainline yet: http://thread.gmane.org/gmane.linux.kernel.iommu/8935 We can add the property at the PCI host bridge level (with the drawback that it covers all the PCI devices), like here: Documentation/devicetree/bindings/pci/xgene-pci.txt -- Catalin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 04/03/2015 15:29, Catalin Marinas wrote: I disagree it is 100% a host-side issue. It is a host-side issue _if_ the host tells the guest that the (virtual) device is non-coherent (or, more precisely, it does not explicitly tell the guest that the device is coherent). If the guest thinks the (virtual) device is non-coherent because of information passed by the host, I fully agree that the host needs to manage the cache coherency. However, the host could also pass a dma-coherent property in the DT given to the guest and avoid any form of cache maintenance. If the guest does not honour such coherency property, it's a guest problem and it needs fixing in the guest. This isn't any different from a real physical device behaviour. Can you add that property to the device tree for PCI devices too? Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
(please try to avoid top-posting) On Mon, Mar 02, 2015 at 06:20:19PM -0800, Mario Smarduch wrote: On 03/02/2015 08:31 AM, Christoffer Dall wrote: However, my concern with these patches are on two points: 1. It's not a fix-all. We still have the case where the guest expects the behavior of device memory (for strong ordering for example) on a RAM region, which we now break. Similiarly this doesn't support the non-coherent DMA to RAM region case. 2. While the code is probably as nice as this kind of stuff gets, it is non-trivial and extremely difficult to debug. The counter-point here is that we may end up handling other stuff at EL2 for performanc reasons in the future. Mainly because of point 1 above, I am leaning to thinking userspace should do the invalidation when it knows it needs to, either through KVM via a memslot flag or through some other syscall mechanism. I expressed my concerns as well, I'm definitely against merging this series. I don't understand how can the CPU handle different cache attributes used by QEMU and Guest won't you run into B2.9 checklist? Wouldn't cache evictions or cleans wipe out guest updates to same cache line(s)? Clean+invalidate is a safe operation even if the guest accesses the memory in a cacheable way. But if the guest can update the cache lines, Qemu should avoid cache maintenance from a performance perspective. The guest is either told that the DMA is coherent (via DT properties) or Qemu deals with (non-)coherency itself. The latter is fully in line with the B2.9 chapter in the ARM ARM, more precisely point 5: If the mismatched attributes for a memory location all assign the same shareability attribute to the location, any loss of uniprocessor semantics or coherency within a shareability domain can be avoided by use of software cache management. ... it continues with what kind of cache maintenance is required, together with: A clean and invalidate instruction can be used instead of a clean instruction, or instead of an invalidate instruction. -- Catalin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
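One way the "software cache management" referred to in B2.9 point 5 could look from userspace, as a clean+invalidate-by-VA loop. This assumes the host kernel permits EL0 cache maintenance (SCTLR_EL1.UCI set, as Linux does on arm64); it is illustrative, not a drop-in QEMU patch, and it is the kind of helper assumed in the dirty-log sketch earlier in the thread.

/*
 * Clean+invalidate a VA range to PoC from userspace, line by line.
 */
#include <stddef.h>
#include <stdint.h>

void dcache_clean_inval_range(void *addr, size_t len)
{
	uintptr_t p, end = (uintptr_t)addr + len;
	uint64_t ctr;
	size_t line;

	/* CTR_EL0.DminLine = log2(words) of the smallest D-cache line. */
	asm volatile("mrs %0, ctr_el0" : "=r"(ctr));
	line = 4UL << ((ctr >> 16) & 0xf);

	for (p = (uintptr_t)addr & ~(line - 1); p < end; p += line)
		/*
		 * Clean and invalidate, per the ARM ARM guidance that a
		 * clean+invalidate may be used instead of either operation.
		 */
		asm volatile("dc civac, %0" : : "r"(p) : "memory");

	asm volatile("dsb sy" : : : "memory");
}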
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 4 March 2015 at 12:35, Catalin Marinas catalin.mari...@arm.com wrote: (please try to avoid top-posting) On Mon, Mar 02, 2015 at 06:20:19PM -0800, Mario Smarduch wrote: On 03/02/2015 08:31 AM, Christoffer Dall wrote: However, my concern with these patches are on two points: 1. It's not a fix-all. We still have the case where the guest expects the behavior of device memory (for strong ordering for example) on a RAM region, which we now break. Similiarly this doesn't support the non-coherent DMA to RAM region case. 2. While the code is probably as nice as this kind of stuff gets, it is non-trivial and extremely difficult to debug. The counter-point here is that we may end up handling other stuff at EL2 for performanc reasons in the future. Mainly because of point 1 above, I am leaning to thinking userspace should do the invalidation when it knows it needs to, either through KVM via a memslot flag or through some other syscall mechanism. I expressed my concerns as well, I'm definitely against merging this series. Don't worry, that was never the intention, at least not as-is :-) I think we have established that the performance hit is not the problem but the correctness is. I do have a remaining question, though: my original [non-working] approach was to replace uncached mappings with write-through read-allocate write-allocate, which I expected would keep the caches in sync with main memory, but apparently I am misunderstanding something here. (This is the reason for s/0xbb/0xff/ in patch #2 to get it to work: it replaces WT/RA/WA with WB/RA/WA) Is there no way to use write-through caching here? I don't understand how can the CPU handle different cache attributes used by QEMU and Guest won't you run into B2.9 checklist? Wouldn't cache evictions or cleans wipe out guest updates to same cache line(s)? Clean+invalidate is a safe operation even if the guest accesses the memory in a cacheable way. But if the guest can update the cache lines, Qemu should avoid cache maintenance from a performance perspective. The guest is either told that the DMA is coherent (via DT properties) or Qemu deals with (non-)coherency itself. The latter is fully in line with the B2.9 chapter in the ARM ARM, more precisely point 5: If the mismatched attributes for a memory location all assign the same shareability attribute to the location, any loss of uniprocessor semantics or coherency within a shareability domain can be avoided by use of software cache management. ... it continues with what kind of cache maintenance is required, together with: A clean and invalidate instruction can be used instead of a clean instruction, or instead of an invalidate instruction. -- Catalin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
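For reference, the attribute bytes swapped by the s/0xbb/0xff/ change in patch #2 decode as follows (the macro names below are made up; the encodings are from the ARMv8 ARM, upper nibble = outer, lower nibble = inner):

/* 0xbb = Normal, Write-Through non-transient, Read-Allocate,
 *        Write-Allocate, inner and outer (the original attempt). */
#define MAIR_ATTR_NORMAL_WT_RA_WA	0xbbUL
/* 0xff = Normal, Write-Back non-transient, Read-Allocate,
 *        Write-Allocate, inner and outer (what actually worked). */
#define MAIR_ATTR_NORMAL_WB_RA_WA	0xffUL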
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Wed, Mar 04, 2015 at 12:50:57PM +0100, Ard Biesheuvel wrote: On 4 March 2015 at 12:35, Catalin Marinas catalin.mari...@arm.com wrote: On Mon, Mar 02, 2015 at 06:20:19PM -0800, Mario Smarduch wrote: On 03/02/2015 08:31 AM, Christoffer Dall wrote: However, my concern with these patches are on two points: 1. It's not a fix-all. We still have the case where the guest expects the behavior of device memory (for strong ordering for example) on a RAM region, which we now break. Similiarly this doesn't support the non-coherent DMA to RAM region case. 2. While the code is probably as nice as this kind of stuff gets, it is non-trivial and extremely difficult to debug. The counter-point here is that we may end up handling other stuff at EL2 for performanc reasons in the future. Mainly because of point 1 above, I am leaning to thinking userspace should do the invalidation when it knows it needs to, either through KVM via a memslot flag or through some other syscall mechanism. I expressed my concerns as well, I'm definitely against merging this series. Don't worry, that was never the intention, at least not as-is :-) I wasn't worried, just wanted to make my position clearer ;). I think we have established that the performance hit is not the problem but the correctness is. I haven't looked at the performance figures but has anyone assessed the hit caused by doing cache maintenance in Qemu vs cacheable guest accesses (and no maintenance)? I do have a remaining question, though: my original [non-working] approach was to replace uncached mappings with write-through read-allocate write-allocate, Does it make sense to have write-through and write-allocate at the same time? The write-allocate hint would probably be ignored as write-through writes do not generate linefills. which I expected would keep the caches in sync with main memory, but apparently I am misunderstanding something here. (This is the reason for s/0xbb/0xff/ in patch #2 to get it to work: it replaces WT/RA/WA with WB/RA/WA) Is there no way to use write-through caching here? Write-through is considered non-cacheable from a write perspective when it does not hit in the cache. AFAIK, it should still be able to hit existing cache lines and evict. The ARM ARM states that cache cleaning to _PoU_ is not required for coherency when the writes are to write-through memory but I have to dig further into the PoC because that's what we care about here. What platform did you test it on? I can't tell what the behaviour of system caches is. I know they intercept explicit cache maintenance by VA but not sure what happens to write-through writes when they hit in the system cache (are they evicted to RAM or not?). If such write-through writes are only evicted to the point-of-unification, they won't work since non-cacheable accesses go all the way to PoC. I need to do more reading through the ARM ARM, it should be hidden somewhere ;). -- Catalin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 4 March 2015 at 23:29, Catalin Marinas catalin.mari...@arm.com wrote: I disagree it is 100% a host-side issue. It is a host-side issue _if_ the host tells the guest that the (virtual) device is non-coherent (or, more precisely, it does not explicitly tell the guest that the device is coherent). If the guest thinks the (virtual) device is non-coherent because of information passed by the host, I fully agree that the host needs to manage the cache coherency. However, the host could also pass a dma-coherent property in the DT given to the guest and avoid any form of cache maintenance. If the guest does not honour such coherency property, it's a guest problem and it needs fixing in the guest. This isn't any different from a real physical device behaviour. Right, and we should do that for things like virtio, because we want the performance. But we also have devices (like vga framebuffers) which shouldn't be handled as cacheable, so we need to be able to deal with both situations. -- PMM -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Wed, Mar 04, 2015 at 03:12:12PM +0100, Andrew Jones wrote: On Wed, Mar 04, 2015 at 01:43:02PM +0100, Ard Biesheuvel wrote: On 4 March 2015 at 13:29, Catalin Marinas catalin.mari...@arm.com wrote: On Wed, Mar 04, 2015 at 12:50:57PM +0100, Ard Biesheuvel wrote: I think we have established that the performance hit is not the problem but the correctness is. I haven't looked at the performance figures but has anyone assessed the hit caused by doing cache maintenance in Qemu vs cacheable guest accesses (and no maintenance)? I'm working on a PoC of a QEMU/KVM cache maintenance approach now. Hopefully I'll send it out this evening. Tomorrow at the latest. Getting numbers of that approach vs. a guest's use of cached memory for devices would take a decent amount of additional work, so won't be part of that post. OK. I'm actually not sure we should care about the numbers for a guest using normal mem attributes for device memory - other than out of curiosity. For correctness this issue really needs to be solved 100% host-side. We can't rely on a guest to do different/weird things, just because it's a guest. Ideally guests don't even know that they're guests. (Even if we describe the memory as cache-able to the guest, I don't think we can rely on the guest believing us.) I disagree it is 100% a host-side issue. It is a host-side issue _if_ the host tells the guest that the (virtual) device is non-coherent (or, more precisely, it does not explicitly tell the guest that the device is coherent). If the guest thinks the (virtual) device is non-coherent because of information passed by the host, I fully agree that the host needs to manage the cache coherency. However, the host could also pass a dma-coherent property in the DT given to the guest and avoid any form of cache maintenance. If the guest does not honour such coherency property, it's a guest problem and it needs fixing in the guest. This isn't any different from a real physical device behaviour. (there are counter arguments for the latter as well like emulating old platforms that never had coherency but from a performance/production perspective, I strongly recommend that guests are passed the dma-coherent property for such virtual devices) -- Catalin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 4 March 2015 at 13:29, Catalin Marinas catalin.mari...@arm.com wrote: On Wed, Mar 04, 2015 at 12:50:57PM +0100, Ard Biesheuvel wrote: On 4 March 2015 at 12:35, Catalin Marinas catalin.mari...@arm.com wrote: On Mon, Mar 02, 2015 at 06:20:19PM -0800, Mario Smarduch wrote: On 03/02/2015 08:31 AM, Christoffer Dall wrote: However, my concern with these patches are on two points: 1. It's not a fix-all. We still have the case where the guest expects the behavior of device memory (for strong ordering for example) on a RAM region, which we now break. Similiarly this doesn't support the non-coherent DMA to RAM region case. 2. While the code is probably as nice as this kind of stuff gets, it is non-trivial and extremely difficult to debug. The counter-point here is that we may end up handling other stuff at EL2 for performanc reasons in the future. Mainly because of point 1 above, I am leaning to thinking userspace should do the invalidation when it knows it needs to, either through KVM via a memslot flag or through some other syscall mechanism. I expressed my concerns as well, I'm definitely against merging this series. Don't worry, that was never the intention, at least not as-is :-) I wasn't worried, just wanted to make my position clearer ;). I think we have established that the performance hit is not the problem but the correctness is. I haven't looked at the performance figures but has anyone assessed the hit caused by doing cache maintenance in Qemu vs cacheable guest accesses (and no maintenance)? No, I don't think so. The performance hit I am referring to is the performance hit caused by leaving the trapping of VM system register writes enabled all the time, so that writes to MAIR_EL1 are always caught. This is why patch #1 implements some of the sysreg write handling in EL2 I do have a remaining question, though: my original [non-working] approach was to replace uncached mappings with write-through read-allocate write-allocate, Does it make sense to have write-through and write-allocate at the same time? The write-allocate hint would probably be ignored as write-through writes do not generate linefills. OK, that answers my question then. The availability of a write-allocate setting on write-through attributes suggested to me that writes would go to both the cache and main memory, so that the write-back cached attribute the host is using for the same memory would not result in it reading stale data. which I expected would keep the caches in sync with main memory, but apparently I am misunderstanding something here. (This is the reason for s/0xbb/0xff/ in patch #2 to get it to work: it replaces WT/RA/WA with WB/RA/WA) Is there no way to use write-through caching here? Write-through is considered non-cacheable from a write perspective when it does not hit in the cache. AFAIK, it should still be able to hit existing cache lines and evict. The ARM ARM states that cache cleaning to _PoU_ is not required for coherency when the writes are to write-through memory but I have to dig further into the PoC because that's what we care about here. What platform did you test it on? I can't tell what the behaviour of system caches is. I know they intercept explicit cache maintenance by VA but not sure what happens to write-through writes when they hit in the system cache (are they evicted to RAM or not?). If such write-through writes are only evicted to the point-of-unification, they won't work since non-cacheable accesses go all the way to PoC. 
This was tested on APM by Drew and Laszlo (thanks guys). I have recently received a Seattle myself, but I haven't had time yet to test these patches. I need to do more reading through the ARM ARM, it should be hidden somewhere ;). If you say so :-) -- Ard.
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Wed, Mar 04, 2015 at 01:43:02PM +0100, Ard Biesheuvel wrote: On 4 March 2015 at 13:29, Catalin Marinas catalin.mari...@arm.com wrote: On Wed, Mar 04, 2015 at 12:50:57PM +0100, Ard Biesheuvel wrote: On 4 March 2015 at 12:35, Catalin Marinas catalin.mari...@arm.com wrote: On Mon, Mar 02, 2015 at 06:20:19PM -0800, Mario Smarduch wrote: On 03/02/2015 08:31 AM, Christoffer Dall wrote: However, my concern with these patches are on two points: 1. It's not a fix-all. We still have the case where the guest expects the behavior of device memory (for strong ordering for example) on a RAM region, which we now break. Similiarly this doesn't support the non-coherent DMA to RAM region case. 2. While the code is probably as nice as this kind of stuff gets, it is non-trivial and extremely difficult to debug. The counter-point here is that we may end up handling other stuff at EL2 for performanc reasons in the future. Mainly because of point 1 above, I am leaning to thinking userspace should do the invalidation when it knows it needs to, either through KVM via a memslot flag or through some other syscall mechanism. I expressed my concerns as well, I'm definitely against merging this series. Don't worry, that was never the intention, at least not as-is :-) I wasn't worried, just wanted to make my position clearer ;). I think we have established that the performance hit is not the problem but the correctness is. I haven't looked at the performance figures but has anyone assessed the hit caused by doing cache maintenance in Qemu vs cacheable guest accesses (and no maintenance)? I'm working on a PoC of a QEMU/KVM cache maintenance approach now. Hopefully I'll send it out this evening. Tomorrow at the latest. Getting numbers of that approach vs. a guest's use of cached memory for devices would take a decent amount of additional work, so won't be part of that post. I'm actually not sure we should care about the numbers for a guest using normal mem attributes for device memory - other than out of curiosity. For correctness this issue really needs to be solved 100% host-side. We can't rely on a guest to do different/weird things, just because it's a guest. Ideally guests don't even know that they're guests. (Even if we describe the memory as cache-able to the guest, I don't think we can rely on the guest believing us.) drew -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Thu, Feb 19, 2015 at 10:54:43AM +, Ard Biesheuvel wrote: This is a 0th order approximation of how we could potentially force the guest to avoid uncached mappings, at least from the moment the MMU is on. (Before that, all of memory is implicitly classified as Device-nGnRnE) That's just for data accesses. IIRC instructions are cacheable on ARMv8 (though I think without allocation in the unified caches). The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings with cached ones. This way, there is no need to mangle any guest page tables. There is another big downside to this - breaking the guest assumptions about the (non-)cacheability of its mappings. It also only works for guests that use MAIR_EL1 (LPAE). We have two main cases where the guest and host cacheability do not match: 1. During boot, as you said, when the MMU is off. What we have done in the guest kernel is to invalidate the data ranges that it writes with the MMU off in case there were any speculatively loaded cache lines via the cacheable mappings (in the host). We don't have any nice solution in the host here and MAIR_EL1 tweaking does not work 2. Guest explicitly creating a non-cacheable mapping (MMU enabled). Here we have two sub-cases: a) guest-only accesses to such mapping. The guest would need to perform cache maintenance as required if it ever accesses such memory via cacheable mappings (we do this already, see the streaming DMA API) b) memory shared with the host: e.g Qemu emulating DMA (frame buffer etc.) This 2.b case is not any different than the OS dealing with a (non-)coherent DMA-capable device. If the device is coherent, the DMA buffer in the guest must be coherent as well, otherwise non-coherent. Imagine a real VGA device that always snoops CPU caches. You would not create a non-cacheable frame buffer mapping since the device cannot see the updates and only read stale cache entries. We don't (can't) have a safe set of DMA ops that would work in both cases. So if Qemu cannot use a non-cacheable mapping or cannot perform cache maintenance, the only solution is to tell the guest that such virtual device is cache _coherent_. This also gives you better performance overall anyway. -- Catalin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
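To make the "explicit cache maintenance" option concrete, here is a minimal sketch of a clean+invalidate by VA that Qemu could run over a framebuffer range, assuming the kernel has enabled EL0 data cache maintenance (SCTLR_EL1.UCI) and assuming a fixed 64-byte line instead of reading CTR_EL0; there is no such KVM or Qemu interface today:

#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE	64

/* Clean+invalidate [buf, buf+len) to the PoC by VA from userspace. */
static void flush_dcache_range(void *buf, size_t len)
{
	uintptr_t p = (uintptr_t)buf & ~((uintptr_t)CACHE_LINE - 1);
	uintptr_t end = (uintptr_t)buf + len;

	for (; p < end; p += CACHE_LINE)
		asm volatile("dc civac, %0" : : "r" (p) : "memory");
	asm volatile("dsb sy" : : : "memory");
}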
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 03/03/15 18:34, Alexander Graf wrote: On 02/19/2015 11:54 AM, Ard Biesheuvel wrote: This is a 0th order approximation of how we could potentially force the guest to avoid uncached mappings, at least from the moment the MMU is on. (Before that, all of memory is implicitly classified as Device-nGnRnE) The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings with cached ones. This way, there is no need to mangle any guest page tables. The downside is that, to do this correctly, we need to always trap writes to the VM sysreg group, which includes registers that the guest may write to very often. To reduce the associated performance hit, patch #1 introduces a fast path for EL2 to perform trivial sysreg writes on behalf of the guest, without the need for a full world switch to the host and back. The main purpose of these patches is to quantify the performance hit, and verify whether the MAIR_EL1 handling works correctly. I gave this a quick spin on a VM running with QEMU. * VGA output is still distorted, I get random junk black lines in the output in between * When I add -device nec-usb-xhci -device usb-kbd the VM doesn't even boot up Do you also have the dirty page tracking patches in your host kernel? I needed both (and got them via Drew's backport, thanks) and then both VGA and USB started working fine. Without the MAIR patches, I got cache-line size random corruptions in the VGA display (16 pixel wide small segments). Without dirty page tracking, big chunks (sometimes even almost the entire screen) was blank. Regarding USB, unless you have both of the patchsets in the host kernel, the guest will indeed crash early during boot. Gerd confirmed for me that usb controller (all uhci/ehci/xhci) pci regions see both read (status bits) and write (control bits) access. So if there's any corruption in there, on read, that looks like a malfunctioning piece of hw for the guest kernel, and in this case it happens to crash. With TCG, both bits work fine. Yep. Thanks Laszlo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Tue, Mar 03, 2015 at 07:13:48PM +0100, Laszlo Ersek wrote: On 03/03/15 18:34, Alexander Graf wrote: On 02/19/2015 11:54 AM, Ard Biesheuvel wrote: This is a 0th order approximation of how we could potentially force the guest to avoid uncached mappings, at least from the moment the MMU is on. (Before that, all of memory is implicitly classified as Device-nGnRnE) The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings with cached ones. This way, there is no need to mangle any guest page tables. The downside is that, to do this correctly, we need to always trap writes to the VM sysreg group, which includes registers that the guest may write to very often. To reduce the associated performance hit, patch #1 introduces a fast path for EL2 to perform trivial sysreg writes on behalf of the guest, without the need for a full world switch to the host and back. The main purpose of these patches is to quantify the performance hit, and verify whether the MAIR_EL1 handling works correctly. I gave this a quick spin on a VM running with QEMU. * VGA output is still distorted, I get random junk black lines in the output in between * When I add -device nec-usb-xhci -device usb-kbd the VM doesn't even boot up Do you also have the dirty page tracking patches in your host kernel? I needed both (and got them via Drew's backport, thanks) and then both VGA and USB started working fine. Assuming you have the dirty page tracking already, then you're probably missing the fixup to patch 2/3, s/0xbb/0xff/ Without the MAIR patches, I got cache-line size random corruptions in the VGA display (16 pixel wide small segments). Without dirty page tracking, big chunks (sometimes even almost the entire screen) was blank. Regarding USB, unless you have both of the patchsets in the host kernel, the guest will indeed crash early during boot. Gerd confirmed for me that usb controller (all uhci/ehci/xhci) pci regions see both read (status bits) and write (control bits) access. So if there's any corruption in there, on read, that looks like a malfunctioning piece of hw for the guest kernel, and in this case it happens to crash. With TCG, both bits work fine. Yep. Thanks Laszlo ___ kvmarm mailing list kvm...@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 02/19/2015 11:54 AM, Ard Biesheuvel wrote: This is a 0th order approximation of how we could potentially force the guest to avoid uncached mappings, at least from the moment the MMU is on. (Before that, all of memory is implicitly classified as Device-nGnRnE) The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings with cached ones. This way, there is no need to mangle any guest page tables. The downside is that, to do this correctly, we need to always trap writes to the VM sysreg group, which includes registers that the guest may write to very often. To reduce the associated performance hit, patch #1 introduces a fast path for EL2 to perform trivial sysreg writes on behalf of the guest, without the need for a full world switch to the host and back. The main purpose of these patches is to quantify the performance hit, and verify whether the MAIR_EL1 handling works correctly. I gave this a quick spin on a VM running with QEMU. * VGA output is still distorted, I get random junk black lines in the output in between * When I add -device nec-usb-xhci -device usb-kbd the VM doesn't even boot up With TCG, both bits work fine. Alex Ard Biesheuvel (3): arm64: KVM: handle some sysreg writes in EL2 arm64: KVM: mangle MAIR register to prevent uncached guest mappings arm64: KVM: keep trapping of VM sysreg writes enabled arch/arm/kvm/mmu.c | 2 +- arch/arm64/include/asm/kvm_arm.h | 2 +- arch/arm64/kvm/hyp.S | 101 +++ arch/arm64/kvm/sys_regs.c| 63 4 files changed, 156 insertions(+), 12 deletions(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
Hi Christoffer, I don't understand how can the CPU handle different cache attributes used by QEMU and Guest won't you run into B2.9 checklist? Wouldn't cache evictions or cleans wipe out guest updates to same cache line(s)? - Mario On 03/02/2015 08:31 AM, Christoffer Dall wrote: On Tue, Feb 24, 2015 at 05:47:19PM +, Ard Biesheuvel wrote: On 24 February 2015 at 14:55, Andrew Jones drjo...@redhat.com wrote: On Fri, Feb 20, 2015 at 04:36:26PM +0100, Andrew Jones wrote: On Fri, Feb 20, 2015 at 02:37:25PM +, Ard Biesheuvel wrote: On 20 February 2015 at 14:29, Andrew Jones drjo...@redhat.com wrote: So looks like the 3 orders of magnitude greater number of traps (only to el2) don't impact kernel compiles. OK, good! That was what I was hoping for, obviously. Then I thought I'd be able to quick measure the number of cycles a trap to el2 takes with this kvm-unit-tests test int main(void) { unsigned long start, end; unsigned int sctlr; asm volatile( mrs %0, sctlr_el1\n msr pmcr_el0, %1\n : =r (sctlr) : r (5)); asm volatile( mrs %0, pmccntr_el0\n msr sctlr_el1, %2\n mrs %1, pmccntr_el0\n : =r (start), =r (end) : r (sctlr)); printf(%llx\n, end - start); return 0; } after applying this patch to kvm diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S index bb91b6fc63861..5de39d740aa58 100644 --- a/arch/arm64/kvm/hyp.S +++ b/arch/arm64/kvm/hyp.S @@ -770,7 +770,7 @@ mrs x2, mdcr_el2 and x2, x2, #MDCR_EL2_HPMN_MASK - orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) +// orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) orr x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA) // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap But I get zero for the cycle count. Not sure what I'm missing. No clue tbh. Does the counter work as expected in the host? Guess not. I dropped the test into a module_init and inserted it on the host. Always get zero for pmccntr_el0 reads. Or, if I set it to something non-zero with a write, then I always get that back - no increments. pmcr_el0 looks OK... I had forgotten to set bit 31 of pmcntenset_el0, but doing that still doesn't help. Anyway, I assume the problem is me. I'll keep looking to see what I'm missing. I returned to this and see that the problem was indeed me. I needed yet another enable bit set (the filter register needed to be instructed to count cycles while in el2). I've attached the code for the curious. The numbers are mean=6999, std_dev=242. Run on the host, or in a guest running on a host without this patch series (after TVM traps have been disabled), I get a pretty consistent 40. I checked how many vm-sysreg traps we do during the kernel compile benchmark. It's 124924. So it's a bit strange that we don't see the benchmark taking 10 to 20 seconds longer on average. I should probably double check my runs. In any case, while I like the approach of this series, the overhead is looking non-negligible. Thanks a lot for producing these numbers. 125k x 7k == 1 billion cycles == 1 second on a 1 GHz machine, I think? Or am I missing something? How long does the actual compile take? I ran a sequence of benchmarks that I occasionally run (pbzip, kernbench, and hackbench) and I also saw 1% performance degradation, so I think we can trust that somewhat. (I can post the raw numbers when I have ssh access to my Linux desktop - sending this from Somewhere Over The Atlantic). However, my concern with these patches are on two points: 1. It's not a fix-all. 
We still have the case where the guest expects the behavior of device memory (for strong ordering for example) on a RAM region, which we now break. Similarly, this doesn't support the non-coherent DMA to RAM region case. 2. While the code is probably as nice as this kind of stuff gets, it is non-trivial and extremely difficult to debug. The counter-point here is that we may end up handling other stuff at EL2 for performance reasons in the future. Mainly because of point 1 above, I am leaning to thinking userspace should do the invalidation when it knows it needs to, either through KVM via a memslot flag or through some other syscall mechanism. Thanks, -Christoffer
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Tue, Feb 24, 2015 at 05:47:19PM +, Ard Biesheuvel wrote: On 24 February 2015 at 14:55, Andrew Jones drjo...@redhat.com wrote: On Fri, Feb 20, 2015 at 04:36:26PM +0100, Andrew Jones wrote: On Fri, Feb 20, 2015 at 02:37:25PM +, Ard Biesheuvel wrote: On 20 February 2015 at 14:29, Andrew Jones drjo...@redhat.com wrote: So looks like the 3 orders of magnitude greater number of traps (only to el2) don't impact kernel compiles. OK, good! That was what I was hoping for, obviously. Then I thought I'd be able to quick measure the number of cycles a trap to el2 takes with this kvm-unit-tests test int main(void) { unsigned long start, end; unsigned int sctlr; asm volatile( mrs %0, sctlr_el1\n msr pmcr_el0, %1\n : =r (sctlr) : r (5)); asm volatile( mrs %0, pmccntr_el0\n msr sctlr_el1, %2\n mrs %1, pmccntr_el0\n : =r (start), =r (end) : r (sctlr)); printf(%llx\n, end - start); return 0; } after applying this patch to kvm diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S index bb91b6fc63861..5de39d740aa58 100644 --- a/arch/arm64/kvm/hyp.S +++ b/arch/arm64/kvm/hyp.S @@ -770,7 +770,7 @@ mrs x2, mdcr_el2 and x2, x2, #MDCR_EL2_HPMN_MASK - orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) +// orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) orr x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA) // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap But I get zero for the cycle count. Not sure what I'm missing. No clue tbh. Does the counter work as expected in the host? Guess not. I dropped the test into a module_init and inserted it on the host. Always get zero for pmccntr_el0 reads. Or, if I set it to something non-zero with a write, then I always get that back - no increments. pmcr_el0 looks OK... I had forgotten to set bit 31 of pmcntenset_el0, but doing that still doesn't help. Anyway, I assume the problem is me. I'll keep looking to see what I'm missing. I returned to this and see that the problem was indeed me. I needed yet another enable bit set (the filter register needed to be instructed to count cycles while in el2). I've attached the code for the curious. The numbers are mean=6999, std_dev=242. Run on the host, or in a guest running on a host without this patch series (after TVM traps have been disabled), I get a pretty consistent 40. I checked how many vm-sysreg traps we do during the kernel compile benchmark. It's 124924. So it's a bit strange that we don't see the benchmark taking 10 to 20 seconds longer on average. I should probably double check my runs. In any case, while I like the approach of this series, the overhead is looking non-negligible. Thanks a lot for producing these numbers. 125k x 7k == 1 billion cycles == 1 second on a 1 GHz machine, I think? Or am I missing something? How long does the actual compile take? I ran a sequence of benchmarks that I occasionally run (pbzip, kernbench, and hackbench) and I also saw 1% performance degradation, so I think we can trust that somewhat. (I can post the raw numbers when I have ssh access to my Linux desktop - sending this from Somewhere Over The Atlantic). However, my concern with these patches are on two points: 1. It's not a fix-all. We still have the case where the guest expects the behavior of device memory (for strong ordering for example) on a RAM region, which we now break. Similiarly this doesn't support the non-coherent DMA to RAM region case. 2. While the code is probably as nice as this kind of stuff gets, it is non-trivial and extremely difficult to debug. 
The counter-point here is that we may end up handling other stuff at EL2 for performance reasons in the future. Mainly because of point 1 above, I am leaning to thinking userspace should do the invalidation when it knows it needs to, either through KVM via a memslot flag or through some other syscall mechanism. Thanks, -Christoffer
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 03/02/15 17:47, Paolo Bonzini wrote: Also, we may want to invalidate the cache for dirty pages before returning the dirty bitmap, and probably should do that directly in KVM_GET_DIRTY_LOG. I agree. If KVM_GET_DIRTY_LOG is supposed to be atomic fetch and clear (from userspace's aspect), then the cache invalidation should be an atomic part of it too (from the same aspect). (Sorry if I just said something incredibly stupid.) Laszlo
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Mon, Mar 02, 2015 at 05:55:44PM +0100, Laszlo Ersek wrote: On 03/02/15 17:47, Paolo Bonzini wrote: Also, we may want to invalidate the cache for dirty pages before returning the dirty bitmap, and probably should do that directly in KVM_GET_DIRTY_LOG. I agree. If KVM_GET_DIRTY_LOG is supposed to be atomic fetch and clear (from userspace's aspect), then the cache invalidation should be an atomic part of it too (from the same aspect). (Sorry if I just said something incredibly stupid.) With the path I'm headed down, all cache maintenance operations will be done before exiting to userspace (and after returning). I was actually already letting a feature creep into this PoC by setting KVM_MEM_LOG_DIRTY_PAGES when we see KVM_MEM_INCOHERENT has been set, and the region isn't readonly. The dirty log would then be used by KVM internally to know exactly which pages need to be invalidated before the exit. drew
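To illustrate the memslot-flag direction, this is roughly what registering an "incoherent" framebuffer slot from userspace could look like. KVM_MEM_INCOHERENT is only being prototyped in this thread, so its value below is made up; the structure and KVM_SET_USER_MEMORY_REGION are the existing interfaces:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#ifndef KVM_MEM_INCOHERENT
#define KVM_MEM_INCOHERENT	(1UL << 2)	/* hypothetical value */
#endif

static int register_incoherent_slot(int vm_fd, uint32_t slot, uint64_t gpa,
				    uint64_t size, void *hva)
{
	struct kvm_userspace_memory_region region;

	memset(&region, 0, sizeof(region));
	region.slot = slot;
	region.flags = KVM_MEM_INCOHERENT | KVM_MEM_LOG_DIRTY_PAGES;
	region.guest_phys_addr = gpa;
	region.memory_size = size;
	region.userspace_addr = (uintptr_t)hva;

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}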
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 02/03/2015 17:31, Christoffer Dall wrote: 2. While the code is probably as nice as this kind of stuff gets, it is non-trivial and extremely difficult to debug. The counter-point here is that we may end up handling other stuff at EL2 for performance reasons in the future. Mainly because of point 1 above, I am leaning to thinking userspace should do the invalidation when it knows it needs to, either through KVM via a memslot flag or through some other syscall mechanism. I'm okay with adding a KVM capability and ioctl that flushes the dcache for a given gpa range. However: 1) I'd like to have an implementation for QEMU and/or kvmtool before accepting that ioctl. 2) I think the ioctl should work whatever the stage1 mapping is (e.g. with and without Ard's patches, with and without Laszlo's OVMF patch, etc.). Also, we may want to invalidate the cache for dirty pages before returning the dirty bitmap, and probably should do that directly in KVM_GET_DIRTY_LOG. Paolo
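A sketch of the dirty-bitmap-driven variant from the userspace side: fetch the dirty log for the framebuffer slot and invalidate only the pages that changed before reading them. flush_dcache_range() is the hypothetical helper sketched earlier in the thread; if KVM ends up doing the invalidation inside KVM_GET_DIRTY_LOG itself, the flush loop goes away:

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

void flush_dcache_range(void *buf, size_t len);	/* from the earlier sketch */

static int sync_dirty_fb(int vm_fd, uint32_t slot, void *fb_hva,
			 unsigned long npages, unsigned long page_size)
{
	unsigned long nlongs = (npages + 63) / 64;
	uint64_t bitmap[nlongs];	/* fine for a small slot; heap-allocate otherwise */
	struct kvm_dirty_log log;
	unsigned long i;

	memset(bitmap, 0, nlongs * sizeof(*bitmap));
	memset(&log, 0, sizeof(log));
	log.slot = slot;
	log.dirty_bitmap = bitmap;

	if (ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log) < 0)
		return -1;

	for (i = 0; i < npages; i++)
		if (bitmap[i / 64] & (1ULL << (i % 64)))
			flush_dcache_range((char *)fb_hva + i * page_size,
					   page_size);
	return 0;
}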
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Mon, Mar 02, 2015 at 08:31:46AM -0800, Christoffer Dall wrote: On Tue, Feb 24, 2015 at 05:47:19PM +, Ard Biesheuvel wrote: On 24 February 2015 at 14:55, Andrew Jones drjo...@redhat.com wrote: On Fri, Feb 20, 2015 at 04:36:26PM +0100, Andrew Jones wrote: On Fri, Feb 20, 2015 at 02:37:25PM +, Ard Biesheuvel wrote: On 20 February 2015 at 14:29, Andrew Jones drjo...@redhat.com wrote: So looks like the 3 orders of magnitude greater number of traps (only to el2) don't impact kernel compiles. OK, good! That was what I was hoping for, obviously. Then I thought I'd be able to quick measure the number of cycles a trap to el2 takes with this kvm-unit-tests test int main(void) { unsigned long start, end; unsigned int sctlr; asm volatile( mrs %0, sctlr_el1\n msr pmcr_el0, %1\n : =r (sctlr) : r (5)); asm volatile( mrs %0, pmccntr_el0\n msr sctlr_el1, %2\n mrs %1, pmccntr_el0\n : =r (start), =r (end) : r (sctlr)); printf(%llx\n, end - start); return 0; } after applying this patch to kvm diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S index bb91b6fc63861..5de39d740aa58 100644 --- a/arch/arm64/kvm/hyp.S +++ b/arch/arm64/kvm/hyp.S @@ -770,7 +770,7 @@ mrs x2, mdcr_el2 and x2, x2, #MDCR_EL2_HPMN_MASK - orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) +// orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) orr x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA) // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap But I get zero for the cycle count. Not sure what I'm missing. No clue tbh. Does the counter work as expected in the host? Guess not. I dropped the test into a module_init and inserted it on the host. Always get zero for pmccntr_el0 reads. Or, if I set it to something non-zero with a write, then I always get that back - no increments. pmcr_el0 looks OK... I had forgotten to set bit 31 of pmcntenset_el0, but doing that still doesn't help. Anyway, I assume the problem is me. I'll keep looking to see what I'm missing. I returned to this and see that the problem was indeed me. I needed yet another enable bit set (the filter register needed to be instructed to count cycles while in el2). I've attached the code for the curious. The numbers are mean=6999, std_dev=242. Run on the host, or in a guest running on a host without this patch series (after TVM traps have been disabled), I get a pretty consistent 40. I checked how many vm-sysreg traps we do during the kernel compile benchmark. It's 124924. So it's a bit strange that we don't see the benchmark taking 10 to 20 seconds longer on average. I should probably double check my runs. In any case, while I like the approach of this series, the overhead is looking non-negligible. Thanks a lot for producing these numbers. 125k x 7k == 1 billion cycles == 1 second on a 1 GHz machine, I think? Or am I missing something? How long does the actual compile take? I ran a sequence of benchmarks that I occasionally run (pbzip, kernbench, and hackbench) and I also saw 1% performance degradation, so I think we can trust that somewhat. (I can post the raw numbers when I have ssh access to my Linux desktop - sending this from Somewhere Over The Atlantic). However, my concern with these patches are on two points: 1. It's not a fix-all. We still have the case where the guest expects the behavior of device memory (for strong ordering for example) on a RAM region, which we now break. Similiarly this doesn't support the non-coherent DMA to RAM region case. 2. While the code is probably as nice as this kind of stuff gets, it is non-trivial and extremely difficult to debug. 
The counter-point here is that we may end up handling other stuff at EL2 for performance reasons in the future. Mainly because of point 1 above, I am leaning to thinking userspace should do the invalidation when it knows it needs to, either through KVM via a memslot flag or through some other syscall mechanism. I've started down the memslot flag road by promoting KVM_MEMSLOT_INCOHERENT to uapi/KVM_MEM_INCOHERENT, replacing the readonly memslot heuristic. With a couple more changes it should work for all memory regions with the 'incoherent' property. I'll make some changes to QEMU to test it all out as well. Progress was slow last week due to too many higher priority tasks, but I plan to return to it this week. Thanks, drew Thanks, -Christoffer
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Fri, Feb 20, 2015 at 04:36:26PM +0100, Andrew Jones wrote: On Fri, Feb 20, 2015 at 02:37:25PM +, Ard Biesheuvel wrote: On 20 February 2015 at 14:29, Andrew Jones drjo...@redhat.com wrote: So looks like the 3 orders of magnitude greater number of traps (only to el2) don't impact kernel compiles. OK, good! That was what I was hoping for, obviously. Then I thought I'd be able to quick measure the number of cycles a trap to el2 takes with this kvm-unit-tests test int main(void) { unsigned long start, end; unsigned int sctlr; asm volatile( mrs %0, sctlr_el1\n msr pmcr_el0, %1\n : =r (sctlr) : r (5)); asm volatile( mrs %0, pmccntr_el0\n msr sctlr_el1, %2\n mrs %1, pmccntr_el0\n : =r (start), =r (end) : r (sctlr)); printf(%llx\n, end - start); return 0; } after applying this patch to kvm diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S index bb91b6fc63861..5de39d740aa58 100644 --- a/arch/arm64/kvm/hyp.S +++ b/arch/arm64/kvm/hyp.S @@ -770,7 +770,7 @@ mrs x2, mdcr_el2 and x2, x2, #MDCR_EL2_HPMN_MASK - orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) +// orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) orr x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA) // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap But I get zero for the cycle count. Not sure what I'm missing. No clue tbh. Does the counter work as expected in the host? Guess not. I dropped the test into a module_init and inserted it on the host. Always get zero for pmccntr_el0 reads. Or, if I set it to something non-zero with a write, then I always get that back - no increments. pmcr_el0 looks OK... I had forgotten to set bit 31 of pmcntenset_el0, but doing that still doesn't help. Anyway, I assume the problem is me. I'll keep looking to see what I'm missing. I returned to this and see that the problem was indeed me. I needed yet another enable bit set (the filter register needed to be instructed to count cycles while in el2). I've attached the code for the curious. The numbers are mean=6999, std_dev=242. Run on the host, or in a guest running on a host without this patch series (after TVM traps have been disabled), I get a pretty consistent 40. I checked how many vm-sysreg traps we do during the kernel compile benchmark. It's 124924. So it's a bit strange that we don't see the benchmark taking 10 to 20 seconds longer on average. I should probably double check my runs. In any case, while I like the approach of this series, the overhead is looking non-negligible. drew #include libcflat.h static void prep_cc(void) { asm volatile( msr pmovsclr_el0, %0\n msr pmccfiltr_el0, %1\n msr pmcntenset_el0, %2\n msr pmcr_el0, %3\n isb\n : : r (1 31), r (1 27), r (1 31), r (1 6 | 1 2 | 1 0)); } int main(void) { unsigned long start, end; unsigned int sctlr; int i, zeros = 0; asm volatile(mrs %0, sctlr_el1 : =r (sctlr)); prep_cc(); for (i = 0; i 10; ++i) { asm volatile( mrs %0, pmccntr_el0\n msr sctlr_el1, %2\n mrs %1, pmccntr_el0\n isb\n : =r (start), =r (end) : r (sctlr)); if ((i % 10) == 0) printf(\n); printf( %d, end - start); if ((end - start) == 0) { ++zeros; prep_cc(); } } printf(\nnum zero counts = %d\n, zeros); return 0; }
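For readability, here is the attached test again with the string quotes and shift operators that the list archive stripped put back; the loop bound is reproduced as it appears above (the comparison operator was lost, so it may have been larger in the original), and the bits are the ones described: pmccfiltr_el0 bit 27 to count cycles in EL2, pmcntenset_el0 bit 31 for the cycle counter.

#include <libcflat.h>

static void prep_cc(void)
{
	asm volatile(
	"	msr	pmovsclr_el0, %0\n"
	"	msr	pmccfiltr_el0, %1\n"
	"	msr	pmcntenset_el0, %2\n"
	"	msr	pmcr_el0, %3\n"
	"	isb\n"
	: : "r" (1UL << 31), "r" (1UL << 27), "r" (1UL << 31),
	    "r" (1UL << 6 | 1UL << 2 | 1UL << 0));
}

int main(void)
{
	unsigned long start, end;
	unsigned int sctlr;
	int i, zeros = 0;

	asm volatile("mrs %0, sctlr_el1" : "=r" (sctlr));
	prep_cc();

	for (i = 0; i < 10; ++i) {
		asm volatile(
		"	mrs	%0, pmccntr_el0\n"
		"	msr	sctlr_el1, %2\n"
		"	mrs	%1, pmccntr_el0\n"
		"	isb\n"
		: "=r" (start), "=r" (end) : "r" (sctlr));
		if ((i % 10) == 0)
			printf("\n");
		printf(" %d", (int)(end - start));
		if ((end - start) == 0) {
			++zeros;
			prep_cc();
		}
	}
	printf("\nnum zero counts = %d\n", zeros);
	return 0;
}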
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 24 February 2015 at 14:55, Andrew Jones drjo...@redhat.com wrote: On Fri, Feb 20, 2015 at 04:36:26PM +0100, Andrew Jones wrote: On Fri, Feb 20, 2015 at 02:37:25PM +, Ard Biesheuvel wrote: On 20 February 2015 at 14:29, Andrew Jones drjo...@redhat.com wrote: So looks like the 3 orders of magnitude greater number of traps (only to el2) don't impact kernel compiles. OK, good! That was what I was hoping for, obviously. Then I thought I'd be able to quick measure the number of cycles a trap to el2 takes with this kvm-unit-tests test int main(void) { unsigned long start, end; unsigned int sctlr; asm volatile( mrs %0, sctlr_el1\n msr pmcr_el0, %1\n : =r (sctlr) : r (5)); asm volatile( mrs %0, pmccntr_el0\n msr sctlr_el1, %2\n mrs %1, pmccntr_el0\n : =r (start), =r (end) : r (sctlr)); printf(%llx\n, end - start); return 0; } after applying this patch to kvm diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S index bb91b6fc63861..5de39d740aa58 100644 --- a/arch/arm64/kvm/hyp.S +++ b/arch/arm64/kvm/hyp.S @@ -770,7 +770,7 @@ mrs x2, mdcr_el2 and x2, x2, #MDCR_EL2_HPMN_MASK - orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) +// orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) orr x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA) // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap But I get zero for the cycle count. Not sure what I'm missing. No clue tbh. Does the counter work as expected in the host? Guess not. I dropped the test into a module_init and inserted it on the host. Always get zero for pmccntr_el0 reads. Or, if I set it to something non-zero with a write, then I always get that back - no increments. pmcr_el0 looks OK... I had forgotten to set bit 31 of pmcntenset_el0, but doing that still doesn't help. Anyway, I assume the problem is me. I'll keep looking to see what I'm missing. I returned to this and see that the problem was indeed me. I needed yet another enable bit set (the filter register needed to be instructed to count cycles while in el2). I've attached the code for the curious. The numbers are mean=6999, std_dev=242. Run on the host, or in a guest running on a host without this patch series (after TVM traps have been disabled), I get a pretty consistent 40. I checked how many vm-sysreg traps we do during the kernel compile benchmark. It's 124924. So it's a bit strange that we don't see the benchmark taking 10 to 20 seconds longer on average. I should probably double check my runs. In any case, while I like the approach of this series, the overhead is looking non-negligible. Thanks a lot for producing these numbers. 125k x 7k == 1 billion cycles == 1 second on a 1 GHz machine, I think? Or am I missing something? How long does the actual compile take? -- Ard. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Tue, Feb 24, 2015 at 05:47:19PM +, Ard Biesheuvel wrote: On 24 February 2015 at 14:55, Andrew Jones drjo...@redhat.com wrote: On Fri, Feb 20, 2015 at 04:36:26PM +0100, Andrew Jones wrote: On Fri, Feb 20, 2015 at 02:37:25PM +, Ard Biesheuvel wrote: On 20 February 2015 at 14:29, Andrew Jones drjo...@redhat.com wrote: So looks like the 3 orders of magnitude greater number of traps (only to el2) don't impact kernel compiles. OK, good! That was what I was hoping for, obviously. Then I thought I'd be able to quick measure the number of cycles a trap to el2 takes with this kvm-unit-tests test int main(void) { unsigned long start, end; unsigned int sctlr; asm volatile( mrs %0, sctlr_el1\n msr pmcr_el0, %1\n : =r (sctlr) : r (5)); asm volatile( mrs %0, pmccntr_el0\n msr sctlr_el1, %2\n mrs %1, pmccntr_el0\n : =r (start), =r (end) : r (sctlr)); printf(%llx\n, end - start); return 0; } after applying this patch to kvm diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S index bb91b6fc63861..5de39d740aa58 100644 --- a/arch/arm64/kvm/hyp.S +++ b/arch/arm64/kvm/hyp.S @@ -770,7 +770,7 @@ mrs x2, mdcr_el2 and x2, x2, #MDCR_EL2_HPMN_MASK - orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) +// orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) orr x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA) // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap But I get zero for the cycle count. Not sure what I'm missing. No clue tbh. Does the counter work as expected in the host? Guess not. I dropped the test into a module_init and inserted it on the host. Always get zero for pmccntr_el0 reads. Or, if I set it to something non-zero with a write, then I always get that back - no increments. pmcr_el0 looks OK... I had forgotten to set bit 31 of pmcntenset_el0, but doing that still doesn't help. Anyway, I assume the problem is me. I'll keep looking to see what I'm missing. I returned to this and see that the problem was indeed me. I needed yet another enable bit set (the filter register needed to be instructed to count cycles while in el2). I've attached the code for the curious. The numbers are mean=6999, std_dev=242. Run on the host, or in a guest running on a host without this patch series (after TVM traps have been disabled), I get a pretty consistent 40. I checked how many vm-sysreg traps we do during the kernel compile benchmark. It's 124924. So it's a bit strange that we don't see the benchmark taking 10 to 20 seconds longer on average. I should probably double check my runs. In any case, while I like the approach of this series, the overhead is looking non-negligible. Thanks a lot for producing these numbers. 125k x 7k == 1 billion cycles == 1 second on a 1 GHz machine, I think? Or am I missing something? How long does the actual compile take? Wait, my fault. I dropped a pretty big divisor in my calculation. Don't ask... I'll just go home and study one of my daughter's math books now... So, I even have a 2.4 GHz machine, which explains why the benchmark times are the same with and without this series (those times are provided earlier in this thread, they're roughly 03:10). I'm glad you straighted me out. I was second guessing my benchmark results, and considering redoing them. Anyway, this series, at least wrt to overhead, is looking good again. Thanks, drew -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Thu, Feb 19, 2015 at 06:57:24PM +0100, Paolo Bonzini wrote: On 19/02/2015 18:55, Andrew Jones wrote: (I don't have an exact number for how many times it went to EL1 because access_mair() doesn't have a trace point.) (I got the 62873 number by testing a 3rd kernel build that only had patch 3/3 applied to the base, and counting kvm_toggle_cache events.) (The number 50 is the number of kvm_toggle_cache events *without* 3/3 applied.) I consider this bad news because, even considering it only goes to EL2, it goes a ton more than it used to. I realize patch 3/3 isn't the final plan for enabling traps though. If a full guest boots, can you try timing a kernel compile? Guests boot. I used an 8 vcpu, 14G memory guest; compiled the kernel 4 times inside the guest for each host kernel; base and mair. I dropped the time from the first run of each set, and captured the other 3. Command line used below. Time is from the Elapsed (wall clock) time (h:mm:ss or m:ss): output of /usr/bin/time - the host's wall clock. /usr/bin/time --verbose ssh $VM 'cd kernel make -s clean make -s -j8' Results: base: 3:06.11 3:07.00 3:10.93 mair: 3:08.47 3:06.75 3:04.76 So looks like the 3 orders of magnitude greater number of traps (only to el2) don't impact kernel compiles. Then I thought I'd be able to quick measure the number of cycles a trap to el2 takes with this kvm-unit-tests test int main(void) { unsigned long start, end; unsigned int sctlr; asm volatile( mrs %0, sctlr_el1\n msr pmcr_el0, %1\n : =r (sctlr) : r (5)); asm volatile( mrs %0, pmccntr_el0\n msr sctlr_el1, %2\n mrs %1, pmccntr_el0\n : =r (start), =r (end) : r (sctlr)); printf(%llx\n, end - start); return 0; } after applying this patch to kvm diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S index bb91b6fc63861..5de39d740aa58 100644 --- a/arch/arm64/kvm/hyp.S +++ b/arch/arm64/kvm/hyp.S @@ -770,7 +770,7 @@ mrs x2, mdcr_el2 and x2, x2, #MDCR_EL2_HPMN_MASK - orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) +// orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) orr x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA) // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap But I get zero for the cycle count. Not sure what I'm missing. drew -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 20 February 2015 at 14:29, Andrew Jones drjo...@redhat.com wrote: On Thu, Feb 19, 2015 at 06:57:24PM +0100, Paolo Bonzini wrote: On 19/02/2015 18:55, Andrew Jones wrote: (I don't have an exact number for how many times it went to EL1 because access_mair() doesn't have a trace point.) (I got the 62873 number by testing a 3rd kernel build that only had patch 3/3 applied to the base, and counting kvm_toggle_cache events.) (The number 50 is the number of kvm_toggle_cache events *without* 3/3 applied.) I consider this bad news because, even considering it only goes to EL2, it goes a ton more than it used to. I realize patch 3/3 isn't the final plan for enabling traps though. If a full guest boots, can you try timing a kernel compile? Guests boot. I used an 8 vcpu, 14G memory guest; compiled the kernel 4 times inside the guest for each host kernel; base and mair. I dropped the time from the first run of each set, and captured the other 3. Command line used below. Time is from the Elapsed (wall clock) time (h:mm:ss or m:ss): output of /usr/bin/time - the host's wall clock. /usr/bin/time --verbose ssh $VM 'cd kernel make -s clean make -s -j8' Results: base: 3:06.11 3:07.00 3:10.93 mair: 3:08.47 3:06.75 3:04.76 So looks like the 3 orders of magnitude greater number of traps (only to el2) don't impact kernel compiles. OK, good! That was what I was hoping for, obviously. Then I thought I'd be able to quick measure the number of cycles a trap to el2 takes with this kvm-unit-tests test int main(void) { unsigned long start, end; unsigned int sctlr; asm volatile( mrs %0, sctlr_el1\n msr pmcr_el0, %1\n : =r (sctlr) : r (5)); asm volatile( mrs %0, pmccntr_el0\n msr sctlr_el1, %2\n mrs %1, pmccntr_el0\n : =r (start), =r (end) : r (sctlr)); printf(%llx\n, end - start); return 0; } after applying this patch to kvm diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S index bb91b6fc63861..5de39d740aa58 100644 --- a/arch/arm64/kvm/hyp.S +++ b/arch/arm64/kvm/hyp.S @@ -770,7 +770,7 @@ mrs x2, mdcr_el2 and x2, x2, #MDCR_EL2_HPMN_MASK - orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) +// orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) orr x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA) // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap But I get zero for the cycle count. Not sure what I'm missing. No clue tbh. Does the counter work as expected in the host? -- Ard. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Fri, Feb 20, 2015 at 02:37:25PM +, Ard Biesheuvel wrote: On 20 February 2015 at 14:29, Andrew Jones drjo...@redhat.com wrote: So looks like the 3 orders of magnitude greater number of traps (only to el2) don't impact kernel compiles. OK, good! That was what I was hoping for, obviously. Then I thought I'd be able to quick measure the number of cycles a trap to el2 takes with this kvm-unit-tests test int main(void) { unsigned long start, end; unsigned int sctlr; asm volatile( mrs %0, sctlr_el1\n msr pmcr_el0, %1\n : =r (sctlr) : r (5)); asm volatile( mrs %0, pmccntr_el0\n msr sctlr_el1, %2\n mrs %1, pmccntr_el0\n : =r (start), =r (end) : r (sctlr)); printf(%llx\n, end - start); return 0; } after applying this patch to kvm diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S index bb91b6fc63861..5de39d740aa58 100644 --- a/arch/arm64/kvm/hyp.S +++ b/arch/arm64/kvm/hyp.S @@ -770,7 +770,7 @@ mrs x2, mdcr_el2 and x2, x2, #MDCR_EL2_HPMN_MASK - orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) +// orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR) orr x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA) // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap But I get zero for the cycle count. Not sure what I'm missing. No clue tbh. Does the counter work as expected in the host? Guess not. I dropped the test into a module_init and inserted it on the host. Always get zero for pmccntr_el0 reads. Or, if I set it to something non-zero with a write, then I always get that back - no increments. pmcr_el0 looks OK... I had forgotten to set bit 31 of pmcntenset_el0, but doing that still doesn't help. Anyway, I assume the problem is me. I'll keep looking to see what I'm missing. drew -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
This is a 0th order approximation of how we could potentially force the guest to avoid uncached mappings, at least from the moment the MMU is on. (Before that, all of memory is implicitly classified as Device-nGnRnE) The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings with cached ones. This way, there is no need to mangle any guest page tables. The downside is that, to do this correctly, we need to always trap writes to the VM sysreg group, which includes registers that the guest may write to very often. To reduce the associated performance hit, patch #1 introduces a fast path for EL2 to perform trivial sysreg writes on behalf of the guest, without the need for a full world switch to the host and back. The main purpose of these patches is to quantify the performance hit, and verify whether the MAIR_EL1 handling works correctly. Ard Biesheuvel (3): arm64: KVM: handle some sysreg writes in EL2 arm64: KVM: mangle MAIR register to prevent uncached guest mappings arm64: KVM: keep trapping of VM sysreg writes enabled arch/arm/kvm/mmu.c | 2 +- arch/arm64/include/asm/kvm_arm.h | 2 +- arch/arm64/kvm/hyp.S | 101 +++ arch/arm64/kvm/sys_regs.c| 63 4 files changed, 156 insertions(+), 12 deletions(-) -- 1.8.3.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
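For background, the "trap writes to the VM sysreg group" mechanism is the architectural HCR_EL2.TVM bit: while it is set, guest writes to the virtual memory control registers (SCTLR_EL1, MAIR_EL1, TTBR0/1_EL1, ...) trap to EL2. A minimal sketch of turning it on, not the code from the series:

#define HCR_TVM		(1UL << 26)	/* trap guest writes to VM control regs */

static inline void enable_vm_sysreg_trapping(void)
{
	unsigned long hcr;

	asm volatile("mrs %0, hcr_el2" : "=r" (hcr));
	hcr |= HCR_TVM;
	asm volatile("msr hcr_el2, %0\n\tisb" : : "r" (hcr) : "memory");
}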
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 19 February 2015 at 15:27, Alexander Graf ag...@suse.de wrote: On 19.02.15 15:56, Ard Biesheuvel wrote: On 19 February 2015 at 14:50, Alexander Graf ag...@suse.de wrote: On 19.02.15 11:54, Ard Biesheuvel wrote: This is a 0th order approximation of how we could potentially force the guest to avoid uncached mappings, at least from the moment the MMU is on. (Before that, all of memory is implicitly classified as Device-nGnRnE) The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings with cached ones. This way, there is no need to mangle any guest page tables. Would you mind to give a brief explanation on what this does? What happens to actually assigned devices that need to be mapped as uncached? What happens to DMA from such devices when the guest assumes that it's accessing RAM uncached and then triggers DMA? On ARM, stage 2 mappings that are more strict will supersede stage 1 mappings, so the idea is to use cached mappings exclusively for stage 1 so that the host is fully in control of the actual memory attributes by setting the attributes at stage 2. This also makes sense because the host will ultimately know better whether some range that the guest thinks is a device is actually a device or just emulated (no stage 2 mapping), backed by host memory (such as the NOR flash read case) or backed by a passthrough device. Ok, so that means if the guest maps RAM as uncached, it will actually end up as cached memory. Now if the guest triggers a DMA request to a passed through device to that RAM, it will conflict with the cache. I don't know whether it's a big deal, but it's the scenario that came up with the approach above before when I talked to people about it. Well, I am using write-through read+write allocate, which hopefully means that the actual RAM is kept in sync with the cache, but I must confess I am a bit out of my depth here with the fine print in the ARM ARM. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
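To spell out the combining rule being relied on here: when stage 1 and stage 2 both assign a memory type, the result is the more restrictive of the two (Device, then Normal Non-cacheable, then Write-Through, then Write-Back), so stage 2 can downgrade a cacheable stage 1 mapping but never upgrade a non-cacheable one. A conceptual model, with made-up names rather than architectural encodings:

/* Conceptual only: ordering from most to least restrictive. */
enum mem_attr { ATTR_DEVICE, ATTR_NORMAL_NC, ATTR_NORMAL_WT, ATTR_NORMAL_WB };

static enum mem_attr combine_s1_s2(enum mem_attr s1, enum mem_attr s2)
{
	/* the stricter of the two attributes wins */
	return s1 < s2 ? s1 : s2;
}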
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 19.02.15 15:56, Ard Biesheuvel wrote: On 19 February 2015 at 14:50, Alexander Graf ag...@suse.de wrote: On 19.02.15 11:54, Ard Biesheuvel wrote: This is a 0th order approximation of how we could potentially force the guest to avoid uncached mappings, at least from the moment the MMU is on. (Before that, all of memory is implicitly classified as Device-nGnRnE) The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings with cached ones. This way, there is no need to mangle any guest page tables. Would you mind to give a brief explanation on what this does? What happens to actually assigned devices that need to be mapped as uncached? What happens to DMA from such devices when the guest assumes that it's accessing RAM uncached and then triggers DMA? On ARM, stage 2 mappings that are more strict will supersede stage 1 mappings, so the idea is to use cached mappings exclusively for stage 1 so that the host is fully in control of the actual memory attributes by setting the attributes at stage 2. This also makes sense because the host will ultimately know better whether some range that the guest thinks is a device is actually a device or just emulated (no stage 2 mapping), backed by host memory (such as the NOR flash read case) or backed by a passthrough device. Ok, so that means if the guest maps RAM as uncached, it will actually end up as cached memory. Now if the guest triggers a DMA request to a passed through device to that RAM, it will conflict with the cache. I don't know whether it's a big deal, but it's the scenario that came up with the approach above before when I talked to people about it. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Thu, Feb 19, 2015 at 05:19:35PM +, Ard Biesheuvel wrote: On 19 February 2015 at 16:57, Andrew Jones drjo...@redhat.com wrote: On Thu, Feb 19, 2015 at 10:54:43AM +, Ard Biesheuvel wrote: This is a 0th order approximation of how we could potentially force the guest to avoid uncached mappings, at least from the moment the MMU is on. (Before that, all of memory is implicitly classified as Device-nGnRnE) The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings with cached ones. This way, there is no need to mangle any guest page tables. The downside is that, to do this correctly, we need to always trap writes to the VM sysreg group, which includes registers that the guest may write to very often. To reduce the associated performance hit, patch #1 introduces a fast path for EL2 to perform trivial sysreg writes on behalf of the guest, without the need for a full world switch to the host and back. The main purpose of these patches is to quantify the performance hit, and verify whether the MAIR_EL1 handling works correctly. Ard Biesheuvel (3): arm64: KVM: handle some sysreg writes in EL2 arm64: KVM: mangle MAIR register to prevent uncached guest mappings arm64: KVM: keep trapping of VM sysreg writes enabled Hi Ard, I took this series for test drive. Unfortunately I have bad news and worse news. First, a description of the test; simply boot a guest, once at login, login, and then shutdown with 'poweroff'. The guest boots through AAVMF using a build from Laszlo that enables PCI, but does *not* have the 'map pci mmio as cached' kludge. This test allows us to check for corrupt vram on the graphical console, plus it completes a boot/shutdown cycle allowing us to count sysreg traps of the boot/shutdown cycle. Thanks a lot for giving this a spin right away! So, the bad news Before this series we trapped 50 times on sysreg writes with the test described above. With this series we trap 62873 times. But, less than 20 required going to EL1. OK, this is very useful information. We still don't know what the penalty is of all those traps, but that's quite a big number indeed. (I don't have an exact number for how many times it went to EL1 because access_mair() doesn't have a trace point.) (I got the 62873 number by testing a 3rd kernel build that only had patch 3/3 applied to the base, and counting kvm_toggle_cache events.) (The number 50 is the number of kvm_toggle_cache events *without* 3/3 applied.) I consider this bad news because, even considering it only goes to EL2, it goes a ton more than it used to. I realize patch 3/3 isn't the final plan for enabling traps though. And, now the worse news The vram corruption persists with this patch series. OK, so the primary difference is that I am not substituting for write back mappings, as Laszlo is doing in his patch. If you have energy left, would you mind having another go but use 0xff (not 0xbb) for the MAIR values in patch #2? Yup, a bit energy left, and, yup, 0xff fixes it. Thanks, drew arch/arm/kvm/mmu.c | 2 +- arch/arm64/include/asm/kvm_arm.h | 2 +- arch/arm64/kvm/hyp.S | 101 +++ arch/arm64/kvm/sys_regs.c| 63 4 files changed, 156 insertions(+), 12 deletions(-) -- 1.8.3.2 ___ kvmarm mailing list kvm...@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 19 feb. 2015, at 17:55, Andrew Jones drjo...@redhat.com wrote: On Thu, Feb 19, 2015 at 05:19:35PM +, Ard Biesheuvel wrote: On 19 February 2015 at 16:57, Andrew Jones drjo...@redhat.com wrote: On Thu, Feb 19, 2015 at 10:54:43AM +, Ard Biesheuvel wrote: This is a 0th order approximation of how we could potentially force the guest to avoid uncached mappings, at least from the moment the MMU is on. (Before that, all of memory is implicitly classified as Device-nGnRnE) The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings with cached ones. This way, there is no need to mangle any guest page tables. The downside is that, to do this correctly, we need to always trap writes to the VM sysreg group, which includes registers that the guest may write to very often. To reduce the associated performance hit, patch #1 introduces a fast path for EL2 to perform trivial sysreg writes on behalf of the guest, without the need for a full world switch to the host and back. The main purpose of these patches is to quantify the performance hit, and verify whether the MAIR_EL1 handling works correctly. Ard Biesheuvel (3): arm64: KVM: handle some sysreg writes in EL2 arm64: KVM: mangle MAIR register to prevent uncached guest mappings arm64: KVM: keep trapping of VM sysreg writes enabled Hi Ard, I took this series for test drive. Unfortunately I have bad news and worse news. First, a description of the test; simply boot a guest, once at login, login, and then shutdown with 'poweroff'. The guest boots through AAVMF using a build from Laszlo that enables PCI, but does *not* have the 'map pci mmio as cached' kludge. This test allows us to check for corrupt vram on the graphical console, plus it completes a boot/shutdown cycle allowing us to count sysreg traps of the boot/shutdown cycle. Thanks a lot for giving this a spin right away! So, the bad news Before this series we trapped 50 times on sysreg writes with the test described above. With this series we trap 62873 times. But, less than 20 required going to EL1. OK, this is very useful information. We still don't know what the penalty is of all those traps, but that's quite a big number indeed. (I don't have an exact number for how many times it went to EL1 because access_mair() doesn't have a trace point.) (I got the 62873 number by testing a 3rd kernel build that only had patch 3/3 applied to the base, and counting kvm_toggle_cache events.) (The number 50 is the number of kvm_toggle_cache events *without* 3/3 applied.) I consider this bad news because, even considering it only goes to EL2, it goes a ton more than it used to. I realize patch 3/3 isn't the final plan for enabling traps though. And, now the worse news The vram corruption persists with this patch series. OK, so the primary difference is that I am not substituting for write back mappings, as Laszlo is doing in his patch. If you have energy left, would you mind having another go but use 0xff (not 0xbb) for the MAIR values in patch #2? Yup, a bit energy left, and, yup, 0xff fixes it ok so that means we'd need to map as writeback cacheable by default, and restrict it as necessary at stage 2. 
thanks
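A minimal sketch of what that split could look like, assuming the guest's stage 1 attributes are steered to Normal Write-Back and the host picks the stage 2 attribute per region. The region types, helper name, and constant names are made up for illustration and are not from these patches:

#include <stdint.h>

/* Stage 2 MemAttr[3:0] encodings (ARMv8); names are illustrative. */
#define S2_MEMATTR_DEVICE_nGnRE  0x1   /* MemAttr[3:2]=00 (Device), MemAttr[1:0]=01 (nGnRE) */
#define S2_MEMATTR_NORMAL_WB     0xf   /* MemAttr[3:2]=11, MemAttr[1:0]=11 (Normal write-back) */

enum region_backing {
	REGION_GUEST_RAM,        /* ordinary RAM, kept coherent with the host */
	REGION_PASSTHROUGH_DEV,  /* real device BAR assigned to the guest */
};

/* Pick the MemAttr field for a stage 2 block/page descriptor. */
static uint8_t stage2_memattr(enum region_backing type)
{
	switch (type) {
	case REGION_PASSTHROUGH_DEV:
		/* Restrict to Device semantics regardless of the guest's stage 1. */
		return S2_MEMATTR_DEVICE_nGnRE;
	case REGION_GUEST_RAM:
	default:
		/* Leave the guest's Normal Write-Back stage 1 attribute in effect. */
		return S2_MEMATTR_NORMAL_WB;
	}
}

Emulated MMIO would not appear here at all: with no stage 2 mapping in place, every access traps to the host, so the attribute question does not arise.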
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 19 February 2015 at 16:57, Andrew Jones drjo...@redhat.com wrote:

Hi Ard,

I took this series for a test drive. Unfortunately I have bad news and worse news. First, a description of the test: simply boot a guest; once at login, log in, and then shut down with 'poweroff'. The guest boots through AAVMF using a build from Laszlo that enables PCI, but does *not* have the 'map pci mmio as cached' kludge. This test allows us to check for corrupt vram on the graphical console, plus it completes a boot/shutdown cycle, allowing us to count the sysreg traps of that cycle.

Thanks a lot for giving this a spin right away!

So, the bad news: before this series we trapped 50 times on sysreg writes with the test described above. With this series we trap 62873 times, but fewer than 20 required going to EL1.

OK, this is very useful information. We still don't know what the penalty of all those traps is, but that's quite a big number indeed.

(I don't have an exact number for how many times it went to EL1 because access_mair() doesn't have a trace point.) (I got the 62873 number by testing a 3rd kernel build that only had patch 3/3 applied to the base, and counting kvm_toggle_cache events.) (The number 50 is the number of kvm_toggle_cache events *without* 3/3 applied.)

I consider this bad news because, even though it only goes to EL2, it traps a ton more than it used to. I realize patch 3/3 isn't the final plan for enabling traps, though.

And now the worse news: the vram corruption persists with this patch series.

OK, so the primary difference is that I am not substituting write-back mappings, as Laszlo does in his patch. If you have energy left, would you mind having another go, but with 0xff (not 0xbb) for the MAIR values in patch #2?
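To make the mechanism of patch #2 concrete, here is a minimal sketch, under the assumption that the substituted value is the write-back encoding (0xff) discussed above, of what rewriting a trapped MAIR_EL1 value could look like. The function name and structure are assumptions for illustration, not the code from sys_regs.c:

#include <stdint.h>

#define MAIR_ATTR_NORMAL_WB	0xffULL		/* Normal, inner/outer write-back */

/* Rewrite any uncacheable attribute bytes in a MAIR_EL1 value being written. */
static uint64_t mangle_mair(uint64_t val)
{
	int i;

	for (i = 0; i < 8; i++) {
		uint64_t attr = (val >> (i * 8)) & 0xff;

		/*
		 * Device types have the high nibble clear; 0x44 is Normal
		 * Non-cacheable. Both defeat coherency with the host's
		 * cacheable mapping, so upgrade them to Normal Write-Back.
		 */
		if ((attr & 0xf0) == 0 || attr == 0x44) {
			val &= ~(0xffULL << (i * 8));
			val |= MAIR_ATTR_NORMAL_WB << (i * 8);
		}
	}
	return val;
}

In the series as posted, MAIR_EL1 writes that are not handled in the EL2 fast path land in access_mair() at EL1, which is why the exact EL1 count above could not be measured without a trace point.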
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 19.02.15 11:54, Ard Biesheuvel wrote:

This is a 0th order approximation of how we could potentially force the guest to avoid uncached mappings, at least from the moment the MMU is on. (Before that, all of memory is implicitly classified as Device-nGnRnE.) The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings with cached ones. This way, there is no need to mangle any guest page tables.

Would you mind giving a brief explanation of what this does? What happens to actually assigned devices that need to be mapped as uncached? What happens to DMA from such devices when the guest assumes that it's accessing RAM uncached and then triggers DMA?

Alex
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 19 February 2015 at 14:50, Alexander Graf ag...@suse.de wrote:

On 19.02.15 11:54, Ard Biesheuvel wrote:

This is a 0th order approximation of how we could potentially force the guest to avoid uncached mappings, at least from the moment the MMU is on. (Before that, all of memory is implicitly classified as Device-nGnRnE.) The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings with cached ones. This way, there is no need to mangle any guest page tables.

Would you mind giving a brief explanation of what this does? What happens to actually assigned devices that need to be mapped as uncached? What happens to DMA from such devices when the guest assumes that it's accessing RAM uncached and then triggers DMA?

On ARM, stage 2 mappings that are more strict will supersede stage 1 mappings, so the idea is to use cached mappings exclusively for stage 1, so that the host is fully in control of the actual memory attributes by setting the attributes at stage 2. This also makes sense because the host will ultimately know better whether some range that the guest thinks is a device is actually a device, or just emulated (no stage 2 mapping), backed by host memory (such as the NOR flash read case), or backed by a passthrough device.

--
Ard.
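The "more strict supersedes" rule can be summarized as: for any given access, the effective memory type is the weaker (less cacheable) of the stage 1 and stage 2 attributes. A toy illustration of that combining rule, not kernel code:

/* Smaller value = stricter: Device < Non-cacheable < Write-Through < Write-Back. */
enum mem_type { MT_DEVICE, MT_NORMAL_NC, MT_NORMAL_WT, MT_NORMAL_WB };

/* Effective type seen by the access is the stricter of the two stages. */
static enum mem_type effective_type(enum mem_type stage1, enum mem_type stage2)
{
	return stage1 < stage2 ? stage1 : stage2;
}

So with stage 1 pinned to Normal Write-Back, the host keeps the final say through stage 2, while the reverse direction, relaxing a stricter guest attribute from stage 2, is not possible.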
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On Thu, Feb 19, 2015 at 10:54:43AM +, Ard Biesheuvel wrote:

This is a 0th order approximation of how we could potentially force the guest to avoid uncached mappings, at least from the moment the MMU is on. (Before that, all of memory is implicitly classified as Device-nGnRnE.)

The idea (patch #2) is to trap writes to MAIR_EL1, and replace uncached mappings with cached ones. This way, there is no need to mangle any guest page tables. The downside is that, to do this correctly, we need to always trap writes to the VM sysreg group, which includes registers that the guest may write to very often. To reduce the associated performance hit, patch #1 introduces a fast path for EL2 to perform trivial sysreg writes on behalf of the guest, without the need for a full world switch to the host and back.

The main purpose of these patches is to quantify the performance hit, and verify whether the MAIR_EL1 handling works correctly.

Ard Biesheuvel (3):
  arm64: KVM: handle some sysreg writes in EL2
  arm64: KVM: mangle MAIR register to prevent uncached guest mappings
  arm64: KVM: keep trapping of VM sysreg writes enabled

Hi Ard,

I took this series for a test drive. Unfortunately I have bad news and worse news. First, a description of the test: simply boot a guest; once at login, log in, and then shut down with 'poweroff'. The guest boots through AAVMF using a build from Laszlo that enables PCI, but does *not* have the 'map pci mmio as cached' kludge. This test allows us to check for corrupt vram on the graphical console, plus it completes a boot/shutdown cycle, allowing us to count the sysreg traps of that cycle.

So, the bad news: before this series we trapped 50 times on sysreg writes with the test described above. With this series we trap 62873 times, but fewer than 20 required going to EL1. (I don't have an exact number for how many times it went to EL1 because access_mair() doesn't have a trace point.) (I got the 62873 number by testing a 3rd kernel build that only had patch 3/3 applied to the base, and counting kvm_toggle_cache events.) (The number 50 is the number of kvm_toggle_cache events *without* 3/3 applied.)

I consider this bad news because, even though it only goes to EL2, it traps a ton more than it used to. I realize patch 3/3 isn't the final plan for enabling traps, though.

And now the worse news: the vram corruption persists with this patch series.

drew

 arch/arm/kvm/mmu.c               |   2 +-
 arch/arm64/include/asm/kvm_arm.h |   2 +-
 arch/arm64/kvm/hyp.S             | 101 +++
 arch/arm64/kvm/sys_regs.c        |  63 ++
 4 files changed, 156 insertions(+), 12 deletions(-)
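Since the missing access_mair() trace point comes up repeatedly in this thread, here is a rough sketch of the kind of event that could be added to the KVM trace header so the EL1-handled MAIR writes could be counted with the usual trace tooling. The event name and fields are made up, and the TRACE_SYSTEM boilerplate of that header is omitted:

#include <linux/tracepoint.h>

TRACE_EVENT(kvm_mair_write,
	TP_PROTO(unsigned long vcpu_pc, unsigned long long mair),
	TP_ARGS(vcpu_pc, mair),

	TP_STRUCT__entry(
		__field(unsigned long, vcpu_pc)
		__field(unsigned long long, mair)
	),

	TP_fast_assign(
		__entry->vcpu_pc = vcpu_pc;
		__entry->mair = mair;
	),

	TP_printk("PC: 0x%08lx MAIR_EL1: 0x%016llx",
		  __entry->vcpu_pc, __entry->mair)
);

access_mair() would then call the generated trace_kvm_mair_write() with the guest PC and the value being written, and the event count would give the exact number of MAIR writes that took the full exit to EL1.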
Re: [RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings
On 19/02/2015 18:55, Andrew Jones wrote:

(I don't have an exact number for how many times it went to EL1 because access_mair() doesn't have a trace point.) (I got the 62873 number by testing a 3rd kernel build that only had patch 3/3 applied to the base, and counting kvm_toggle_cache events.) (The number 50 is the number of kvm_toggle_cache events *without* 3/3 applied.)

I consider this bad news because, even though it only goes to EL2, it traps a ton more than it used to. I realize patch 3/3 isn't the final plan for enabling traps, though.

If a full guest boots, can you try timing a kernel compile?

Paolo