Re: [PATCH v5 2/4] live migration support for initial write protect of VM

2014-05-30 Thread Mario Smarduch

  
 +static inline void kvm_set_s2pte_readonly(pte_t *pte)
 +{
 +pte_val(*pte) &= ~(L_PTE_S2_RDONLY ^ L_PTE_S2_RDWR);
 
 This relies on the pte already having been set as RDONLY or RDWR, if you
 are creating a new pte and calling this function it could be easy to
 miss that distinction, I would prefer:
 
 pte_val(*pte) &= L_PTE_S2_RDWR;
 pte_val(*pte) |= L_PTE_S2_RDONLY;
 

I'm confused by this comment: the suggestion appears to just add the
read-only permission, but won't it leave the other permission bits
intact while clearing out the rest of the pte?
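
(For reference, a standalone sketch of how the two variants behave. The bit
values below follow the ARMv7 LPAE stage-2 HAP field and are assumptions for
illustration, not values copied from the patch.)

#include <stdint.h>

#define L_PTE_S2_RDONLY  (1ULL << 6)   /* HAP[1]: read access             */
#define L_PTE_S2_RDWR    (3ULL << 6)   /* HAP[2:1]: read and write access */

/* v5 version: RDONLY ^ RDWR is just the write bit, so this only clears
 * write permission and relies on read permission already being present. */
static inline void s2pte_readonly_v5(uint64_t *pte)
{
        *pte &= ~(L_PTE_S2_RDONLY ^ L_PTE_S2_RDWR);
}

/* Suggested version as quoted above: the first line keeps only the HAP
 * bits (clearing every other attribute bit), the second adds read
 * permission -- which is what the question above is pointing at. */
static inline void s2pte_readonly_suggested(uint64_t *pte)
{
        *pte &= L_PTE_S2_RDWR;
        *pte |= L_PTE_S2_RDONLY;
}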

- Mario



Re: [PATCH v5 2/4] live migration support for initial write protect of VM

2014-05-19 Thread Christoffer Dall
On Fri, May 16, 2014 at 02:39:16PM -0700, Mario Smarduch wrote:
 Hi Christoffer,
   few more comments
 struct vgic_dist vgic;
  + /* Marks start of migration, used to handle 2nd stage page faults
  +  * during migration, prevent installing huge pages and split huge pages
  +  * to small pages.
  +  */
 
  commenting style
 
  this is a bit verbose for a field in a struct, perhaps moving the longer
  version to where you set this?
  Will do.
 
  + int migration_in_progress;
   };
 
 I think this flag could be removed altogether. Migration can be stopped
 (or started) at any time, through a user request or other events. When
 that happens (e.g. migrate_cancel), the migrate cleanup bh runs and
 eventually calls the KVM memory listener kvm_log_global_start() (cancel
 handler), which stops logging, clears KVM_MEM_LOG_DIRTY_PAGES, and,
 through the region ops ioctl, clears dirty_bitmap. In either case the
 memslot's dirty_bitmap is set or unset during migration to track dirty
 pages, so following that field seems to be a better way to keep track of
 migration. This again is the QEMU view, but it appears all these policies
 are driven from user space.
 

ok, I need to look more closely at the whole thing to properly comment
on this.

 
 
   
  +/* kvm_split_pmd - splits huge pages to small pages, required to keep a 
  dirty
  + *   log of smaller memory granules, otherwise huge pages would need 
  to be
  + *   migrated. Practically an idle system has problems migrating with
 
  This seems abrupt.  Why can't we just represent a 2M huge page as 512 4K
  bits and write protect the huge pages, if you take a write fault on a 2M
  page, then split it then.
  
  That's one alternative; the one I put into v6 is to clear the PMD and
  force user_mem_abort() to fault in 4k pages, mark the dirty_bitmap[]
  for that page, and reuse the current code. I have not checked the impact
  on performance; it takes a few seconds longer to converge for the tests
  I'm running.
 
 I was thinking about this, and if PMD attributes need to be passed on to
 the PTEs then it appears what you recommend is required. But at run time
 I don't see how 2nd stage attributes can change; could the guest do
 anything to change them (SH, Memattr)?

You should be able to just grab the kvm_mmu lock, update the stage-2
page tables to remove all writable bits, flush all Stage-2 TLBs for that
VMID, and you should be all set.
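
(A minimal sketch of that sequence, assuming a stage2_wp_range()-style helper
that clears the writable bits; apart from kvm_tlb_flush_vmid(), which the
quoted patch declares, the names below are assumptions, not the series' code.)

static void kvm_mmu_wp_memslot(struct kvm *kvm, struct kvm_memory_slot *memslot)
{
        phys_addr_t start = memslot->base_gfn << PAGE_SHIFT;
        phys_addr_t end   = (memslot->base_gfn + memslot->npages) << PAGE_SHIFT;

        spin_lock(&kvm->mmu_lock);
        stage2_wp_range(kvm, start, end);  /* clear write bits in stage-2 tables */
        spin_unlock(&kvm->mmu_lock);

        kvm_tlb_flush_vmid(kvm);           /* flush stage-2 TLBs for this VMID */
}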

 
 
 Performance may also be another reason, but that always depends on the
 load; clearing a PMD seems easier and reuses the current code. Several
 load tests/benchmarks could probably help here. I also noticed the hw
 PMD/PTE attributes differ a little, which is not significant now, but
 moving forward different page sizes and any new revisions to the fields
 may require additional maintenance.

I think clearing out all PMD mappings will carry a significant
performance degradation on the source VM, and in the case where you keep
it running, that will be quite unfortunate.  Hint: page faults are
expensive, and huge pages have been shown to give about a 10-15%
performance increase on ARMv7 for CPU/memory intensive benchmarks.

 
 I'll be out next week and back on the 26th; I'll create a link with
 details on the test environment and tests. The cover letter will go
 through a general overview only.
 

ok, I have some time then.

-Christoffer

 
  
 
  If your use case is HA, then you will be doing this a lot, and you don't
  want to hurt performance of your main live system more than necessary.
  
 
  + *   huge pages.  Called during WP of entire VM address space, done
 
 
 


Re: [PATCH v5 2/4] live migration support for initial write protect of VM

2014-05-16 Thread Mario Smarduch
Hi Christoffer,
  few more comments
 struct vgic_dist vgic;
 +   /* Marks start of migration, used to handle 2nd stage page faults
 +* during migration, prevent installing huge pages and split huge pages
 +* to small pages.
 +*/

 commenting style

 this is a bit verbose for a field in a struct, perhaps moving the longer
 version to where you set this?
 Will do.

 +   int migration_in_progress;
  };

I think this flag could be removed altogether. Migration can be stopped
(or started) at any time, through a user request or other events. When
that happens (e.g. migrate_cancel), the migrate cleanup bh runs and
eventually calls the KVM memory listener kvm_log_global_start() (cancel
handler), which stops logging, clears KVM_MEM_LOG_DIRTY_PAGES, and,
through the region ops ioctl, clears dirty_bitmap. In either case the
memslot's dirty_bitmap is set or unset during migration to track dirty
pages, so following that field seems to be a better way to keep track of
migration. This again is the QEMU view, but it appears all these policies
are driven from user space.
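
(For illustration, a sketch of what following that field instead of a flag
could look like; the helper below and the way it might be used are
assumptions, not code from this series.)

/* dirty_bitmap is allocated only while KVM_MEM_LOG_DIRTY_PAGES is set for
 * the slot, so it can double as a "logging/migration active" indicator. */
static bool memslot_is_logged(struct kvm_memory_slot *memslot)
{
        return memslot && memslot->dirty_bitmap != NULL;
}

/* e.g. in user_mem_abort(), instead of testing kvm->arch.migration_in_progress:
 *
 *      if (memslot_is_logged(memslot))
 *              force_pte = true;
 */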



  
 +/* kvm_split_pmd - splits huge pages to small pages, required to keep a 
 dirty
 + * log of smaller memory granules, otherwise huge pages would need to be
 + * migrated. Practically an idle system has problems migrating with

 This seems abrupt.  Why can't we just represent a 2M huge page as 512 4K
 bits and write protect the huge pages, if you take a write fault on a 2M
 page, then split it then.
 
 That's one alternative; the one I put into v6 is to clear the PMD and
 force user_mem_abort() to fault in 4k pages, mark the dirty_bitmap[]
 for that page, and reuse the current code. I have not checked the impact
 on performance; it takes a few seconds longer to converge for the tests
 I'm running.
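
(A rough sketch of that clear-the-PMD approach; apart from
kvm_tlb_flush_vmid() from the quoted patch, the helper and macro names here
are assumptions, not the v6 code.)

/* When a fault hits an existing huge (PMD) mapping while logging is active,
 * drop the PMD so later faults are handled with 4k ptes that can be recorded
 * in dirty_bitmap[]. */
static void stage2_dissolve_pmd(struct kvm *kvm, pmd_t *pmd)
{
        if (!kvm_pmd_huge(*pmd))
                return;

        pmd_clear(pmd);                 /* unmap the 2M block mapping         */
        kvm_tlb_flush_vmid(kvm);        /* drop the stale huge-page TLB entry */
        /* user_mem_abort() now faults the range back in as 4k pages */
}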

I was thinking about this, and if PMD attributes need to be passed on to
the PTEs then it appears what you recommend is required. But at run time
I don't see how 2nd stage attributes can change; could the guest do
anything to change them (SH, Memattr)?


Performance may also be another reason, but that always depends on the
load; clearing a PMD seems easier and reuses the current code. Several
load tests/benchmarks could probably help here. I also noticed the hw
PMD/PTE attributes differ a little, which is not significant now, but
moving forward different page sizes and any new revisions to the fields
may require additional maintenance.

I'll be out next week and back on the 26th; I'll create a link with
details on the test environment and tests. The cover letter will go
through a general overview only.

Thanks,
  Mario

 

 If your use case is HA, then you will be doing this a lot, and you don't
 want to hurt performance of your main live system more than necessary.
 

 + * huge pages.  Called during WP of entire VM address space, done





Re: [PATCH v5 2/4] live migration support for initial write protect of VM

2014-05-15 Thread Christoffer Dall

[I know you sent out a newer version but I already reviewed some of this
patch on the plane today but couldn't send it out before I got home.
Anyway, here it is:]

On Wed, May 07, 2014 at 05:40:14PM -0700, Mario Smarduch wrote:
 Patch adds support for live migration initial split up of huge pages
 in memory slot and write protection of all pages in memory slot.
 
 Signed-off-by: Mario Smarduch m.smard...@samsung.com
 ---
  arch/arm/include/asm/kvm_host.h |7 ++
  arch/arm/include/asm/kvm_mmu.h  |   16 +++-
  arch/arm/kvm/arm.c  |3 +
  arch/arm/kvm/mmu.c  |  179 
 +++
  virt/kvm/kvm_main.c |6 +-
  5 files changed, 209 insertions(+), 2 deletions(-)
 
 diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
 index ac3bb65..91744c3 100644
 --- a/arch/arm/include/asm/kvm_host.h
 +++ b/arch/arm/include/asm/kvm_host.h
 @@ -67,6 +67,11 @@ struct kvm_arch {
  
   /* Interrupt controller */
  struct vgic_dist vgic;
 + /* Marks start of migration, used to handle 2nd stage page faults
 +  * during migration, prevent installing huge pages and split huge pages
 +  * to small pages.
 +  */

commenting style

this is a bit verbose for a field in a struct, perhaps moving the longer
version to where you set this?

 + int migration_in_progress;
  };
  
  #define KVM_NR_MEM_OBJS 40
 @@ -233,4 +238,6 @@ int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, 
 u64 value);
  
  void kvm_tlb_flush_vmid(struct kvm *kvm);
  
 +int kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
 +
  #endif /* __ARM_KVM_HOST_H__ */
 diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
 index 5c7aa3c..b339fa9 100644
 --- a/arch/arm/include/asm/kvm_mmu.h
 +++ b/arch/arm/include/asm/kvm_mmu.h
 @@ -114,13 +114,27 @@ static inline void kvm_set_s2pmd_writable(pmd_t *pmd)
   pmd_val(*pmd) |= L_PMD_S2_RDWR;
  }
  
 +static inline void kvm_set_s2pte_readonly(pte_t *pte)
 +{
 + pte_val(*pte) &= ~(L_PTE_S2_RDONLY ^ L_PTE_S2_RDWR);

This relies on the pte already having been set as RDONLY or RDWR, if you
are creating a new pte and calling this function it could be easy to
miss that distinction, I would prefer:

pte_val(*pte) &= L_PTE_S2_RDWR;
pte_val(*pte) |= L_PTE_S2_RDONLY;

 +}
 +
 +static inline bool kvm_s2pte_readonly(pte_t *pte)
 +{
 + return (pte_val(*pte) & L_PTE_S2_RDWR) == L_PTE_S2_RDONLY;
 +}
 +
  /* Open coded p*d_addr_end that can deal with 64bit addresses */
  #define kvm_pgd_addr_end(addr, end)  \
 ({   u64 __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;\
  (__boundary - 1 < (end) - 1)? __boundary: (end);\
  })
  
 -#define kvm_pud_addr_end(addr,end)   (end)
 +/* For 4-level table walk return PUD range end if end > 1GB */

not sure you need this comment, the scheme is very common all over the
kernel.

 +#define kvm_pud_addr_end(addr, end)  \
 +({   u64 __boundary = ((addr) + PUD_SIZE) & PUD_MASK;\
 + (__boundary - 1 < (end) - 1) ? __boundary : (end);  \
 +})

why do we need this?  We should only ever have 3 levels of page tables,
right?

  
  #define kvm_pmd_addr_end(addr, end)  \
 ({   u64 __boundary = ((addr) + PMD_SIZE) & PMD_MASK;\
 diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
 index 3c82b37..1055266 100644
 --- a/arch/arm/kvm/arm.c
 +++ b/arch/arm/kvm/arm.c
 @@ -234,6 +234,9 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
  struct kvm_userspace_memory_region *mem,
  enum kvm_mr_change change)
  {
 + /* Request for migration issued by user, write protect memory slot */

Does this necessarily only happen when there's a request for migration?
Isn't it just a log call that could be used for other things
(potentially)?

 + if ((change != KVM_MR_DELETE) && (mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
 + return kvm_mmu_slot_remove_write_access(kvm, mem->slot);
   return 0;
  }
  
 diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
 index 95c172a..85145d8 100644
 --- a/arch/arm/kvm/mmu.c
 +++ b/arch/arm/kvm/mmu.c
 @@ -743,6 +743,185 @@ static bool transparent_hugepage_adjust(pfn_t *pfnp, 
 phys_addr_t *ipap)
   return false;
  }
  
 +/* kvm_split_pmd - splits huge pages to small pages, required to keep a dirty
 + *   log of smaller memory granules, otherwise huge pages would need to be
 + *   migrated. Practically an idle system has problems migrating with

This seems abrupt.  Why can't we just represent a 2M huge page as 512 4K
bits and write protect the huge pages, if you take a write fault on a 2M
page, then split it then.

If your use case is HA, then you will be doing this a lot, and you don't
want to hurt performance of your main live system more than necessary.
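
(A sketch of that write-fault-time split, assuming a pre-allocated pte table
is available from the fault path's memory cache. Except for
kvm_tlb_flush_vmid() from this patch, the helpers used below are assumptions,
not the series' actual kvm_split_pmd() implementation.)

/* Split one stage-2 huge (PMD) mapping into 512 read-only 4k ptes so dirty
 * logging can continue at page granularity after a write fault. */
static void stage2_split_pmd(struct kvm *kvm, pmd_t *pmd,
                             struct kvm_mmu_memory_cache *cache)
{
        pte_t *pte_table = mmu_memory_cache_alloc(cache);
        pfn_t pfn = pmd_pfn(*pmd);
        int i;

        for (i = 0; i < PTRS_PER_PTE; i++)
                pte_table[i] = pfn_pte(pfn + i, PAGE_S2);  /* PAGE_S2 is read-only */

        pmd_clear(pmd);                    /* break the 2M block mapping first   */
        kvm_tlb_flush_vmid(kvm);           /* flush the stale huge-page TLB entry */
        pmd_populate_kernel(NULL, pmd, pte_table);  /* then install the pte table */
}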

 + *   huge 

Re: [PATCH v5 2/4] live migration support for initial write protect of VM

2014-05-15 Thread Mario Smarduch
On 05/15/2014 11:53 AM, Christoffer Dall wrote:
 
 [I know you sent out a newer version but I already reviewed some of this
 patch on the plane today but couldn't send it out before I got home.
 Anyway, here it is:]
 
 On Wed, May 07, 2014 at 05:40:14PM -0700, Mario Smarduch wrote:
 Patch adds support for live migration initial split up of huge pages
 in memory slot and write protection of all pages in memory slot.

 Signed-off-by: Mario Smarduch m.smard...@samsung.com
 ---
  arch/arm/include/asm/kvm_host.h |7 ++
  arch/arm/include/asm/kvm_mmu.h  |   16 +++-
  arch/arm/kvm/arm.c  |3 +
  arch/arm/kvm/mmu.c  |  179 
 +++
  virt/kvm/kvm_main.c |6 +-
  5 files changed, 209 insertions(+), 2 deletions(-)

 diff --git a/arch/arm/include/asm/kvm_host.h 
 b/arch/arm/include/asm/kvm_host.h
 index ac3bb65..91744c3 100644
 --- a/arch/arm/include/asm/kvm_host.h
 +++ b/arch/arm/include/asm/kvm_host.h
 @@ -67,6 +67,11 @@ struct kvm_arch {
  
  /* Interrupt controller */
  struct vgic_dist vgic;
 +/* Marks start of migration, used to handle 2nd stage page faults
 + * during migration, prevent installing huge pages and split huge pages
 + * to small pages.
 + */
 
 commenting style
 
 this is a bit verbose for a field in a struct, perhaps moving the longer
 version to where you set this?
Will do.
 
 +int migration_in_progress;
  };
  
  #define KVM_NR_MEM_OBJS 40
 @@ -233,4 +238,6 @@ int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, 
 u64 value);
  
  void kvm_tlb_flush_vmid(struct kvm *kvm);
  
 +int kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
 +
  #endif /* __ARM_KVM_HOST_H__ */
 diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
 index 5c7aa3c..b339fa9 100644
 --- a/arch/arm/include/asm/kvm_mmu.h
 +++ b/arch/arm/include/asm/kvm_mmu.h
 @@ -114,13 +114,27 @@ static inline void kvm_set_s2pmd_writable(pmd_t *pmd)
  pmd_val(*pmd) |= L_PMD_S2_RDWR;
  }
  
 +static inline void kvm_set_s2pte_readonly(pte_t *pte)
 +{
 +pte_val(*pte) &= ~(L_PTE_S2_RDONLY ^ L_PTE_S2_RDWR);
 
 This relies on the pte already having been set as RDONLY or RDWR, if you
 are creating a new pte and calling this function it could be easy to
 miss that distinction, I would prefer:
 
 pte_val(*pte) &= L_PTE_S2_RDWR;
 pte_val(*pte) |= L_PTE_S2_RDONLY;

Currently it's called only on set, or live, ptes; I'll change
it so it's applicable to all cases.
 
 +}
 +
 +static inline bool kvm_s2pte_readonly(pte_t *pte)
 +{
 +return (pte_val(*pte) & L_PTE_S2_RDWR) == L_PTE_S2_RDONLY;
 +}
 +
  /* Open coded p*d_addr_end that can deal with 64bit addresses */
  #define kvm_pgd_addr_end(addr, end) \
 ({  u64 __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;\
  (__boundary - 1 < (end) - 1)? __boundary: (end);\
  })
  
 -#define kvm_pud_addr_end(addr,end)  (end)
 +/* For 4-level table walk return PUD range end if end > 1GB */
 
 not sure you need this comment, the scheme is very common all over the
 kernel.
Yes.
 
 +#define kvm_pud_addr_end(addr, end) \
 +({  u64 __boundary = ((addr) + PUD_SIZE) & PUD_MASK;\
 +(__boundary - 1 < (end) - 1) ? __boundary : (end);  \
 +})
 
 why do we need this?  We should only ever have 3 levels of page tables,
 right?

I removed this in the v6 patch.
 
  
  #define kvm_pmd_addr_end(addr, end) \
 ({  u64 __boundary = ((addr) + PMD_SIZE) & PMD_MASK;\
 diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
 index 3c82b37..1055266 100644
 --- a/arch/arm/kvm/arm.c
 +++ b/arch/arm/kvm/arm.c
 @@ -234,6 +234,9 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 struct kvm_userspace_memory_region *mem,
 enum kvm_mr_change change)
  {
 +/* Request for migration issued by user, write protect memory slot */
 
 Does this necessarily only happen when there's a request for migration?
 Isn't it just a log call that could be used for other things
 (potentially)?


From the QEMU view, the migration thread calls the KVM memory listener
kvm_log_global_start(), and that kicks off dirty log tracking for each
memslot. There are other operations, like region add (kvm_region_add),
that start kvm_log_start() for that memslot, or the odd case where you
add a region that overlaps existing regions and may end up logging the
whole region.

But in either case it appears you're migrating already.

Beyond that, no, I don't see any other feature that triggers this.
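
(For context, roughly how these hooks are wired up on the QEMU side; the
callback names follow QEMU's MemoryListener of that era, but treat the exact
signatures, header path, and stub bodies as assumptions for illustration.)

#include "exec/memory.h"

/* Stubs for illustration; the real implementations live in QEMU's kvm-all.c */
static void kvm_log_global_start(MemoryListener *l) { /* enable logging on all slots  */ }
static void kvm_log_global_stop(MemoryListener *l)  { /* disable logging on all slots */ }
static void kvm_log_start(MemoryListener *l, MemoryRegionSection *s) { /* one memslot */ }
static void kvm_log_stop(MemoryListener *l, MemoryRegionSection *s)  { /* one memslot */ }

/* Migration start/cancel toggles logging for every slot through the global
 * hooks; region_add/log_start cover the per-memslot cases mentioned above. */
static MemoryListener kvm_memory_listener = {
        .log_start        = kvm_log_start,
        .log_stop         = kvm_log_stop,
        .log_global_start = kvm_log_global_start,
        .log_global_stop  = kvm_log_global_stop,
};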

 
 +if ((change != KVM_MR_DELETE) && (mem->flags & KVM_MEM_LOG_DIRTY_PAGES))
 +return kvm_mmu_slot_remove_write_access(kvm, mem->slot);
  return 0;
  }
  
 diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
 index 95c172a..85145d8 100644
 --- a/arch/arm/kvm/mmu.c
 +++ b/arch/arm/kvm/mmu.c
 @@ -743,6 +743,185 @@