Re: Using virtio for inter-VM communication

2014-06-15 Thread Jan Kiszka
On 2014-06-13 10:45, Paolo Bonzini wrote:
 On 13/06/2014 08:23, Jan Kiszka wrote:
 That would preserve zero-copy capabilities (as long as you can work
 against the shared mem directly, e.g. doing DMA from a physical NIC or
 storage device into it) and keep the hypervisor out of the loop.
 
  This seems ill thought out.  How will you program a NIC via the virtio
  protocol without a hypervisor?  And how will you make it safe?  You'll
  need an IOMMU.  But if you have an IOMMU you don't need shared memory.

 Scenarios behind this are things like driver VMs: You pass through the
 physical hardware to a driver guest that talks to the hardware and
 relays data via one or more virtual channels to other VMs. This confines
 a certain set of security and stability risks to the driver VM.
 
 I think implementing Xen hypercalls in jailhouse for grant table and
 event channels would actually make a lot of sense.  The Xen
 implementation is 2.5kLOC and I think it should be possible to compact
 it noticeably, especially if you limit yourself to 64-bit guests.

At least the grant table model seems unsuited for Jailhouse: it allows a
guest to influence the mappings of another guest at runtime, which is
something we want (or even have) to avoid in Jailhouse.

I'm therefore more in favor of a model where the shared memory region is
defined on cell (guest) creation by adding a virtual device that comes
with such a region.
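
Roughly, I picture something like the following in the static cell
configuration -- purely illustrative C with made-up field names, not
actual Jailhouse structures:

    #include <linux/types.h>

    /*
     * Hypothetical sketch of a statically configured shared-memory
     * "device": the region is fixed when the cell is created and the
     * hypervisor never changes the mapping afterwards.
     */
    struct shmem_region_desc {
            __u64 phys_start;   /* host-physical base of the shared window */
            __u64 size;         /* fixed at cell creation time */
            __u32 flags;        /* e.g. one side read-only */
            __u32 peer_cell;    /* the cell on the other end */
    };

    struct cell_config_example {
            char name[32];
            __u32 num_shmem_regions;
            struct shmem_region_desc shmem_regions[1];
    };

The guests would then only exchange data and notifications inside that
pre-established window, without any runtime mapping changes.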

Jan

 
 It should also be almost enough to run Xen PVH guests as jailhouse
 partitions.
 
 If later Xen starts to support virtio, you will get that for free.
 
 Paolo






[PATCH 04/11] qspinlock: Extract out the exchange of tail code word

2014-06-15 Thread Peter Zijlstra
From: Waiman Long waiman.l...@hp.com

This patch extracts the logic for the exchange of new and previous tail
code words into a new xchg_tail() function which can be optimized in a
later patch.

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 include/asm-generic/qspinlock_types.h |2 +
 kernel/locking/qspinlock.c|   58 +-
 2 files changed, 38 insertions(+), 22 deletions(-)

--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -61,6 +61,8 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_BITS   (32 - _Q_TAIL_CPU_OFFSET)
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
+#define _Q_TAIL_MASK   (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
+
 #define _Q_LOCKED_VAL  (1U << _Q_LOCKED_OFFSET)
 #define _Q_PENDING_VAL (1U << _Q_PENDING_OFFSET)
 
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -86,6 +86,31 @@ static inline struct mcs_spinlock *decod
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
 
 /**
+ * xchg_tail - Put in the new queue tail code word & retrieve previous one
+ * @lock : Pointer to queue spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail)
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+   u32 old, new, val = atomic_read(&lock->val);
+
+   for (;;) {
+   new = (val & _Q_LOCKED_PENDING_MASK) | tail;
+   old = atomic_cmpxchg(&lock->val, val, new);
+   if (old == val)
+   break;
+
+   val = old;
+   }
+   return old;
+}
+
+/**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue spinlock 32-bit word
@@ -182,36 +207,25 @@ void queue_spin_lock_slowpath(struct qsp
node->next = NULL;
 
/*
-* we already touched the queueing cacheline; don't bother with pending
-* stuff.
-*
-* trylock || xchg(lock, node)
-*
-* 0,0,0 -> 0,0,1 ; trylock
-* p,y,x -> n,y,x ; prev = xchg(lock, node)
+* We touched a (possibly) cold cacheline in the per-cpu queue node;
+* attempt the trylock once more in the hope someone let go while we
+* weren't watching.
 */
-   for (;;) {
-   new = _Q_LOCKED_VAL;
-   if (val)
-   new = tail | (val & _Q_LOCKED_PENDING_MASK);
-
-   old = atomic_cmpxchg(&lock->val, val, new);
-   if (old == val)
-   break;
-
-   val = old;
-   }
+   if (queue_spin_trylock(lock))
+   goto release;
 
/*
-* we won the trylock; forget about queueing.
+* we already touched the queueing cacheline; don't bother with pending
+* stuff.
+*
+* p,*,* -> n,*,*
 */
-   if (new == _Q_LOCKED_VAL)
-   goto release;
+   old = xchg_tail(lock, tail);
 
/*
 * if there was a previous node; link it and wait.
 */
-   if (old & ~_Q_LOCKED_PENDING_MASK) {
+   if (old & _Q_TAIL_MASK) {
prev = decode_tail(old);
ACCESS_ONCE(prev->next) = node;
 




[PATCH 02/11] qspinlock, x86: Enable x86-64 to use queue spinlock

2014-06-15 Thread Peter Zijlstra
From: Waiman Long waiman.l...@hp.com

This patch makes the necessary changes at the x86 architecture-specific
layer to enable the use of the queue spinlock for x86-64. As x86-32
machines are typically not multi-socket, the benefit of the queue
spinlock may not be apparent, so it is not enabled there.

Currently, there are some incompatibilities between the para-virtualized
spinlock code (which hard-codes the use of the ticket spinlock) and the
queue spinlock. Therefore, the use of the queue spinlock is disabled when
the para-virtualized spinlock is enabled.

The arch/x86/include/asm/qspinlock.h header file includes some x86
specific optimization which will make the queue spinlock code perform
better than the generic implementation.

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 arch/x86/Kconfig  |1 +
 arch/x86/include/asm/qspinlock.h  |   25 +
 arch/x86/include/asm/spinlock.h   |5 +
 arch/x86/include/asm/spinlock_types.h |4 
 4 files changed, 35 insertions(+)
 create mode 100644 arch/x86/include/asm/qspinlock.h

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -29,6 +29,7 @@ config X86
select ARCH_SUPPORTS_NUMA_BALANCING if X86_64
select ARCH_SUPPORTS_INT128 if X86_64
select ARCH_WANTS_PROT_NUMA_PROT_NONE
+   select ARCH_USE_QUEUE_SPINLOCK
select HAVE_IDE
select HAVE_OPROFILE
select HAVE_PCSPKR_PLATFORM
--- /dev/null
+++ b/arch/x86/include/asm/qspinlock.h
@@ -0,0 +1,25 @@
+#ifndef _ASM_X86_QSPINLOCK_H
+#define _ASM_X86_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
+
+#define queue_spin_unlock queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ *
+ * An effective smp_store_release() on the least-significant byte.
+ */
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+   barrier();
+   ACCESS_ONCE(*(u8 *)lock) = 0;
+}
+
+#endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */
+
+#include <asm-generic/qspinlock.h>
+
+#endif /* _ASM_X86_QSPINLOCK_H */
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -42,6 +42,10 @@
 extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm/qspinlock.h>
+#else
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 
 static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
@@ -180,6 +184,7 @@ static __always_inline void arch_spin_lo
 {
arch_spin_lock(lock);
 }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -23,6 +23,9 @@ typedef u32 __ticketpair_t;
 
 #define TICKET_SHIFT   (sizeof(__ticket_t) * 8)
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm-generic/qspinlock_types.h>
+#else
 typedef struct arch_spinlock {
union {
__ticketpair_t head_tail;
@@ -33,6 +36,7 @@ typedef struct arch_spinlock {
 } arch_spinlock_t;
 
 #define __ARCH_SPIN_LOCK_UNLOCKED  { { 0 } }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 #ifdef CONFIG_QUEUE_RWLOCK
 #include <asm-generic/qrwlock_types.h>




[PATCH 1/6] KVM: x86: bit-ops emulation ignores offset on 64-bit

2014-06-15 Thread Nadav Amit
The current emulation of bit operations ignores the offset from the destination
on 64-bit target memory operands. This patch fixes this behavior.
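
As a standalone illustration of the addressing behaviour the fix restores
(this models the architectural semantics with hypothetical helper names;
it is not the KVM emulator code):

    #include <stdint.h>
    #include <stdio.h>

    /*
     * For a bit operation on a 64-bit memory operand, the bit offset in
     * the source register also displaces the effective address in
     * 8-byte steps; only the low bits select the bit inside the word.
     */
    static void bit_op_target(uint64_t ea, int64_t bit_offset, int op_bytes,
                              uint64_t *adj_ea, unsigned int *bit)
    {
            int64_t mask = ~((long)op_bytes * 8 - 1);   /* ~63 for 64-bit */
            int64_t sv = bit_offset & mask;             /* whole-word part */

            *adj_ea = ea + (sv >> 3);                   /* byte displacement */
            *bit = bit_offset & (op_bytes * 8 - 1);     /* bit within word */
    }

    int main(void)
    {
            uint64_t ea;
            unsigned int bit;

            /* bt qword [0x1000], 70 -> touches the qword at 0x1008, bit 6 */
            bit_op_target(0x1000, 70, 8, &ea, &bit);
            printf("ea=0x%llx bit=%u\n", (unsigned long long)ea, bit);
            return 0;
    }

Without the new 64-bit branch in fetch_bit_operand(), the displacement
stayed zero and the wrong quadword was accessed.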

Signed-off-by: Nadav Amit na...@cs.technion.ac.il
---
 arch/x86/kvm/emulate.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index e4e833d..f0b0a10 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1220,12 +1220,14 @@ static void fetch_bit_operand(struct x86_emulate_ctxt *ctxt)
long sv = 0, mask;
 
if (ctxt->dst.type == OP_MEM && ctxt->src.type == OP_REG) {
-   mask = ~(ctxt->dst.bytes * 8 - 1);
+   mask = ~((long)ctxt->dst.bytes * 8 - 1);

if (ctxt->src.bytes == 2)
sv = (s16)ctxt->src.val & (s16)mask;
else if (ctxt->src.bytes == 4)
sv = (s32)ctxt->src.val & (s32)mask;
+   else
+   sv = (s64)ctxt->src.val & (s64)mask;

ctxt->dst.addr.mem.ea += (sv >> 3);
}
-- 
1.9.1



[PATCH 6/6] KVM: x86: check DR6/7 high-bits are clear only on long-mode

2014-06-15 Thread Nadav Amit
From: Nadav Amit nadav.a...@gmail.com

When the guest sets DR6 and DR7, KVM checks that the high 32 bits are clear
and otherwise injects a #GP exception. This exception should only be
injected if running in long mode.

Signed-off-by: Nadav Amit na...@cs.technion.ac.il
---
 arch/x86/kvm/x86.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 57eac30..71fe841 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -756,6 +756,15 @@ static void kvm_update_dr7(struct kvm_vcpu *vcpu)
vcpu->arch.switch_db_regs |= KVM_DEBUGREG_BP_ENABLED;
 }
 
+static bool is_64_bit_mode(struct kvm_vcpu *vcpu)
+{
+   int cs_db, cs_l;
+   if (!is_long_mode(vcpu))
+   return false;
kvm_x86_ops->get_cs_db_l_bits(vcpu, &cs_db, &cs_l);
+   return cs_l;
+}
+
 static int __kvm_set_dr(struct kvm_vcpu *vcpu, int dr, unsigned long val)
 {
switch (dr) {
@@ -769,7 +778,7 @@ static int __kvm_set_dr(struct kvm_vcpu *vcpu, int dr, unsigned long val)
return 1; /* #UD */
/* fall through */
case 6:
-   if (val & 0xffffffff00000000ULL)
+   if ((val & 0xffffffff00000000ULL) && is_64_bit_mode(vcpu))
return -1; /* #GP */
vcpu->arch.dr6 = (val & DR6_VOLATILE) | DR6_FIXED_1;
kvm_update_dr6(vcpu);
@@ -779,7 +788,7 @@ static int __kvm_set_dr(struct kvm_vcpu *vcpu, int dr, unsigned long val)
return 1; /* #UD */
/* fall through */
default: /* 7 */
-   if (val & 0xffffffff00000000ULL)
+   if ((val & 0xffffffff00000000ULL) && is_64_bit_mode(vcpu))
return -1; /* #GP */
vcpu->arch.dr7 = (val & DR7_VOLATILE) | DR7_FIXED_1;
kvm_update_dr7(vcpu);
-- 
1.9.1



[PATCH 09/11] pvqspinlock, x86: Rename paravirt_ticketlocks_enabled

2014-06-15 Thread Peter Zijlstra
From: Waiman Long waiman.l...@hp.com

This patch renames the paravirt_ticketlocks_enabled static key to a
more generic paravirt_spinlocks_enabled name.

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 arch/x86/include/asm/spinlock.h  |4 ++--
 arch/x86/kernel/kvm.c|2 +-
 arch/x86/kernel/paravirt-spinlocks.c |4 ++--
 arch/x86/xen/spinlock.c  |2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -39,7 +39,7 @@
 /* How long a lock should spin before we consider blocking */
 #define SPIN_THRESHOLD (1 << 15)
 
-extern struct static_key paravirt_ticketlocks_enabled;
+extern struct static_key paravirt_spinlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
 #ifdef CONFIG_QUEUE_SPINLOCK
@@ -150,7 +150,7 @@ static inline void __ticket_unlock_slowp
 static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
if (TICKET_SLOWPATH_FLAG &&
-   static_key_false(&paravirt_ticketlocks_enabled)) {
+   static_key_false(&paravirt_spinlocks_enabled)) {
arch_spinlock_t prev;
 
prev = *lock;
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -819,7 +819,7 @@ static __init int kvm_spinlock_init_jump
if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
return 0;
 
-   static_key_slow_inc(&paravirt_ticketlocks_enabled);
+   static_key_slow_inc(&paravirt_spinlocks_enabled);
printk(KERN_INFO "KVM setup paravirtual spinlock\n");
 
return 0;
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -16,5 +16,5 @@ struct pv_lock_ops pv_lock_ops = {
 };
 EXPORT_SYMBOL(pv_lock_ops);
 
-struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE;
-EXPORT_SYMBOL(paravirt_ticketlocks_enabled);
+struct static_key paravirt_spinlocks_enabled = STATIC_KEY_INIT_FALSE;
+EXPORT_SYMBOL(paravirt_spinlocks_enabled);
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -293,7 +293,7 @@ static __init int xen_init_spinlocks_jum
if (!xen_domain())
return 0;
 
-   static_key_slow_inc(&paravirt_ticketlocks_enabled);
+   static_key_slow_inc(&paravirt_spinlocks_enabled);
return 0;
 }
 early_initcall(xen_init_spinlocks_jump);




[PATCH 11/11] qspinlock, kvm: Add paravirt support

2014-06-15 Thread Peter Zijlstra


Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 arch/x86/kernel/kvm.c |   58 ++
 kernel/Kconfig.locks  |2 -
 2 files changed, 59 insertions(+), 1 deletion(-)

Index: linux-2.6/arch/x86/kernel/kvm.c
===
--- linux-2.6.orig/arch/x86/kernel/kvm.c
+++ linux-2.6/arch/x86/kernel/kvm.c
@@ -569,6 +569,7 @@ static void kvm_kick_cpu(int cpu)
kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
 }
 
+#ifndef CONFIG_QUEUE_SPINLOCK
 enum kvm_contention_stat {
TAKEN_SLOW,
TAKEN_SLOW_PICKUP,
@@ -796,6 +797,51 @@ static void kvm_unlock_kick(struct arch_
}
}
 }
+#else /* QUEUE_SPINLOCK */
+
+#include <asm-generic/qspinlock.h>
+
+PV_CALLEE_SAVE_REGS_THUNK(__pv_init_node);
+PV_CALLEE_SAVE_REGS_THUNK(__pv_link_and_wait_node);
+PV_CALLEE_SAVE_REGS_THUNK(__pv_kick_node);
+
+PV_CALLEE_SAVE_REGS_THUNK(__pv_wait_head);
+PV_CALLEE_SAVE_REGS_THUNK(__pv_queue_unlock);
+
+void kvm_wait(int *ptr, int val)
+{
+   unsigned long flags;
+
+   if (in_nmi())
+   return;
+
+   /*
+* Make sure an interrupt handler can't upset things in a
+* partially setup state.
+*/
+   local_irq_save(flags);
+
+   /*
+* check again make sure it didn't become free while
+* we weren't looking.
+*/
+   if (ACCESS_ONCE(*ptr) != val)
+   goto out;
+
+   /*
+* halt until it's our turn and kicked. Note that we do safe halt
+* for irq enabled case to avoid hang when lock info is overwritten
+* in irq spinlock slowpath and no spurious interrupt occur to save us.
+*/
+   if (arch_irqs_disabled_flags(flags))
+   halt();
+   else
+   safe_halt();
+
+out:
+   local_irq_restore(flags);
+}
+#endif /* QUEUE_SPINLOCK */
 
 /*
  * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
@@ -808,8 +854,20 @@ void __init kvm_spinlock_init(void)
if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
return;
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+   pv_lock_ops.init_node = PV_CALLEE_SAVE(__pv_init_node);
+   pv_lock_ops.link_and_wait_node = PV_CALLEE_SAVE(__pv_link_and_wait_node);
+   pv_lock_ops.kick_node = PV_CALLEE_SAVE(__pv_kick_node);
+
+   pv_lock_ops.wait_head = PV_CALLEE_SAVE(__pv_wait_head);
+   pv_lock_ops.queue_unlock = PV_CALLEE_SAVE(__pv_queue_unlock);
+
+   pv_lock_ops.wait = kvm_wait;
+   pv_lock_ops.kick = kvm_kick_cpu;
+#else
pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(kvm_lock_spinning);
pv_lock_ops.unlock_kick = kvm_unlock_kick;
+#endif
 }
 
 static __init int kvm_spinlock_init_jump(void)
Index: linux-2.6/kernel/Kconfig.locks
===
--- linux-2.6.orig/kernel/Kconfig.locks
+++ linux-2.6/kernel/Kconfig.locks
@@ -229,7 +229,7 @@ config ARCH_USE_QUEUE_SPINLOCK
 
 config QUEUE_SPINLOCK
def_bool y if ARCH_USE_QUEUE_SPINLOCK
-   depends on SMP  !PARAVIRT_SPINLOCKS
+   depends on SMP && !(PARAVIRT_SPINLOCKS && XEN)
 
 config ARCH_USE_QUEUE_RWLOCK
bool




[PATCH 5/6] KVM: x86: NOP emulation clears (incorrectly) the high 32-bits of RAX

2014-06-15 Thread Nadav Amit
In long mode, the current NOP (0x90) emulation still writes back to RAX.  As a
result, EAX is zero-extended and the high 32 bits of RAX are cleared.
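
A standalone illustration of the effect (not the emulator code):

    #include <stdint.h>
    #include <stdio.h>

    /*
     * In 64-bit mode a 32-bit register write zero-extends into the full
     * 64-bit register, so emulating NOP as "xchg eax, eax" with a
     * write-back clobbers the upper half of RAX.
     */
    int main(void)
    {
            uint64_t rax = 0x1234567890abcdefULL;

            rax = (uint32_t)rax;    /* the bogus EAX write-back */
            printf("rax: 0x%016llx\n", (unsigned long long)rax);
            /* prints 0x0000000090abcdef -- the high 32 bits are gone */
            return 0;
    }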

Signed-off-by: Nadav Amit na...@cs.technion.ac.il
---
 arch/x86/kvm/emulate.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index b354531..eb93eb4 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -4663,8 +4663,9 @@ special_insn:
break;
case 0x90 ... 0x97: /* nop / xchg reg, rax */
if (ctxt->dst.addr.reg == reg_rmw(ctxt, VCPU_REGS_RAX))
-   break;
-   rc = em_xchg(ctxt);
+   ctxt->dst.type = OP_NONE;
+   else
+   rc = em_xchg(ctxt);
break;
case 0x98: /* cbw/cwde/cdqe */
switch (ctxt->op_bytes) {
-- 
1.9.1



Re: [PATCH v11 06/16] qspinlock: prolong the stay in the pending bit path

2014-06-15 Thread Peter Zijlstra
On Thu, Jun 12, 2014 at 04:54:52PM -0400, Waiman Long wrote:
 If two tasks see the pending bit go away and try to grab it with cmpxchg,
 there is no way we can avoid the contention. However, if somehow the
 pending bit holder gets the lock and another task sets the pending bit before
 the current task, the spinlock value will become
 _Q_PENDING_VAL|_Q_LOCKED_VAL. The while loop will end and the code will
 blindly try to do a cmpxchg unless we check for this case beforehand. This
 is what my code does by going back to the beginning of the for loop.

There is already a test for that; see the goto queue;

---

/*
 * wait for in-progress pending-locked hand-overs
 *
 * 0,1,0 -> 0,0,1
 */
if (val == _Q_PENDING_VAL) {
while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
cpu_relax();
}

/*
 * trylock || pending
 *
 * 0,0,0 -> 0,0,1 ; trylock
 * 0,0,1 -> 0,1,1 ; pending
 */
for (;;) {
/*
 * If we observe any contention; queue.
 */
if (val & ~_Q_LOCKED_MASK)
goto queue;

new = _Q_LOCKED_VAL;
if (val == new)
new |= _Q_PENDING_VAL;

old = atomic_cmpxchg(&lock->val, val, new);
if (old == val)
break;

val = old;
}





[PATCH 2/6] KVM: x86: Wrong emulation on 'xadd X, X'

2014-06-15 Thread Nadav Amit
The emulator does not emulate the xadd instruction correctly if the two
operands are the same.  In this (unlikely) situation the result should be the
sum of X and X (2X), whereas it is currently X.  The solution is to first
perform writeback to the source, before writing to the destination.  The only
instruction which should be affected is xadd, as the other instructions that
perform writeback to the source use the extended accumulator (e.g., RAX:RDX).
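
A toy model of the write-back ordering for "xadd %rax, %rax" (not the real
emulator code): architecturally TEMP = SRC + DEST, SRC = old DEST,
DEST = TEMP, so with both operands in RAX the result must be 2*X.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint64_t rax = 21;
            uint64_t old = rax;             /* operand value at decode time */
            uint64_t dst_val = old + old;   /* TEMP = SRC + DEST = 2*X */
            uint64_t src_val = old;         /* SRC  = old DEST   = X   */

            /* old (buggy) order: destination first, then source */
            rax = dst_val;
            rax = src_val;
            printf("dst-then-src: %llu (wrong)\n", (unsigned long long)rax);

            /* fixed order: source first, then destination */
            rax = src_val;
            rax = dst_val;
            printf("src-then-dst: %llu (correct)\n", (unsigned long long)rax);
            return 0;
    }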

Signed-off-by: Nadav Amit na...@cs.technion.ac.il
---
 arch/x86/kvm/emulate.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index f0b0a10..3c8d867 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -4711,17 +4711,17 @@ special_insn:
goto done;
 
 writeback:
-   if (!(ctxt->d & NoWrite)) {
-   rc = writeback(ctxt, &ctxt->dst);
-   if (rc != X86EMUL_CONTINUE)
-   goto done;
-   }
if (ctxt->d & SrcWrite) {
BUG_ON(ctxt->src.type == OP_MEM || ctxt->src.type == OP_MEM_STR);
rc = writeback(ctxt, &ctxt->src);
if (rc != X86EMUL_CONTINUE)
goto done;
}
+   if (!(ctxt->d & NoWrite)) {
+   rc = writeback(ctxt, &ctxt->dst);
+   if (rc != X86EMUL_CONTINUE)
+   goto done;
+   }
 
/*
 * restore dst type in case the decoding will be reused
-- 
1.9.1



Re: [PATCH v11 09/16] qspinlock, x86: Allow unfair spinlock in a virtual guest

2014-06-15 Thread Peter Zijlstra
On Thu, Jun 12, 2014 at 05:08:28PM -0400, Waiman Long wrote:
 Native performance is king, try your very utmost bestest to preserve
 that, paravirt is a distant second and nobody sane should care about the
 virt case at all.
 
 The patch won't affect native performance unless the kernel is built with
 VIRT_UNFAIR_LOCKS selected. The same is also true when PARAVIRT_SPINLOCKS is
 selected. There is no way around that.

VIRT_UNFAIR_LOCKS is an impossible switch to have; a distro cannot make
the right choice.

 I do agree that I may over-engineer on this patch,

Simple things first, then add complexity.


[PATCH 4/6] KVM: x86: emulation of dword cmov on long-mode should clear [63:32]

2014-06-15 Thread Nadav Amit
Even if the condition of cmov is not satisfied, bits[63:32] should be cleared.
This is clearly stated in Intel's CMOVcc documentation.  The solution is to
reassign the destination onto itself if the condition is unsatisfied.  For that
matter the original destination value needs to be read.
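
A standalone model of the 32-bit CMOVcc semantics in 64-bit mode (not the
emulator code): even when the condition is false, the destination register
is written back as a 32-bit value, which zero-extends and clears bits 63:32.

    #include <stdint.h>
    #include <stdio.h>

    static uint64_t cmov32(uint64_t dst, uint64_t src, int cond)
    {
            uint32_t result = cond ? (uint32_t)src : (uint32_t)dst;

            return (uint64_t)result;        /* 32-bit write zero-extends */
    }

    int main(void)
    {
            uint64_t rbx = 0xffffffff00000005ULL, rax = 0x9;

            /* cmovne %eax, %ebx with ZF set (condition false) */
            printf("0x%016llx\n", (unsigned long long)cmov32(rbx, rax, 0));
            /* prints 0x0000000000000005: low half kept, high half cleared */
            return 0;
    }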

Signed-off-by: Nadav Amit na...@cs.technion.ac.il
---
 arch/x86/kvm/emulate.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 0183350..b354531 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -3905,7 +3905,7 @@ static const struct opcode twobyte_table[256] = {
N, N,
N, N, N, N, N, N, N, N,
/* 0x40 - 0x4F */
-   X16(D(DstReg | SrcMem | ModRM | Mov)),
+   X16(D(DstReg | SrcMem | ModRM)),
/* 0x50 - 0x5F */
N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N,
/* 0x60 - 0x6F */
@@ -4799,8 +4799,10 @@ twobyte_insn:
ops->get_dr(ctxt, ctxt->modrm_reg, &ctxt->dst.val);
break;
case 0x40 ... 0x4f: /* cmov */
-   ctxt->dst.val = ctxt->dst.orig_val = ctxt->src.val;
-   if (!test_cc(ctxt->b, ctxt->eflags))
+   if (test_cc(ctxt->b, ctxt->eflags))
+   ctxt->dst.val = ctxt->src.val;
+   else if (ctxt->mode != X86EMUL_MODE_PROT64 ||
+            ctxt->op_bytes != 4)
ctxt->dst.type = OP_NONE; /* no writeback */
break;
case 0x80 ... 0x8f: /* jnz rel, etc*/
-- 
1.9.1



[PATCH 0/6] KVM: x86: More emulator bugs

2014-06-15 Thread Nadav Amit
This patch-set resolves several emulator bugs. Each fix is independent of the
others.  The DR6/7 bug can occur during a DR-access exit (regardless of
unrestricted mode, MMIO and SPT).

Thanks for reviewing the patches,
Nadav

Nadav Amit (6):
  KVM: x86: bit-ops emulation ignores offset on 64-bit
  KVM: x86: Wrong emulation on 'xadd X, X'
  KVM: x86: Inter-privilege level ret emulation is not implemented
  KVM: x86: emulation of dword cmov on long-mode should clear [63:32]
  KVM: x86: NOP emulation clears (incorrectly) the high 32-bits of RAX
  KVM: x86: check DR6/7 high-bits are clear only on long-mode

 arch/x86/kvm/emulate.c | 31 ---
 arch/x86/kvm/x86.c | 13 +++--
 2 files changed, 31 insertions(+), 13 deletions(-)

-- 
1.9.1



[PATCH 3/6] KVM: x86: Inter-privilege level ret emulation is not implemented

2014-06-15 Thread Nadav Amit
Return an unhandleable error on an inter-privilege level ret instruction,
since the current emulation does not check the privilege level correctly when
loading CS and does not pop RSP/SS as needed.

Signed-off-by: Nadav Amit na...@cs.technion.ac.il
---
 arch/x86/kvm/emulate.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 3c8d867..0183350 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2019,6 +2019,7 @@ static int em_ret_far(struct x86_emulate_ctxt *ctxt)
 {
int rc;
unsigned long cs;
+   int cpl = ctxt->ops->cpl(ctxt);
 
rc = emulate_pop(ctxt, &ctxt->_eip, ctxt->op_bytes);
if (rc != X86EMUL_CONTINUE)
@@ -2028,6 +2029,9 @@ static int em_ret_far(struct x86_emulate_ctxt *ctxt)
rc = emulate_pop(ctxt, &cs, ctxt->op_bytes);
if (rc != X86EMUL_CONTINUE)
return rc;
+   /* Outer-privilege level return is not implemented */
+   if (ctxt->mode >= X86EMUL_MODE_PROT16 && (cs & 3) > cpl)
+   return X86EMUL_UNHANDLEABLE;
rc = load_segment_descriptor(ctxt, (u16)cs, VCPU_SREG_CS);
return rc;
 }
-- 
1.9.1



[PATCH 06/11] qspinlock: Optimize pending bit

2014-06-15 Thread Peter Zijlstra
XXX: merge into the pending bit patch..

It is possible to observe the pending bit without the locked bit when
the last owner has just released but the pending owner has not yet
taken ownership.

In this case we would normally queue -- because the pending bit is
already taken. However, in this case the pending bit is guaranteed to
be released 'soon', therefore wait for it and avoid queueing.

Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 kernel/locking/qspinlock.c |   10 ++
 1 file changed, 10 insertions(+)

Index: linux-2.6/kernel/locking/qspinlock.c
===
--- linux-2.6.orig/kernel/locking/qspinlock.c
+++ linux-2.6/kernel/locking/qspinlock.c
@@ -226,6 +226,16 @@ void queue_spin_lock_slowpath(struct qsp
BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
 
/*
+* wait for in-progress pending-locked hand-overs
+*
+* 0,1,0 -> 0,0,1
+*/
+   if (val == _Q_PENDING_VAL) {
+   while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
+   cpu_relax();
+   }
+
+   /*
 * trylock || pending
 *
 * 0,0,0 -> 0,0,1 ; trylock




[PATCH 08/11] qspinlock: Revert to test-and-set on hypervisors

2014-06-15 Thread Peter Zijlstra
When we detect a hypervisor (!paravirt, see later patches), revert to
a simple test-and-set lock to avoid the horrors of queue preemption.

Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 arch/x86/include/asm/qspinlock.h |   14 ++
 include/asm-generic/qspinlock.h  |7 +++
 kernel/locking/qspinlock.c   |3 +++
 3 files changed, 24 insertions(+)

--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -1,6 +1,7 @@
 #ifndef _ASM_X86_QSPINLOCK_H
 #define _ASM_X86_QSPINLOCK_H
 
+#include <asm/cpufeature.h>
 #include <asm-generic/qspinlock_types.h>
 
 #if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
@@ -20,6 +21,19 @@ static inline void queue_spin_unlock(str
 
 #endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */
 
+#define virt_queue_spin_lock virt_queue_spin_lock
+
+static inline bool virt_queue_spin_lock(struct qspinlock *lock)
+{
+   if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
+   return false;
+
+   while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
+   cpu_relax();
+
+   return true;
+}
+
 #include <asm-generic/qspinlock.h>
 
 #endif /* _ASM_X86_QSPINLOCK_H */
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -98,6 +98,13 @@ static __always_inline void queue_spin_u
 }
 #endif
 
+#ifndef virt_queue_spin_lock
+static __always_inline bool virt_queue_spin_lock(struct qspinlock *lock)
+{
+   return false;
+}
+#endif
+
 /*
  * Initializier
  */
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -247,6 +247,9 @@ void queue_spin_lock_slowpath(struct qsp
 
BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
 
+   if (virt_queue_spin_lock(lock))
+   return;
+
/*
 * wait for in-progress pending-locked hand-overs
 *




[PATCH 03/11] qspinlock: Add pending bit

2014-06-15 Thread Peter Zijlstra
Because the qspinlock needs to touch a second cacheline, add a pending
bit and allow a single in-word spinner before we punt to the second
cacheline.

Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 include/asm-generic/qspinlock_types.h |   12 ++-
 kernel/locking/qspinlock.c|  109 +++---
 2 files changed, 97 insertions(+), 24 deletions(-)

--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -39,8 +39,9 @@ typedef struct qspinlock {
  * Bitfields in the atomic value:
  *
  *  0- 7: locked byte
- *  8- 9: tail index
- * 10-31: tail cpu (+1)
+ * 8: pending
+ *  9-10: tail index
+ * 11-31: tail cpu (+1)
  */
 #define _Q_SET_MASK(type)   (((1U << _Q_ ## type ## _BITS) - 1)\
                                      << _Q_ ## type ## _OFFSET)
@@ -48,7 +49,11 @@ typedef struct qspinlock {
 #define _Q_LOCKED_BITS 8
 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED)
 
-#define _Q_TAIL_IDX_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_OFFSET  (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_BITS1
+#define _Q_PENDING_MASK_Q_SET_MASK(PENDING)
+
+#define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
 #define _Q_TAIL_IDX_BITS   2
 #define _Q_TAIL_IDX_MASK   _Q_SET_MASK(TAIL_IDX)
 
@@ -57,5 +62,6 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
 #define _Q_LOCKED_VAL  (1U << _Q_LOCKED_OFFSET)
+#define _Q_PENDING_VAL (1U << _Q_PENDING_OFFSET)
 
 #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -83,24 +83,28 @@ static inline struct mcs_spinlock *decod
return per_cpu_ptr(&mcs_nodes[idx], cpu);
 }
 
+#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
+
 /**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue spinlock 32-bit word
  *
- * (queue tail, lock bit)
- *
- *              fast      :    slow                                  :    unlock
- *                        :                                          :
- * uncontended  (0,0)   --:--> (0,1) --------------------------------:--> (*,0)
- *                        :       | ^--------.                    /  :
- *                        :       v           \                   |  :
- * uncontended            :    (n,x) --+--> (n,0)                 |  :
- *   queue                :       | ^--'                          |  :
- *                        :       v                               |  :
- * contended              :    (*,x) --+--> (*,0) -----> (*,1) ---'  :
- *   queue                :         ^--'                             :
+ * (queue tail, pending bit, lock bit)
  *
+ *              fast     :    slow                                  :    unlock
+ *                       :                                          :
+ * uncontended  (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
+ *                       :       | ^--------.------.             /  :
+ *                       :       v           \      \            |  :
+ * pending               :    (0,1,1) +--> (0,1,0)   \           |  :
+ *                       :       | ^--'              |           |  :
+ *                       :       v                   |           |  :
+ * uncontended           :    (n,x,y) +--> (n,0,0) --'            |  :
+ *   queue               :       | ^--'                           |  :
+ *                       :       v                                |  :
+ * contended             :    (*,x,y) +--> (*,0,0) ---> (*,0,1) -'  :
+ *   queue               :         ^--'                             :
  */
 void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 {
@@ -110,6 +114,65 @@ void queue_spin_lock_slowpath(struct qsp
 
BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
 
+   /*
+* trylock || pending
+*
+* 0,0,0 -> 0,0,1 ; trylock
+* 0,0,1 -> 0,1,1 ; pending
+*/
+   for (;;) {
+   /*
+* If we observe any contention; queue.
+*/
+   if (val & ~_Q_LOCKED_MASK)
+   goto queue;
+
+   new = _Q_LOCKED_VAL;
+   if (val == new)
+   new |= _Q_PENDING_VAL;
+
+   old = atomic_cmpxchg(&lock->val, val, new);
+   if (old == val)
+   break;
+
+   val = old;
+   }
+
+   /*
+* we won the trylock
+*/
+   if (new == _Q_LOCKED_VAL)
+   return;
+
+   /*
+* we're pending, wait for the owner to go away.
+*
+* *,1,1 -> *,1,0
+*/
+   while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
+   cpu_relax();
+
+   /*
+* take ownership and clear the pending 

[PATCH 00/11] qspinlock with paravirt support

2014-06-15 Thread Peter Zijlstra
Since Waiman seems incapable of doing simple things; here's my take on the
paravirt crap.

The first few patches are taken from Waiman's latest series, but the virt
support is completely new. Its primary aim is to not mess up the native code.

I've not stress tested it, but the virt and paravirt (kvm) cases boot on simple
smp guests. I've not done Xen, but the patch should be simple and similar.

I ripped out all the unfair nonsense as it's not at all required for paravirt,
and optimizations that make paravirt better at the cost of code clarity and/or
native performance are just not worth it.

Also; if we were to ever add some of that unfair nonsense you do so _after_ you
got the simple things working.

The thing I'm least sure about is the head tracking; I chose to do something
different from what Waiman did, because his is O(nr_cpus) and assumed that
guests have a small nr_cpus. AFAIK this is not at all true. The biggest
problem I have with what I did is that it contains wait loops itself.





Re: [PATCH v11 14/16] pvqspinlock: Add qspinlock para-virtualization support

2014-06-15 Thread Peter Zijlstra
On Thu, Jun 12, 2014 at 04:48:41PM -0400, Waiman Long wrote:
 I don't have a good understanding of the kernel alternatives mechanism.

I didn't either; I do now, cost me a whole day reading up on
alternative/paravirt code patching.

See the patches I just send out; I got the 'native' case with paravirt
enabled to be one NOP worse than the native case without paravirt -- for
queue_spin_unlock.

The lock slowpath is several nops and some pointless movs more expensive.


[PATCH 10/11] qspinlock: Paravirt support

2014-06-15 Thread Peter Zijlstra
Add minimal paravirt support.

The code aims for minimal impact on the native case.

On the lock side we add one jump label (asm_goto) and 4 paravirt
callee-saved calls that default to NOPs. The only effects are the
extra NOPs and some pointless MOVs to accommodate the calling
convention. No register spills happen because of this (x86_64).

On the unlock side we have one paravirt callee saved call, which
defaults to the actual unlock sequence: movb $0, (%rdi) and a NOP.

The actual paravirt code comes in 3 parts;

 - init_node; this initializes the extra data members required for PV
   state. PV state data is kept 1 cacheline ahead of the regular data.

 - link_and_wait_node/kick_node; these are paired with the regular MCS
   queueing and are placed resp. before/after the paired MCS ops.

 - wait_head/queue_unlock; the interesting part here is finding the
   head node to kick.

Tracking the head is done in two parts, firstly the pv_wait_head will
store its cpu number in whichever node is pointed to by the tail part
of the lock word. Secondly, pv_link_and_wait_node() will propagate the
existing head from the old to the new tail node.
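
A rough sketch of that head-tracking scheme (heavily simplified, with
approximate names -- see the real pv_wait_head()/pv_link_and_wait_node()
in the patch below for the actual implementation):

    /* extra PV state, kept one cacheline ahead of the regular MCS node */
    struct pv_node_sketch {
            int head_cpu;   /* cpu of the current queue head, or -1 */
    };

    /* queue head: publish our cpu in the node the lock's tail points to,
     * so the eventual unlocker knows whom to kick */
    static void pv_wait_head_sketch(struct pv_node_sketch *tail_node, int my_cpu)
    {
            tail_node->head_cpu = my_cpu;
    }

    /* a new tail: carry the head hint over from the old tail node so the
     * information survives tail updates */
    static void pv_link_and_wait_node_sketch(struct pv_node_sketch *old_tail,
                                             struct pv_node_sketch *new_tail)
    {
            new_tail->head_cpu = old_tail->head_cpu;
    }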

Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 arch/x86/include/asm/paravirt.h   |   39 +++
 arch/x86/include/asm/paravirt_types.h |   15 ++
 arch/x86/include/asm/qspinlock.h  |   25 
 arch/x86/kernel/paravirt-spinlocks.c  |   22 
 arch/x86/kernel/paravirt_patch_32.c   |7 +
 arch/x86/kernel/paravirt_patch_64.c   |7 +
 include/asm-generic/qspinlock.h   |   11 ++
 kernel/locking/qspinlock.c|  179 +-
 8 files changed, 302 insertions(+), 3 deletions(-)

Index: linux-2.6/arch/x86/include/asm/paravirt.h
===
--- linux-2.6.orig/arch/x86/include/asm/paravirt.h
+++ linux-2.6/arch/x86/include/asm/paravirt.h
@@ -712,6 +712,44 @@ static inline void __set_fixmap(unsigned
 
 #if defined(CONFIG_SMP)  defined(CONFIG_PARAVIRT_SPINLOCKS)
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+
+static __always_inline void pv_init_node(struct mcs_spinlock *node)
+{
+   PVOP_VCALLEE1(pv_lock_ops.init_node, node);
+}
+
+static __always_inline void pv_link_and_wait_node(u32 old, struct mcs_spinlock *node)
+{
+   PVOP_VCALLEE2(pv_lock_ops.link_and_wait_node, old, node);
+}
+
+static __always_inline void pv_kick_node(struct mcs_spinlock *node)
+{
+   PVOP_VCALLEE1(pv_lock_ops.kick_node, node);
+}
+
+static __always_inline void pv_wait_head(struct qspinlock *lock)
+{
+   PVOP_VCALLEE1(pv_lock_ops.wait_head, lock);
+}
+
+static __always_inline void pv_queue_unlock(struct qspinlock *lock)
+{
+   PVOP_VCALLEE1(pv_lock_ops.queue_unlock, lock);
+}
+
+static __always_inline void pv_wait(int *ptr, int val)
+{
+   PVOP_VCALL2(pv_lock_ops.wait, ptr, val);
+}
+
+static __always_inline void pv_kick(int cpu)
+{
+   PVOP_VCALL1(pv_lock_ops.kick, cpu);
+}
+
+#else
 static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
__ticket_t ticket)
 {
@@ -723,6 +761,7 @@ static __always_inline void __ticket_unl
 {
PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
 }
+#endif
 
 #endif
 
Index: linux-2.6/arch/x86/include/asm/paravirt_types.h
===
--- linux-2.6.orig/arch/x86/include/asm/paravirt_types.h
+++ linux-2.6/arch/x86/include/asm/paravirt_types.h
@@ -326,6 +326,9 @@ struct pv_mmu_ops {
   phys_addr_t phys, pgprot_t flags);
 };
 
+struct mcs_spinlock;
+struct qspinlock;
+
 struct arch_spinlock;
 #ifdef CONFIG_SMP
 #include <asm/spinlock_types.h>
@@ -334,8 +337,20 @@ typedef u16 __ticket_t;
 #endif
 
 struct pv_lock_ops {
+#ifdef CONFIG_QUEUE_SPINLOCK
+   struct paravirt_callee_save init_node;
+   struct paravirt_callee_save link_and_wait_node;
+   struct paravirt_callee_save kick_node;
+
+   struct paravirt_callee_save wait_head;
+   struct paravirt_callee_save queue_unlock;
+
+   void (*wait)(int *ptr, int val);
+   void (*kick)(int cpu);
+#else
struct paravirt_callee_save lock_spinning;
void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
+#endif
 };
 
 /* This contains all the paravirt structures: we get a convenient
Index: linux-2.6/arch/x86/include/asm/qspinlock.h
===
--- linux-2.6.orig/arch/x86/include/asm/qspinlock.h
+++ linux-2.6/arch/x86/include/asm/qspinlock.h
@@ -3,24 +3,45 @@
 
 #include <asm/cpufeature.h>
 #include <asm-generic/qspinlock_types.h>
+#include <asm/paravirt.h>
 
 #if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
 
-#define queue_spin_unlock queue_spin_unlock
 /**
  * queue_spin_unlock - release a queue spinlock
  * @lock : Pointer to queue spinlock structure
  *
  * An effective smp_store_release() on the least-significant byte.
  */

[PATCH 07/11] qspinlock: Use a simple write to grab the lock, if applicable

2014-06-15 Thread Peter Zijlstra
From: Waiman Long waiman.l...@hp.com

Currently, atomic_cmpxchg() is used to get the lock. However, this is
not really necessary if there is more than one task in the queue and
the queue head doesn't need to reset the queue code word. For that case,
a simple write to set the lock bit is enough as the queue head will
be the only one eligible to get the lock as long as it checks that
both the lock and pending bits are not set. The current pending bit
waiting code will ensure that the bit will not be set as soon as the
queue code word (tail) in the lock is set.

With that change, there is a slight improvement in the performance
of the queue spinlock in the 5M loop micro-benchmark run on a 4-socket
Westmere-EX machine, as shown in the tables below.

[Standalone/Embedded - same node]
  # of tasks    Before patch    After patch     %Change
  ----------    ------------    -----------     -------
       3          2324/2321      2248/2265      -3%/-2%
       4          2890/2896      2819/2831      -2%/-2%
       5          3611/3595      3522/3512      -2%/-2%
       6          4281/4276      4173/4160      -3%/-3%
       7          5018/5001      4875/4861      -3%/-3%
       8          5759/5750      5563/5568      -3%/-3%

[Standalone/Embedded - different nodes]
  # of tasks    Before patch    After patch     %Change
  ----------    ------------    -----------     -------
       3         12242/12237    12087/12093     -1%/-1%
       4         10688/10696    10507/10521     -2%/-2%

It was also found that this change produced a much bigger performance
improvement in the newer IvyBridge-EX chip and essentially closed
the performance gap between the ticket spinlock and the queue spinlock.

The disk workload of the AIM7 benchmark was run on a 4-socket
Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users
on a 3.14 based kernel. The results of the test runs were:

AIM7 XFS Disk Test
  kernel        JPM        Real Time    Sys Time    Usr Time
  ------        ---        ---------    --------    --------
  ticketlock    5678233      3.17         96.61       5.81
  qspinlock     5750799      3.13         94.83       5.97

AIM7 EXT4 Disk Test
  kernel        JPM        Real Time    Sys Time    Usr Time
  ------        ---        ---------    --------    --------
  ticketlock    1114551     16.15        509.72       7.11
  qspinlock     2184466      8.24        232.99       6.01

The ext4 filesystem run had a much higher spinlock contention than
the xfs filesystem run.

The ebizzy -m test was also run with the following results:

  kernel       records/s   Real Time    Sys Time    Usr Time
  ------       ---------   ---------    --------    --------
  ticketlock     2075        10.00       216.35       3.49
  qspinlock      3023        10.00       198.20       4.80

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 kernel/locking/qspinlock.c |   59 -
 1 file changed, 43 insertions(+), 16 deletions(-)

--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -93,24 +93,33 @@ static inline struct mcs_spinlock *decod
  * By using the whole 2nd least significant byte for the pending bit, we
  * can allow better optimization of the lock acquisition for the pending
  * bit holder.
+ *
+ * This internal structure is also used by the set_locked function which
+ * is not restricted to _Q_PENDING_BITS == 8.
  */
-#if _Q_PENDING_BITS == 8
-
 struct __qspinlock {
union {
atomic_t val;
-   struct {
 #ifdef __LITTLE_ENDIAN
+   u8   locked;
+   struct {
u16 locked_pending;
u16 tail;
+   };
 #else
+   struct {
u16 tail;
u16 locked_pending;
-#endif
};
+   struct {
+   u8  reserved[3];
+   u8  locked;
+   };
+#endif
};
 };
 
+#if _Q_PENDING_BITS == 8
 /**
  * clear_pending_set_locked - take ownership and clear the pending bit.
  * @lock: Pointer to queue spinlock structure
@@ -197,6 +206,19 @@ static __always_inline u32 xchg_tail(str
 #endif /* _Q_PENDING_BITS == 8 */
 
 /**
+ * set_locked - Set the lock bit and own the lock
+ * @lock: Pointer to queue spinlock structure
+ *
+ * *,*,0 -> *,0,1
+ */
+static __always_inline void set_locked(struct qspinlock *lock)
+{
+   struct __qspinlock *l = (void *)lock;
+
+   ACCESS_ONCE(l->locked) = _Q_LOCKED_VAL;
+}
+
+/**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue spinlock 32-bit word
@@ -328,10 +350,13 @@ void queue_spin_lock_slowpath(struct qsp
/*
   

[PATCH 05/11] qspinlock: Optimize for smaller NR_CPUS

2014-06-15 Thread Peter Zijlstra
From: Peter Zijlstra pet...@infradead.org

When we allow for a max NR_CPUS < 2^14 we can optimize the pending
wait-acquire and the xchg_tail() operations.

By growing the pending bit to a byte, we reduce the tail to 16 bits.
This means we can use xchg16 for the tail part and do away with all
the repeated cmpxchg() operations.

This in turn allows us to unconditionally acquire; the locked state
as observed by the wait loops cannot change. And because both locked
and pending are now a full byte we can use simple stores for the
state transition, obviating one atomic operation entirely.

All this is horribly broken on Alpha pre EV56 (and any other arch that
cannot do single-copy atomic byte stores).

Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 include/asm-generic/qspinlock_types.h |   13 
 kernel/locking/qspinlock.c|  103 ++
 2 files changed, 106 insertions(+), 10 deletions(-)

--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -38,6 +38,14 @@ typedef struct qspinlock {
 /*
  * Bitfields in the atomic value:
  *
+ * When NR_CPUS < 16K
+ *  0- 7: locked byte
+ * 8: pending
+ *  9-15: not used
+ * 16-17: tail index
+ * 18-31: tail cpu (+1)
+ *
+ * When NR_CPUS >= 16K
  *  0- 7: locked byte
  * 8: pending
  *  9-10: tail index
@@ -50,7 +58,11 @@ typedef struct qspinlock {
 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED)
 
 #define _Q_PENDING_OFFSET  (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#if CONFIG_NR_CPUS < (1U << 14)
+#define _Q_PENDING_BITS8
+#else
 #define _Q_PENDING_BITS1
+#endif
 #define _Q_PENDING_MASK_Q_SET_MASK(PENDING)
 
 #define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
@@ -61,6 +73,7 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_BITS   (32 - _Q_TAIL_CPU_OFFSET)
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
+#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET
 #define _Q_TAIL_MASK   (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
 
 #define _Q_LOCKED_VAL  (1U  _Q_LOCKED_OFFSET)
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -22,6 +22,7 @@
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
 #include <linux/mutex.h>
+#include <asm/byteorder.h>
 #include <asm/qspinlock.h>
 
 /*
@@ -48,6 +49,9 @@
  * We can further change the first spinner to spin on a bit in the lock word
  * instead of its node; whereby avoiding the need to carry a node from lock to
  * unlock, and preserving API.
+ *
+ * N.B. The current implementation only supports architectures that allow
+ *  atomic operations on smaller 8-bit and 16-bit data types.
  */
 
 #include "mcs_spinlock.h"
@@ -85,6 +89,87 @@ static inline struct mcs_spinlock *decod
 
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
 
+/*
+ * By using the whole 2nd least significant byte for the pending bit, we
+ * can allow better optimization of the lock acquisition for the pending
+ * bit holder.
+ */
+#if _Q_PENDING_BITS == 8
+
+struct __qspinlock {
+   union {
+   atomic_t val;
+   struct {
+#ifdef __LITTLE_ENDIAN
+   u16 locked_pending;
+   u16 tail;
+#else
+   u16 tail;
+   u16 locked_pending;
+#endif
+   };
+   };
+};
+
+/**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queue spinlock structure
+ * @val : Current value of the queue spinlock 32-bit word
+ *
+ * *,1,0 -> *,0,1
+ *
+ * Lock stealing is not allowed if this function is used.
+ */
+static __always_inline void
+clear_pending_set_locked(struct qspinlock *lock, u32 val)
+{
+   struct __qspinlock *l = (void *)lock;
+
+   ACCESS_ONCE(l->locked_pending) = _Q_LOCKED_VAL;
+}
+
+/*
+ * xchg_tail - Put in the new queue tail code word  retrieve previous one
+ * @lock : Pointer to queue spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail)
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+   struct __qspinlock *l = (void *)lock;
+
+   return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+}
+
+#else /* _Q_PENDING_BITS == 8 */
+
+/**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queue spinlock structure
+ * @val : Current value of the queue spinlock 32-bit word
+ *
+ * *,1,0 -> *,0,1
+ */
+static __always_inline void
+clear_pending_set_locked(struct qspinlock *lock, u32 val)
+{
+   u32 new, old;
+
+   for (;;) {
+   new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
+
+   old = atomic_cmpxchg(&lock->val, val, new);
+   if (old == val)
+   break;
+
+   val = old;
+   }
+}
+
 /**
  * xchg_tail - Put in the new 

[PATCH 01/11] qspinlock: A simple generic 4-byte queue spinlock

2014-06-15 Thread Peter Zijlstra
From: Waiman Long waiman.l...@hp.com

This patch introduces a new generic queue spinlock implementation that
can serve as an alternative to the default ticket spinlock. Compared
with the ticket spinlock, this queue spinlock should be almost as fair
as the ticket spinlock. It has about the same speed in single-thread
and it can be much faster in high contention situations especially when
the spinlock is embedded within the data structure to be protected.

Only in light to moderate contention where the average queue depth
is around 1-3 will this queue spinlock be potentially a bit slower
due to the higher slowpath overhead.

This queue spinlock is especially suited to NUMA machines with a large
number of cores, as the chance of spinlock contention is much higher
on those machines. The cost of contention is also higher because of
slower inter-node memory traffic.

Due to the fact that spinlocks are acquired with preemption disabled,
the process will not be migrated to another CPU while it is trying
to get a spinlock. Ignoring interrupt handling, a CPU can only be
contending in one spinlock at any one time. Counting soft IRQ, hard
IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting
activities.  By allocating a set of per-cpu queue nodes and using them
to form a waiting queue, we can encode the queue node address into a
much smaller 24-bit value (including CPU number and queue node index),
leaving one byte for the lock.

Please note that the queue node is only needed when waiting for the
lock. Once the lock is acquired, the queue node can be released to
be used later.
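
The 24-bit encoding described above works out to roughly the following
(the patch provides equivalent encode_tail()/decode_tail() helpers in
kernel/locking/qspinlock.c; the exact shifts and masks come from
asm-generic/qspinlock_types.h):

    /* per-cpu MCS nodes, one per context (task/softirq/hardirq/nmi) */
    static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]);

    static inline u32 encode_tail(int cpu, int idx)
    {
            u32 tail;

            tail  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;  /* 0 means "no tail" */
            tail |= idx << _Q_TAIL_IDX_OFFSET;        /* nesting level 0-3 */

            return tail;
    }

    static inline struct mcs_spinlock *decode_tail(u32 tail)
    {
            int cpu = ((tail & _Q_TAIL_CPU_MASK) >> _Q_TAIL_CPU_OFFSET) - 1;
            int idx = (tail & _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;

            return per_cpu_ptr(&mcs_nodes[idx], cpu);
    }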

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
---
 include/asm-generic/qspinlock.h   |  118 
 include/asm-generic/qspinlock_types.h |   61 ++
 kernel/Kconfig.locks  |7 +
 kernel/locking/Makefile   |1 
 kernel/locking/mcs_spinlock.h |1 
 kernel/locking/qspinlock.c|  197 ++
 6 files changed, 385 insertions(+)
 create mode 100644 include/asm-generic/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock_types.h
 create mode 100644 kernel/locking/qspinlock.c

Index: linux-2.6/include/asm-generic/qspinlock.h
===
--- /dev/null
+++ linux-2.6/include/asm-generic/qspinlock.h
@@ -0,0 +1,118 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long waiman.l...@hp.com
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+   return atomic_read(&lock->val);
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ *
+ * N.B. Whenever there are tasks waiting for the lock, it is considered
+ *  locked wrt the lockref code to avoid lock stealing by the lockref
+ *  code and change things underneath the lock. This also allows some
+ *  optimizations to be applied without conflict with lockref.
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+   return !atomic_read(&lock.val);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+   return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+   if (!atomic_read(&lock->val) &&
+      (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0))
+   return 1;
+   return 0;
+}
+
+extern void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)

[PATCH 5/5] KVM: nVMX: Fix returned value of MSR_IA32_VMX_VMCS_ENUM

2014-06-15 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

Many real CPUs get this wrong as well, but ours is totally off: bits 9:1
define the highest index value.
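
For reference, this is how the value decodes (an assumption based on the
SDM definition of IA32_VMX_VMCS_ENUM, shown here only to illustrate the
encoding):

    #include <stdio.h>

    int main(void)
    {
            unsigned long vmcs_enum = 0x2e;
            unsigned int highest_index = (vmcs_enum >> 1) & 0x1ff;

            /* 0x2e >> 1 == 0x17, the index field of the 0x482E
             * (VMX_PREEMPTION_TIMER_VALUE) VMCS encoding */
            printf("highest index: 0x%x\n", highest_index);
            return 0;
    }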

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 arch/x86/kvm/vmx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index b31e9f1..2aeb8ac 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2461,7 +2461,7 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
*pdata = -1ULL;
break;
case MSR_IA32_VMX_VMCS_ENUM:
-   *pdata = 0x1f;
+   *pdata = 0x2e; /* highest index: VMX_PREEMPTION_TIMER_VALUE */
break;
case MSR_IA32_VMX_PROCBASED_CTLS2:
*pdata = vmx_control_msr(nested_vmx_secondary_ctls_low,
-- 
1.8.1.1.298.ge7eed54



[PATCH 0/5] KVM: nVMX: Small fixes improving emulation accuracy

2014-06-15 Thread Jan Kiszka
Nothing critical, but it further improves emulation accuracy,
specifically helpful when analyzing guest bugs...

Corresponding kvm-unit-tests will be provided.

Jan Kiszka (5):
  KVM: nVMX: Fix returned value of MSR_IA32_VMX_PROCBASED_CTLS
  KVM: nVMX: Advertise support for MSR_IA32_VMX_TRUE_*_CTLS
  KVM: nVMX: Allow to disable CR3 access interception
  KVM: nVMX: Allow to disable VM_{ENTRY_LOAD,EXIT_SAVE}_DEBUG_CONTROLS
  KVM: nVMX: Fix returned value of MSR_IA32_VMX_VMCS_ENUM

 arch/x86/include/asm/vmx.h|  3 ++
 arch/x86/include/uapi/asm/msr-index.h |  1 +
 arch/x86/kvm/vmx.c| 75 +--
 3 files changed, 58 insertions(+), 21 deletions(-)

-- 
1.8.1.1.298.ge7eed54



[PATCH 4/5] KVM: nVMX: Allow to disable VM_{ENTRY_LOAD,EXIT_SAVE}_DEBUG_CONTROLS

2014-06-15 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

Allow L1 to leak its debug controls into L2, i.e. permit cleared
VM_{ENTRY_LOAD,EXIT_SAVE}_DEBUG_CONTROLS. This requires manually
transferring the state of DR7 and IA32_DEBUGCTLMSR from L1 into L2, as
both run on different VMCSs.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 arch/x86/kvm/vmx.c | 44 ++--
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 475f2dc..b31e9f1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -383,6 +383,9 @@ struct nested_vmx {
 
struct hrtimer preemption_timer;
bool preemption_timer_expired;
+
+   /* to migrate it to L2 if VM_ENTRY_LOAD_DEBUG_CONTROLS is off */
+   u64 host_debugctl;
 };
 
 #define POSTED_INTR_ON  0
@@ -2243,7 +2246,9 @@ static u32 nested_vmx_true_procbased_ctls_low;
 static u32 nested_vmx_secondary_ctls_low, nested_vmx_secondary_ctls_high;
 static u32 nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high;
 static u32 nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high;
+static u32 nested_vmx_true_exit_ctls_low;
 static u32 nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high;
+static u32 nested_vmx_true_entry_ctls_low;
 static u32 nested_vmx_misc_low, nested_vmx_misc_high;
 static u32 nested_vmx_ept_caps;
 static __init void nested_vmx_setup_ctls_msrs(void)
@@ -2289,6 +2294,10 @@ static __init void nested_vmx_setup_ctls_msrs(void)
if (vmx_mpx_supported())
nested_vmx_exit_ctls_high |= VM_EXIT_CLEAR_BNDCFGS;
 
+   /* We support free control of debug control saving. */
+   nested_vmx_true_exit_ctls_low = nested_vmx_exit_ctls_low &
+   ~VM_EXIT_SAVE_DEBUG_CONTROLS;
+
/* entry controls */
rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high);
@@ -2303,6 +2312,10 @@ static __init void nested_vmx_setup_ctls_msrs(void)
if (vmx_mpx_supported())
nested_vmx_entry_ctls_high |= VM_ENTRY_LOAD_BNDCFGS;
 
+   /* We support free control of debug control loading. */
+   nested_vmx_true_entry_ctls_low = nested_vmx_entry_ctls_low 
+   ~VM_ENTRY_LOAD_DEBUG_CONTROLS;
+
/* cpu-based controls */
rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
@@ -2409,11 +2422,17 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
nested_vmx_procbased_ctls_high);
break;
case MSR_IA32_VMX_TRUE_EXIT_CTLS:
+   *pdata = vmx_control_msr(nested_vmx_true_exit_ctls_low,
+   nested_vmx_exit_ctls_high);
+   break;
case MSR_IA32_VMX_EXIT_CTLS:
*pdata = vmx_control_msr(nested_vmx_exit_ctls_low,
nested_vmx_exit_ctls_high);
break;
case MSR_IA32_VMX_TRUE_ENTRY_CTLS:
+   *pdata = vmx_control_msr(nested_vmx_true_entry_ctls_low,
+   nested_vmx_entry_ctls_high);
+   break;
case MSR_IA32_VMX_ENTRY_CTLS:
*pdata = vmx_control_msr(nested_vmx_entry_ctls_low,
nested_vmx_entry_ctls_high);
@@ -7836,7 +7855,13 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
vmcs_writel(GUEST_GDTR_BASE, vmcs12-guest_gdtr_base);
vmcs_writel(GUEST_IDTR_BASE, vmcs12-guest_idtr_base);
 
-   vmcs_write64(GUEST_IA32_DEBUGCTL, vmcs12-guest_ia32_debugctl);
+   if (vmcs12-vm_entry_controls  VM_ENTRY_LOAD_DEBUG_CONTROLS) {
+   kvm_set_dr(vcpu, 7, vmcs12-guest_dr7);
+   vmcs_write64(GUEST_IA32_DEBUGCTL, vmcs12-guest_ia32_debugctl);
+   } else {
+   kvm_set_dr(vcpu, 7, vcpu-arch.dr7);
+   vmcs_write64(GUEST_IA32_DEBUGCTL, vmx-nested.host_debugctl);
+   }
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
vmcs12-vm_entry_intr_info_field);
vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
@@ -7846,7 +7871,6 @@ static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
vmcs_write32(GUEST_INTERRUPTIBILITY_INFO,
vmcs12-guest_interruptibility_info);
vmcs_write32(GUEST_SYSENTER_CS, vmcs12-guest_sysenter_cs);
-   kvm_set_dr(vcpu, 7, vmcs12-guest_dr7);
vmx_set_rflags(vcpu, vmcs12-guest_rflags);
vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS,
vmcs12-guest_pending_dbg_exceptions);
@@ -8143,9 +8167,11 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
!vmx_control_verify(vmcs12-pin_based_vm_exec_control,
  nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high) ||
!vmx_control_verify(vmcs12-vm_exit_controls,
- 

[PATCH 3/5] KVM: nVMX: Allow to disable CR3 access interception

2014-06-15 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

We already had this control enabled by exposing the broken
MSR_IA32_VMX_PROCBASED_CTLS value. This now advertises our capability by
clearing the right bits in MSR_IA32_VMX_TRUE_PROCBASED_CTLS. We also
have to make sure we test the right value on L2 entry.
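
The value check on L2 entry follows the usual VMX capability rule: a
32-bit control field is consistent iff every allowed-0 (must-be-one) bit
is set and no bit outside the allowed-1 mask is set. A standalone
restatement of that rule (illustration only, not the kernel's exact
vmx_control_verify()):

#include <stdbool.h>
#include <stdint.h>

/* low  = allowed-0 settings (bits that must be 1)
 * high = allowed-1 settings (bits that may be 1) */
static bool control_value_ok(uint32_t control, uint32_t low, uint32_t high)
{
        return (control & low) == low && (control & ~high) == 0;
}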

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 arch/x86/kvm/vmx.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index cb43883..475f2dc 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2239,6 +2239,7 @@ static inline bool nested_vmx_allowed(struct kvm_vcpu *vcpu)
  * or other means.
  */
 static u32 nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high;
+static u32 nested_vmx_true_procbased_ctls_low;
 static u32 nested_vmx_secondary_ctls_low, nested_vmx_secondary_ctls_high;
 static u32 nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high;
 static u32 nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high;
@@ -2329,6 +2330,10 @@ static __init void nested_vmx_setup_ctls_msrs(void)
nested_vmx_procbased_ctls_high |= CPU_BASED_ALWAYSON_WITHOUT_TRUE_MSR |
CPU_BASED_USE_MSR_BITMAPS;
 
+   /* We support free control of CR3 access interception. */
+   nested_vmx_true_procbased_ctls_low = nested_vmx_procbased_ctls_low 
+   ~(CPU_BASED_CR3_LOAD_EXITING | CPU_BASED_CR3_STORE_EXITING);
+
/* secondary cpu-based controls */
rdmsr(MSR_IA32_VMX_PROCBASED_CTLS2,
nested_vmx_secondary_ctls_low, nested_vmx_secondary_ctls_high);
@@ -2396,6 +2401,9 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
nested_vmx_pinbased_ctls_high);
break;
case MSR_IA32_VMX_TRUE_PROCBASED_CTLS:
+   *pdata = vmx_control_msr(nested_vmx_true_procbased_ctls_low,
+   nested_vmx_procbased_ctls_high);
+   break;
case MSR_IA32_VMX_PROCBASED_CTLS:
*pdata = vmx_control_msr(nested_vmx_procbased_ctls_low,
nested_vmx_procbased_ctls_high);
@@ -8128,7 +8136,8 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
}
 
if (!vmx_control_verify(vmcs12-cpu_based_vm_exec_control,
- nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high) ||
+   nested_vmx_true_procbased_ctls_low,
+   nested_vmx_procbased_ctls_high) ||
!vmx_control_verify(vmcs12-secondary_vm_exec_control,
  nested_vmx_secondary_ctls_low, nested_vmx_secondary_ctls_high) ||
!vmx_control_verify(vmcs12-pin_based_vm_exec_control,
-- 
1.8.1.1.298.ge7eed54



[PATCH 2/5] KVM: nVMX: Advertise support for MSR_IA32_VMX_TRUE_*_CTLS

2014-06-15 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

We already implemented them but failed to advertise them. Currently they
all return values identical to the capability MSRs they are augmenting,
so there is no change in exposed features yet.

While at it, drop related comments that are partially incorrect and
redundant anyway.
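
A nested hypervisor discovers the TRUE_* variants through bit 55 of
IA32_VMX_BASIC, i.e. the VMX_BASIC_TRUE_CTLS bit added below. As a
standalone illustration of the selection logic (raw MSR values passed in,
not KVM code):

#include <stdint.h>

#define VMX_BASIC_TRUE_CTLS    (1ULL << 55)

/* If VMX_BASIC[55] is set, the TRUE capability MSR reports the real
 * allowed-0 settings; otherwise the default1 bits read as fixed to 1. */
static uint64_t effective_entry_ctls_caps(uint64_t vmx_basic,
                                          uint64_t entry_ctls,
                                          uint64_t true_entry_ctls)
{
        return (vmx_basic & VMX_BASIC_TRUE_CTLS) ? true_entry_ctls
                                                 : entry_ctls;
}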

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 arch/x86/include/uapi/asm/msr-index.h |  1 +
 arch/x86/kvm/vmx.c| 13 ++---
 2 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/uapi/asm/msr-index.h b/arch/x86/include/uapi/asm/msr-index.h
index fcf2b3a..579f62a 100644
--- a/arch/x86/include/uapi/asm/msr-index.h
+++ b/arch/x86/include/uapi/asm/msr-index.h
@@ -558,6 +558,7 @@
 
 /* VMX_BASIC bits and bitmasks */
 #define VMX_BASIC_VMCS_SIZE_SHIFT  32
+#define VMX_BASIC_TRUE_CTLS0x0080LLU
 #define VMX_BASIC_64   0x0001LLU
 #define VMX_BASIC_MEM_TYPE_SHIFT   50
 #define VMX_BASIC_MEM_TYPE_MASK0x003cLLU
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index b38e03a..cb43883 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2265,21 +2265,13 @@ static __init void nested_vmx_setup_ctls_msrs(void)
/* pin-based controls */
rdmsr(MSR_IA32_VMX_PINBASED_CTLS,
  nested_vmx_pinbased_ctls_low, nested_vmx_pinbased_ctls_high);
-   /*
-* According to the Intel spec, if bit 55 of VMX_BASIC is off (as it is
-* in our case), bits 1, 2 and 4 (i.e., 0x16) must be 1 in this MSR.
-*/
nested_vmx_pinbased_ctls_low |= PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR;
nested_vmx_pinbased_ctls_high = PIN_BASED_EXT_INTR_MASK |
PIN_BASED_NMI_EXITING | PIN_BASED_VIRTUAL_NMIS;
nested_vmx_pinbased_ctls_high |= PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR |
PIN_BASED_VMX_PREEMPTION_TIMER;
 
-   /*
-* Exit controls
-* If bit 55 of VMX_BASIC is off, bits 0-8 and 10, 11, 13, 14, 16 and
-* 17 must be 1.
-*/
+   /* exit controls */
rdmsr(MSR_IA32_VMX_EXIT_CTLS,
nested_vmx_exit_ctls_low, nested_vmx_exit_ctls_high);
nested_vmx_exit_ctls_low = VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
@@ -2299,7 +2291,6 @@ static __init void nested_vmx_setup_ctls_msrs(void)
/* entry controls */
rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
nested_vmx_entry_ctls_low, nested_vmx_entry_ctls_high);
-   /* If bit 55 of VMX_BASIC is off, bits 0-8 and 12 must be 1. */
nested_vmx_entry_ctls_low = VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;
nested_vmx_entry_ctls_high =
 #ifdef CONFIG_X86_64
@@ -2395,7 +2386,7 @@ static int vmx_get_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
 * guest, and the VMCS structure we give it - not about the
 * VMX support of the underlying hardware.
 */
-   *pdata = VMCS12_REVISION |
+   *pdata = VMCS12_REVISION | VMX_BASIC_TRUE_CTLS |
   ((u64)VMCS12_SIZE  VMX_BASIC_VMCS_SIZE_SHIFT) |
   (VMX_BASIC_MEM_TYPE_WB  VMX_BASIC_MEM_TYPE_SHIFT);
break;
-- 
1.8.1.1.298.ge7eed54



[PATCH 1/5] KVM: nVMX: Fix returned value of MSR_IA32_VMX_PROCBASED_CTLS

2014-06-15 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

The SDM says bits 1, 4-6, 8, 13-16, and 26 have to be set. Fixing this
temporarily revokes L1's ability to control CR3 interceptions.
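
Those bit positions are exactly what the new
CPU_BASED_ALWAYSON_WITHOUT_TRUE_MSR constant encodes; a quick standalone
check of the arithmetic:

#include <assert.h>
#include <stdint.h>

int main(void)
{
        /* bits 1, 4-6, 8, 13-16 and 26, as listed in the SDM */
        uint32_t mask = (1u << 1) | (1u << 4) | (1u << 5) | (1u << 6) |
                        (1u << 8) | (1u << 13) | (1u << 14) | (1u << 15) |
                        (1u << 16) | (1u << 26);

        assert(mask == 0x0401e172); /* CPU_BASED_ALWAYSON_WITHOUT_TRUE_MSR */
        return 0;
}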

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 arch/x86/include/asm/vmx.h | 3 +++
 arch/x86/kvm/vmx.c | 5 +++--
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index d989829..bcbfade 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -51,6 +51,9 @@
 #define CPU_BASED_MONITOR_EXITING   0x2000
 #define CPU_BASED_PAUSE_EXITING 0x4000
 #define CPU_BASED_ACTIVATE_SECONDARY_CONTROLS   0x8000
+
+#define CPU_BASED_ALWAYSON_WITHOUT_TRUE_MSR0x0401e172
+
 /*
  * Definitions of Secondary Processor-Based VM-Execution Controls.
  */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 801332e..b38e03a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2314,7 +2314,7 @@ static __init void nested_vmx_setup_ctls_msrs(void)
/* cpu-based controls */
rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
nested_vmx_procbased_ctls_low, nested_vmx_procbased_ctls_high);
-   nested_vmx_procbased_ctls_low = 0;
+   nested_vmx_procbased_ctls_low = CPU_BASED_ALWAYSON_WITHOUT_TRUE_MSR;
nested_vmx_procbased_ctls_high =
CPU_BASED_VIRTUAL_INTR_PENDING |
CPU_BASED_VIRTUAL_NMI_PENDING | CPU_BASED_USE_TSC_OFFSETING |
@@ -2335,7 +2335,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 * can use it to avoid exits to L1 - even when L0 runs L2
 * without MSR bitmaps.
 */
-   nested_vmx_procbased_ctls_high |= CPU_BASED_USE_MSR_BITMAPS;
+   nested_vmx_procbased_ctls_high |= CPU_BASED_ALWAYSON_WITHOUT_TRUE_MSR |
+   CPU_BASED_USE_MSR_BITMAPS;
 
/* secondary cpu-based controls */
rdmsr(MSR_IA32_VMX_PROCBASED_CTLS2,
-- 
1.8.1.1.298.ge7eed54



[PATCH 1/5] VMX: Add tests for CR3 and CR8 interception

2014-06-15 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

We need to fix the FIELD_* constants for this to make the exit
qualification check work.
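
The expected exit_qual values added below follow the SDM encoding of the
exit qualification for control-register accesses: bits 3:0 hold the CR
number and bits 5:4 the access type (0 = MOV to CR, 1 = MOV from CR). A
standalone sketch of that encoding (illustration only):

#include <assert.h>
#include <stdint.h>

/* Exit qualification for CR accesses: bits 3:0 = CR number,
 * bits 5:4 = access type (0 = MOV to CR, 1 = MOV from CR). */
static uint32_t cr_exit_qual(unsigned int cr, unsigned int access_type)
{
        return (cr & 0xf) | ((access_type & 0x3) << 4);
}

int main(void)
{
        assert(cr_exit_qual(3, 0) == 0x3);      /* CR3 load  */
        assert(cr_exit_qual(3, 1) == 0x13);     /* CR3 store */
        assert(cr_exit_qual(8, 0) == 0x8);      /* CR8 load  */
        assert(cr_exit_qual(8, 1) == 0x18);     /* CR8 store */
        return 0;
}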

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 x86/vmx.h   |  2 ++
 x86/vmx_tests.c | 32 +---
 2 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/x86/vmx.h b/x86/vmx.h
index 26dd161..69a5385 100644
--- a/x86/vmx.h
+++ b/x86/vmx.h
@@ -357,6 +357,8 @@ enum Ctrl0 {
CPU_RDTSC   = 1ul  12,
CPU_CR3_LOAD= 1ul  15,
CPU_CR3_STORE   = 1ul  16,
+   CPU_CR8_LOAD= 1ul  19,
+   CPU_CR8_STORE   = 1ul  20,
CPU_TPR_SHADOW  = 1ul  21,
CPU_NMI_WINDOW  = 1ul  22,
CPU_IO  = 1ul  24,
diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c
index a40cb18..d0ce365 100644
--- a/x86/vmx_tests.c
+++ b/x86/vmx_tests.c
@@ -820,8 +820,8 @@ static int iobmp_exit_handler()
 #define INSN_ALWAYS_TRAP   2
 #define INSN_NEVER_TRAP3
 
-#define FIELD_EXIT_QUAL0
-#define FIELD_INSN_INFO1
+#define FIELD_EXIT_QUAL0x1
+#define FIELD_INSN_INFO0x2
 
 asm(
insn_hlt: hlt;ret\n\t
@@ -829,6 +829,12 @@ asm(
insn_mwait: mwait;ret\n\t
insn_rdpmc: rdpmc;ret\n\t
insn_rdtsc: rdtsc;ret\n\t
+   insn_cr3_load: mov %rax,%cr3;ret\n\t
+   insn_cr3_store: mov %cr3,%rax;ret\n\t
+#ifdef __x86_64__
+   insn_cr8_load: mov %rax,%cr8;ret\n\t
+   insn_cr8_store: mov %cr8,%rax;ret\n\t
+#endif
insn_monitor: monitor;ret\n\t
insn_pause: pause;ret\n\t
insn_wbinvd: wbinvd;ret\n\t
@@ -840,6 +846,12 @@ extern void insn_invlpg();
 extern void insn_mwait();
 extern void insn_rdpmc();
 extern void insn_rdtsc();
+extern void insn_cr3_load();
+extern void insn_cr3_store();
+#ifdef __x86_64__
+extern void insn_cr8_load();
+extern void insn_cr8_store();
+#endif
 extern void insn_monitor();
 extern void insn_pause();
 extern void insn_wbinvd();
@@ -856,7 +868,7 @@ struct insn_table {
u32 reason;
ulong exit_qual;
u32 insn_info;
-   // Use FIELD_EXIT_QUAL and FIELD_INSN_INFO to efines
+   // Use FIELD_EXIT_QUAL and FIELD_INSN_INFO to define
// which field need to be tested, reason is always tested
u32 test_field;
 };
@@ -877,6 +889,16 @@ static struct insn_table insn_table[] = {
{MWAIT, CPU_MWAIT, insn_mwait, INSN_CPU0, 36, 0, 0, 0},
{RDPMC, CPU_RDPMC, insn_rdpmc, INSN_CPU0, 15, 0, 0, 0},
{RDTSC, CPU_RDTSC, insn_rdtsc, INSN_CPU0, 16, 0, 0, 0},
+   {CR3 load, CPU_CR3_LOAD, insn_cr3_load, INSN_CPU0, 28, 0x3, 0,
+   FIELD_EXIT_QUAL},
+   {CR3 store, CPU_CR3_STORE, insn_cr3_store, INSN_CPU0, 28, 0x13, 0,
+   FIELD_EXIT_QUAL},
+#ifdef __x86_64__
+   {CR8 load, CPU_CR8_LOAD, insn_cr8_load, INSN_CPU0, 28, 0x8, 0,
+   FIELD_EXIT_QUAL},
+   {CR8 store, CPU_CR8_STORE, insn_cr8_store, INSN_CPU0, 28, 0x18, 0,
+   FIELD_EXIT_QUAL},
+#endif
{MONITOR, CPU_MONITOR, insn_monitor, INSN_CPU0, 39, 0, 0, 0},
{PAUSE, CPU_PAUSE, insn_pause, INSN_CPU0, 40, 0, 0, 0},
// Flags for Secondary Processor-Based VM-Execution Controls
@@ -894,6 +916,10 @@ static int insn_intercept_init()
 
ctrl_cpu[0] = vmcs_read(CPU_EXEC_CTRL0);
ctrl_cpu[0] |= CPU_HLT | CPU_INVLPG | CPU_MWAIT | CPU_RDPMC | CPU_RDTSC |
+   CPU_CR3_LOAD | CPU_CR3_STORE |
+#ifdef __x86_64__
+   CPU_CR8_LOAD | CPU_CR8_STORE |
+#endif
CPU_MONITOR | CPU_PAUSE | CPU_SECONDARY;
ctrl_cpu[0] = ctrl_cpu_rev[0].clr;
vmcs_write(CPU_EXEC_CTRL0, ctrl_cpu[0]);
-- 
1.8.1.1.298.ge7eed54



[PATCH 2/5] VMX: Only use get_stage accessor

2014-06-15 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

Consistently make sure we are not affected by any compiler reordering
when evaluating the current stage.
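
The accessors themselves are outside this hunk; as a rough sketch of the
idea (an assumption about the implementation, not the test suite's actual
code), routing every access through a volatile pointer keeps the compiler
from caching or reordering the stage value across the vmcalls:

#include <stdint.h>

static uint32_t stage;  /* shared between guest code and the exit handler */

static inline void set_stage(uint32_t s)
{
        *(volatile uint32_t *)&stage = s;       /* force a real store */
}

static inline uint32_t get_stage(void)
{
        return *(volatile uint32_t *)&stage;    /* force a real load */
}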

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 x86/vmx_tests.c | 80 -
 1 file changed, 40 insertions(+), 40 deletions(-)

diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c
index d0ce365..bf7aa2c 100644
--- a/x86/vmx_tests.c
+++ b/x86/vmx_tests.c
@@ -415,13 +415,13 @@ static void cr_shadowing_main()
// Test read through
set_stage(0);
guest_cr0 = read_cr0();
-   if (stage == 1)
+   if (get_stage() == 1)
report(Read through CR0, 0);
else
vmcall();
set_stage(1);
guest_cr4 = read_cr4();
-   if (stage == 2)
+   if (get_stage() == 2)
report(Read through CR4, 0);
else
vmcall();
@@ -430,13 +430,13 @@ static void cr_shadowing_main()
guest_cr4 = guest_cr4 ^ (X86_CR4_TSD | X86_CR4_DE);
set_stage(2);
write_cr0(guest_cr0);
-   if (stage == 3)
+   if (get_stage() == 3)
report(Write throuth CR0, 0);
else
vmcall();
set_stage(3);
write_cr4(guest_cr4);
-   if (stage == 4)
+   if (get_stage() == 4)
report(Write through CR4, 0);
else
vmcall();
@@ -444,7 +444,7 @@ static void cr_shadowing_main()
set_stage(4);
vmcall();
cr0 = read_cr0();
-   if (stage != 5) {
+   if (get_stage() != 5) {
if (cr0 == guest_cr0)
report(Read shadowing CR0, 1);
else
@@ -452,7 +452,7 @@ static void cr_shadowing_main()
}
set_stage(5);
cr4 = read_cr4();
-   if (stage != 6) {
+   if (get_stage() != 6) {
if (cr4 == guest_cr4)
report(Read shadowing CR4, 1);
else
@@ -461,13 +461,13 @@ static void cr_shadowing_main()
// Test write shadow (same value with shadow)
set_stage(6);
write_cr0(guest_cr0);
-   if (stage == 7)
+   if (get_stage() == 7)
report(Write shadowing CR0 (same value with shadow), 0);
else
vmcall();
set_stage(7);
write_cr4(guest_cr4);
-   if (stage == 8)
+   if (get_stage() == 8)
report(Write shadowing CR4 (same value with shadow), 0);
else
vmcall();
@@ -478,7 +478,7 @@ static void cr_shadowing_main()
mov %%rsi, %%cr0\n\t
::m(tmp)
:rsi, memory, cc);
-   if (stage != 9)
+   if (get_stage() != 9)
report(Write shadowing different X86_CR0_TS, 0);
else
report(Write shadowing different X86_CR0_TS, 1);
@@ -488,7 +488,7 @@ static void cr_shadowing_main()
mov %%rsi, %%cr0\n\t
::m(tmp)
:rsi, memory, cc);
-   if (stage != 10)
+   if (get_stage() != 10)
report(Write shadowing different X86_CR0_MP, 0);
else
report(Write shadowing different X86_CR0_MP, 1);
@@ -498,7 +498,7 @@ static void cr_shadowing_main()
mov %%rsi, %%cr4\n\t
::m(tmp)
:rsi, memory, cc);
-   if (stage != 11)
+   if (get_stage() != 11)
report(Write shadowing different X86_CR4_TSD, 0);
else
report(Write shadowing different X86_CR4_TSD, 1);
@@ -508,7 +508,7 @@ static void cr_shadowing_main()
mov %%rsi, %%cr4\n\t
::m(tmp)
:rsi, memory, cc);
-   if (stage != 12)
+   if (get_stage() != 12)
report(Write shadowing different X86_CR4_DE, 0);
else
report(Write shadowing different X86_CR4_DE, 1);
@@ -584,31 +584,31 @@ static int cr_shadowing_exit_handler()
switch (get_stage()) {
case 4:
report(Read shadowing CR0, 0);
-   set_stage(stage + 1);
+   set_stage(get_stage() + 1);
break;
case 5:
report(Read shadowing CR4, 0);
-   set_stage(stage + 1);
+   set_stage(get_stage() + 1);
break;
case 6:
report(Write shadowing CR0 (same value), 0);
-   set_stage(stage + 1);
+   set_stage(get_stage() + 1);
break;
case 7:
report(Write shadowing CR4 (same value), 0);
-   set_stage(stage + 1);
+   set_stage(get_stage() + 1);
break;
case 8:
case 9:
// 0x600 encodes mov %esi, %cr0
   

[PATCH 5/5] VMX: Test behavior on set and cleared save/load debug controls

2014-06-15 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

This particularly checks the case when debug controls are not to be
loaded/saved on host-guest transitions.

We have to fake the results related to IA32_DEBUGCTL, as support for this
MSR is missing in KVM. The test already contains all the bits required
once KVM adds support.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 x86/vmx.h   |   2 ++
 x86/vmx_tests.c | 110 
 2 files changed, 112 insertions(+)

diff --git a/x86/vmx.h b/x86/vmx.h
index 38ec3c5..3c4830f 100644
--- a/x86/vmx.h
+++ b/x86/vmx.h
@@ -326,6 +326,7 @@ enum Reason {
 #define X86_EFLAGS_ZF  0x0040 /* Zero Flag */
 
 enum Ctrl_exi {
+   EXI_SAVE_DBGCTLS= 1UL  2,
EXI_HOST_64 = 1UL  9,
EXI_LOAD_PERF   = 1UL  12,
EXI_INTA= 1UL  15,
@@ -337,6 +338,7 @@ enum Ctrl_exi {
 };
 
 enum Ctrl_ent {
+   ENT_LOAD_DBGCTLS= 1UL  2,
ENT_GUEST_64= 1UL  9,
ENT_LOAD_PAT= 1UL  14,
ENT_LOAD_EFER   = 1UL  15,
diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c
index d0b67de..0f4cfc2 100644
--- a/x86/vmx_tests.c
+++ b/x86/vmx_tests.c
@@ -1406,6 +1406,114 @@ static int interrupt_exit_handler(void)
return VMX_TEST_VMEXIT;
 }
 
+static int dbgctls_init(struct vmcs *vmcs)
+{
+   u64 dr7 = 0x402;
+   u64 zero = 0;
+
+   msr_bmp_init();
+   asm volatile(
+   mov %0,%%dr0\n\t
+   mov %0,%%dr1\n\t
+   mov %0,%%dr2\n\t
+   mov %1,%%dr7\n\t
+   : : r (zero), r (dr7));
+   wrmsr(MSR_IA32_DEBUGCTLMSR, 0x1);
+   vmcs_write(GUEST_DR7, 0x404);
+   vmcs_write(GUEST_DEBUGCTL, 0x2);
+
+   vmcs_write(ENT_CONTROLS, vmcs_read(ENT_CONTROLS) | ENT_LOAD_DBGCTLS);
+   vmcs_write(EXI_CONTROLS, vmcs_read(EXI_CONTROLS) | EXI_SAVE_DBGCTLS);
+
+   return VMX_TEST_START;
+}
+
+static void dbgctls_main(void)
+{
+   u64 dr7, debugctl;
+
+   asm volatile(mov %%dr7,%0 : =r (dr7));
+   debugctl = rdmsr(MSR_IA32_DEBUGCTLMSR);
+   debugctl = 0x2; /* KVM does not support DEBUGCTL so far */
+   report(Load debug controls, dr7 == 0x404  debugctl == 0x2);
+
+   dr7 = 0x408;
+   asm volatile(mov %0,%%dr7 : : r (dr7));
+   wrmsr(MSR_IA32_DEBUGCTLMSR, 0x3);
+
+   set_stage(0);
+   vmcall();
+   report(Save debug controls, get_stage() == 1);
+
+   if (ctrl_enter_rev.set  ENT_LOAD_DBGCTLS ||
+   ctrl_exit_rev.set  EXI_SAVE_DBGCTLS) {
+   printf(\tDebug controls are always loaded/saved\n);
+   return;
+   }
+   set_stage(2);
+   vmcall();
+
+   asm volatile(mov %%dr7,%0 : =r (dr7));
+   debugctl = rdmsr(MSR_IA32_DEBUGCTLMSR);
+   debugctl = 0x1; /* no KVM support */
+   report(Guest=host debug controls, dr7 == 0x402  debugctl == 0x1);
+
+   dr7 = 0x408;
+   asm volatile(mov %0,%%dr7 : : r (dr7));
+   wrmsr(MSR_IA32_DEBUGCTLMSR, 0x3);
+
+   set_stage(3);
+   vmcall();
+   report(Don't save debug controls, get_stage() == 4);
+}
+
+static int dbgctls_exit_handler(void)
+{
+   unsigned int reason = vmcs_read(EXI_REASON)  0xff;
+   u32 insn_len = vmcs_read(EXI_INST_LEN);
+   u64 guest_rip = vmcs_read(GUEST_RIP);
+   u64 dr7, debugctl;
+
+   asm volatile(mov %%dr7,%0 : =r (dr7));
+   debugctl = rdmsr(MSR_IA32_DEBUGCTLMSR);
+
+   switch (reason) {
+   case VMX_VMCALL:
+   switch (get_stage()) {
+   case 0:
+   if (dr7 == 0x400  debugctl == 0 
+   vmcs_read(GUEST_DR7) == 0x408 
+   vmcs_read(GUEST_DEBUGCTL) == /* 0x3 no KVM support*/ 0x2)
+   set_stage(1);
+   break;
+   case 2:
+   dr7 = 0x402;
+   asm volatile(mov %0,%%dr7 : : r (dr7));
+   wrmsr(MSR_IA32_DEBUGCTLMSR, 0x1);
+   vmcs_write(GUEST_DR7, 0x404);
+   vmcs_write(GUEST_DEBUGCTL, 0x2);
+
+   vmcs_write(ENT_CONTROLS,
+   vmcs_read(ENT_CONTROLS)  ~ENT_LOAD_DBGCTLS);
+   vmcs_write(EXI_CONTROLS,
+   vmcs_read(EXI_CONTROLS)  ~EXI_SAVE_DBGCTLS);
+   break;
+   case 3:
+   if (dr7 == 0x400  debugctl == 0 
+   vmcs_read(GUEST_DR7) == 0x404 
+   vmcs_read(GUEST_DEBUGCTL) == 0x2)
+   set_stage(4);
+   break;
+   }
+   vmcs_write(GUEST_RIP, guest_rip + insn_len);
+   return VMX_TEST_RESUME;
+   default:
+   printf(Unknown exit reason, %d\n, reason);
+   print_vmexit_info();
+   }
+   return VMX_TEST_VMEXIT;
+}
+
 

[PATCH 3/5] VMX: Test both interception and execution of instructions

2014-06-15 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

Extend the instruction interception test to also check for
interception-free execution.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 x86/vmx_tests.c | 121 +---
 1 file changed, 72 insertions(+), 49 deletions(-)

diff --git a/x86/vmx_tests.c b/x86/vmx_tests.c
index bf7aa2c..d0b67de 100644
--- a/x86/vmx_tests.c
+++ b/x86/vmx_tests.c
@@ -818,7 +818,6 @@ static int iobmp_exit_handler()
 #define INSN_CPU0  0
 #define INSN_CPU1  1
 #define INSN_ALWAYS_TRAP   2
-#define INSN_NEVER_TRAP3
 
 #define FIELD_EXIT_QUAL0x1
 #define FIELD_INSN_INFO0x2
@@ -829,7 +828,7 @@ asm(
insn_mwait: mwait;ret\n\t
insn_rdpmc: rdpmc;ret\n\t
insn_rdtsc: rdtsc;ret\n\t
-   insn_cr3_load: mov %rax,%cr3;ret\n\t
+   insn_cr3_load: mov cr3,%rax; mov %rax,%cr3;ret\n\t
insn_cr3_store: mov %cr3,%rax;ret\n\t
 #ifdef __x86_64__
insn_cr8_load: mov %rax,%cr8;ret\n\t
@@ -859,6 +858,7 @@ extern void insn_cpuid();
 extern void insn_invd();
 
 u32 cur_insn;
+u64 cr3;
 
 struct insn_table {
const char *name;
@@ -912,55 +912,56 @@ static struct insn_table insn_table[] = {
 
 static int insn_intercept_init()
 {
-   u32 ctrl_cpu[2];
+   u32 ctrl_cpu;
 
-   ctrl_cpu[0] = vmcs_read(CPU_EXEC_CTRL0);
-   ctrl_cpu[0] |= CPU_HLT | CPU_INVLPG | CPU_MWAIT | CPU_RDPMC | CPU_RDTSC |
-   CPU_CR3_LOAD | CPU_CR3_STORE |
-#ifdef __x86_64__
-   CPU_CR8_LOAD | CPU_CR8_STORE |
-#endif
-   CPU_MONITOR | CPU_PAUSE | CPU_SECONDARY;
-   ctrl_cpu[0] = ctrl_cpu_rev[0].clr;
-   vmcs_write(CPU_EXEC_CTRL0, ctrl_cpu[0]);
-   ctrl_cpu[1] = vmcs_read(CPU_EXEC_CTRL1);
-   ctrl_cpu[1] |= CPU_WBINVD | CPU_RDRAND;
-   ctrl_cpu[1] = ctrl_cpu_rev[1].clr;
-   vmcs_write(CPU_EXEC_CTRL1, ctrl_cpu[1]);
+   ctrl_cpu = ctrl_cpu_rev[0].set | CPU_SECONDARY;
+   ctrl_cpu = ctrl_cpu_rev[0].clr;
+   vmcs_write(CPU_EXEC_CTRL0, ctrl_cpu);
+   vmcs_write(CPU_EXEC_CTRL1, ctrl_cpu_rev[1].set);
+   cr3 = read_cr3();
return VMX_TEST_START;
 }
 
 static void insn_intercept_main()
 {
-   cur_insn = 0;
-   while(insn_table[cur_insn].name != NULL) {
-   set_stage(cur_insn);
-   if ((insn_table[cur_insn].type == INSN_CPU0
-!(ctrl_cpu_rev[0].clr  insn_table[cur_insn].flag))
-   || (insn_table[cur_insn].type == INSN_CPU1
-!(ctrl_cpu_rev[1].clr  insn_table[cur_insn].flag))) 
{
-   printf(\tCPU_CTRL1.CPU_%s is not supported.\n,
-   insn_table[cur_insn].name);
+   char msg[80];
+
+   for (cur_insn = 0; insn_table[cur_insn].name != NULL; cur_insn++) {
+   set_stage(cur_insn * 2);
+   if ((insn_table[cur_insn].type == INSN_CPU0 
+!(ctrl_cpu_rev[0].clr  insn_table[cur_insn].flag)) ||
+   (insn_table[cur_insn].type == INSN_CPU1 
+!(ctrl_cpu_rev[1].clr  insn_table[cur_insn].flag))) {
+   printf(\tCPU_CTRL%d.CPU_%s is not supported.\n,
+  insn_table[cur_insn].type - INSN_CPU0,
+  insn_table[cur_insn].name);
continue;
}
+
+   if ((insn_table[cur_insn].type == INSN_CPU0 
+!(ctrl_cpu_rev[0].set  insn_table[cur_insn].flag)) ||
+   (insn_table[cur_insn].type == INSN_CPU1 
+!(ctrl_cpu_rev[1].set  insn_table[cur_insn].flag))) {
+   /* skip hlt, it stalls the guest and is tested below */
+   if (insn_table[cur_insn].insn_func != insn_hlt)
+   insn_table[cur_insn].insn_func();
+   snprintf(msg, sizeof(msg), execute %s,
+insn_table[cur_insn].name);
+   report(msg, get_stage() == cur_insn * 2);
+   } else if (insn_table[cur_insn].type != INSN_ALWAYS_TRAP)
+   printf(\tCPU_CTRL%d.CPU_%s always traps.\n,
+  insn_table[cur_insn].type - INSN_CPU0,
+  insn_table[cur_insn].name);
+
+   vmcall();
+
insn_table[cur_insn].insn_func();
-   switch (insn_table[cur_insn].type) {
-   case INSN_CPU0:
-   case INSN_CPU1:
-   case INSN_ALWAYS_TRAP:
-   if (get_stage() != cur_insn + 1)
-   report(insn_table[cur_insn].name, 0);
-   else
-   report(insn_table[cur_insn].name, 1);
-   break;
-   case INSN_NEVER_TRAP:
-   if (get_stage() == cur_insn + 1)
-  

[PATCH 4/5] VMX: Validate capability MSRs

2014-06-15 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

Check for required-0 or required-1 bits as well as known field value
restrictions. Also check the consistency between VMX_*_CTLS and
VMX_TRUE_*_CTLS and between CR0/4_FIXED0 and CR0/4_FIXED1.
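
Two of the invariants tested below can be stated compactly: in each
capability MSR the allowed-0 word (bits 31:0) must be a subset of the
allowed-1 word (bits 63:32), and every default1 bit must read as 1 in the
allowed-0 word of the non-TRUE MSR. A standalone restatement (illustration
only, not the test's exact code):

#include <stdbool.h>
#include <stdint.h>

static bool ctl_msr_sane(uint64_t val, uint32_t default1)
{
        uint32_t allowed0 = (uint32_t)val;
        uint32_t allowed1 = (uint32_t)(val >> 32);

        return (allowed0 & default1) == default1 && /* default1 forced to 1 */
               (allowed0 & ~allowed1) == 0;  /* allowed-0 subset of allowed-1 */
}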

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 x86/vmx.c | 73 ++-
 x86/vmx.h |  5 +++--
 2 files changed, 75 insertions(+), 3 deletions(-)

diff --git a/x86/vmx.c b/x86/vmx.c
index 1182eef..84c514b 100644
--- a/x86/vmx.c
+++ b/x86/vmx.c
@@ -635,6 +635,76 @@ static void test_vmptrst(void)
report(test vmptrst, (!ret)  (vmcs1 == vmcs2));
 }
 
+struct vmx_ctl_msr {
+   const char *name;
+   u32 index, true_index;
+   u32 default1;
+} vmx_ctl_msr[] = {
+   { MSR_IA32_VMX_PINBASED_CTLS, MSR_IA32_VMX_PINBASED_CTLS,
+ MSR_IA32_VMX_TRUE_PIN, 0x16 },
+   { MSR_IA32_VMX_PROCBASED_CTLS, MSR_IA32_VMX_PROCBASED_CTLS,
+ MSR_IA32_VMX_TRUE_PROC, 0x401e172 },
+   { MSR_IA32_VMX_PROCBASED_CTLS2, MSR_IA32_VMX_PROCBASED_CTLS2,
+ MSR_IA32_VMX_PROCBASED_CTLS2, 0 },
+   { MSR_IA32_VMX_EXIT_CTLS, MSR_IA32_VMX_EXIT_CTLS,
+ MSR_IA32_VMX_TRUE_EXIT, 0x36dff },
+   { MSR_IA32_VMX_ENTRY_CTLS, MSR_IA32_VMX_ENTRY_CTLS,
+ MSR_IA32_VMX_TRUE_ENTRY, 0x11ff },
+};
+
+static void test_vmx_caps(void)
+{
+   u64 val, true_val, default1, fixed0, fixed1;
+   unsigned int n;
+   bool ok;
+
+   printf(\nTest suite: VMX capability reporting\n);
+
+   report(MSR_IA32_VMX_BASIC,
+  (basic.revision  (1ul  31)) == 0 
+  basic.size  0  basic.size = 4096 
+  (basic.type == 0 || basic.type == 6) 
+  basic.reserved1 == 0  basic.reserved2 == 0);
+
+   val = rdmsr(MSR_IA32_VMX_MISC);
+   report(MSR_IA32_VMX_MISC,
+  (!(ctrl_cpu_rev[1].clr  CPU_URG) || val  (1ul  5)) 
+  ((val  16)  0x1ff) = 256 
+  (val  0xc0007e00) == 0);
+
+   for (n = 0; n  ARRAY_SIZE(vmx_ctl_msr); n++) {
+   val = rdmsr(vmx_ctl_msr[n].index);
+   default1 = vmx_ctl_msr[n].default1;
+   ok = (val  default1) == default1 
+   u32)val) ^ (val  32))  ~(val  32)) == 0;
+   if (ok  basic.ctrl) {
+   true_val = rdmsr(vmx_ctl_msr[n].true_index);
+   ok = (val  32) == (true_val  32) 
+   ((u32)(val ^ true_val)  ~default1) == 0;
+   }
+   report(vmx_ctl_msr[n].name, ok);
+   }
+
+   fixed0 = rdmsr(MSR_IA32_VMX_CR0_FIXED0);
+   fixed1 = rdmsr(MSR_IA32_VMX_CR0_FIXED1);
+   report(MSR_IA32_VMX_IA32_VMX_CR0_FIXED0/1,
+  ((fixed0 ^ fixed1)  ~fixed1) == 0);
+
+   fixed0 = rdmsr(MSR_IA32_VMX_CR4_FIXED0);
+   fixed1 = rdmsr(MSR_IA32_VMX_CR4_FIXED1);
+   report(MSR_IA32_VMX_IA32_VMX_CR4_FIXED0/1,
+  ((fixed0 ^ fixed1)  ~fixed1) == 0);
+
+   val = rdmsr(MSR_IA32_VMX_VMCS_ENUM);
+   report(MSR_IA32_VMX_VMCS_ENUM,
+  (val  0x3e) = 0x2a 
+  (val  0xfc01Ull) == 0);
+
+   val = rdmsr(MSR_IA32_VMX_EPT_VPID_CAP);
+   report(MSR_IA32_VMX_EPT_VPID_CAP,
+  (val  0xf07ef9eebebeUll) == 0);
+}
+
 /* This function can only be called in guest */
 static void __attribute__((__used__)) hypercall(u32 hypercall_no)
 {
@@ -777,7 +847,7 @@ static int test_run(struct vmx_test *test)
regs = test-guest_regs;
vmcs_write(GUEST_RFLAGS, regs.rflags | 0x2);
launched = 0;
-   printf(\nTest suite : %s\n, test-name);
+   printf(\nTest suite: %s\n, test-name);
vmx_run();
if (vmx_off()) {
printf(%s : vmxoff failed.\n, __func__);
@@ -816,6 +886,7 @@ int main(void)
goto exit;
}
test_vmxoff();
+   test_vmx_caps();
 
while (vmx_tests[++i].name != NULL)
if (test_run(vmx_tests[i]))
diff --git a/x86/vmx.h b/x86/vmx.h
index 69a5385..38ec3c5 100644
--- a/x86/vmx.h
+++ b/x86/vmx.h
@@ -46,12 +46,13 @@ union vmx_basic {
struct {
u32 revision;
u32 size:13,
-   : 3,
+   reserved1: 3,
width:1,
dual:1,
type:4,
insouts:1,
-   ctrl:1;
+   ctrl:1,
+   reserved2:8;
};
 };
 
-- 
1.8.1.1.298.ge7eed54



[PATCH 0/5] kvm-unit-tests: more instr. interceptions, debug control migration

2014-06-15 Thread Jan Kiszka
The tests corresponding to (and going beyond) the issues fixed in
http://thread.gmane.org/gmane.comp.emulators.kvm.devel/123282

Jan Kiszka (5):
  VMX: Add tests for CR3 and CR8 interception
  VMX: Only use get_stage accessor
  VMX: Test both interception and execution of instructions
  VMX: Validate capability MSRs
  VMX: Test behavior on set and cleared save/load debug controls

 x86/vmx.c   |  73 -
 x86/vmx.h   |   9 +-
 x86/vmx_tests.c | 327 +---
 3 files changed, 322 insertions(+), 87 deletions(-)

-- 
1.8.1.1.298.ge7eed54



[PATCH V2] KVM: PPC: BOOK3S: HV: Use base page size when comparing against slb value

2014-06-15 Thread Aneesh Kumar K.V
With guests supporting Multiple Page Size per Segment (MPSS),
hpte_page_size returns the actual page size used. Add a new function to
return the base page size and use that to compare against the page size
calculated from the SLB. Without this patch an hpte lookup can fail since
we are comparing the wrong page size in kvmppc_hv_find_lock_hpte.
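
To illustrate the point of the fix (simplified, not the kernel's real HPTE
decoding): with MPSS an HPTE can report an actual page size larger than
the segment's base page size, for example a 16MB page inside a 64KB
base-page-size segment, while the size derived from the SLB entry is
always the base size, so that is what the lookup has to compare:

#include <assert.h>
#include <stdbool.h>

/* Simplified illustration: the SLB-derived size is a base page size,
 * so the HPTE must be matched on its base size, not its actual size. */
static bool hpte_matches_slb(unsigned long base_size,
                             unsigned long actual_size,
                             unsigned long slb_size)
{
        (void)actual_size;      /* comparing this was the old, buggy check */
        return base_size == slb_size;
}

int main(void)
{
        /* MPSS example: 16MB actual page in a 64KB base-page-size segment */
        assert(hpte_matches_slb(64 * 1024, 16 * 1024 * 1024, 64 * 1024));
        return 0;
}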

Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
---
Changes from V1:
* Remove obsolete comment from the code
* Update commit message

 arch/powerpc/include/asm/kvm_book3s_64.h | 19 +--
 arch/powerpc/kvm/book3s_64_mmu_hv.c  |  2 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c  |  7 ++-
 3 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index 34422be566ce..3d0f3fb9c6b6 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -202,8 +202,10 @@ static inline unsigned long compute_tlbie_rb(unsigned long v, unsigned long r,
return rb;
 }
 
-static inline unsigned long hpte_page_size(unsigned long h, unsigned long l)
+static inline unsigned long __hpte_page_size(unsigned long h, unsigned long l,
+bool is_base_size)
 {
+
int size, a_psize;
/* Look at the 8 bit LP value */
unsigned int lp = (l  LP_SHIFT)  ((1  LP_BITS) - 1);
@@ -218,14 +220,27 @@ static inline unsigned long hpte_page_size(unsigned long h, unsigned long l)
continue;
 
a_psize = __hpte_actual_psize(lp, size);
-   if (a_psize != -1)
+   if (a_psize != -1) {
+   if (is_base_size)
+   return 1ul  mmu_psize_defs[size].shift;
return 1ul  mmu_psize_defs[a_psize].shift;
+   }
}
 
}
return 0;
 }
 
+static inline unsigned long hpte_page_size(unsigned long h, unsigned long l)
+{
+   return __hpte_page_size(h, l, 0);
+}
+
+static inline unsigned long hpte_base_page_size(unsigned long h, unsigned long l)
+{
+   return __hpte_page_size(h, l, 1);
+}
+
 static inline unsigned long hpte_rpn(unsigned long ptel, unsigned long psize)
 {
return ((ptel  HPTE_R_RPN)  ~(psize - 1))  PAGE_SHIFT;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index f53cf2eae36a..7ff45ed27c65 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -1567,7 +1567,7 @@ static ssize_t kvm_htab_write(struct file *file, const char __user *buf,
goto out;
}
if (!rma_setup  is_vrma_hpte(v)) {
-   unsigned long psize = hpte_page_size(v, r);
+   unsigned long psize = hpte_base_page_size(v, r);
unsigned long senc = slb_pgsize_encoding(psize);
unsigned long lpcr;
 
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 87624ab5ba82..d86356bfc970 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -833,13 +833,10 @@ long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t eaddr, unsigned long slb_v,
r = be64_to_cpu(hpte[i+1]);
 
/*
-* Check the HPTE again, including large page size
-* Since we don't currently allow any MPSS (mixed
-* page-size segment) page sizes, it is sufficient
-* to check against the actual page size.
+* Check the HPTE again, including base page size
 */
if ((v  valid)  (v  mask) == val 
-   hpte_page_size(v, r) == (1ul  pshift))
+   hpte_base_page_size(v, r) == (1ul  pshift))
/* Return with the HPTE still locked */
return (hash  3) + (i  1);
 
-- 
1.9.1




Re: [PATCH 0/6] KVM: PPC: Book3S HV: Enable on little endian hosts

2014-06-15 Thread Aneesh Kumar K.V
Alexander Graf ag...@suse.de writes:

 So far we've been able to successfully run HV KVM on big endian hosts, but
 once you dive into little endian land things start to fall apart.

 This patch set enables HV KVM for little endian hosts. This should be the
 final piece left missing to get little endian systems fully en par with big
 endian ones in the KVM world.

Can we also use types like __be64 where we need them? That would also
make sure that tools like sparse can catch such errors.

-aneesh
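
As an example of what such annotations buy (a minimal kernel-style
illustration, not part of this series): declaring the HPTE words as __be64
makes sparse warn whenever a value is used without an explicit
be64_to_cpu() conversion.

#include <linux/types.h>
#include <asm/byteorder.h>

struct hpte_be {
        __be64 v;
        __be64 r;
};

static inline unsigned long hpte_v(const struct hpte_be *hpte)
{
        return be64_to_cpu(hpte->v);    /* explicit, sparse-clean conversion */
}

/* Returning hpte->v directly would make sparse complain about a
 * "restricted __be64" being used as unsigned long. */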



Re: [PATCH v2 01/10] DMA, CMA: clean-up log message

2014-06-15 Thread Joonsoo Kim
On Thu, Jun 12, 2014 at 11:53:16AM +0200, Michal Nazarewicz wrote:
 On Thu, Jun 12 2014, Michal Nazarewicz min...@mina86.com wrote:
  I used “function(arg1, arg2, …)” at the *beginning* of functions when
  the arguments passed to the function were included in the message.  In
  all other cases I left it at just “function:” (or just no additional
  prefix).  IMO that's a reasonable strategy.
 
 At closer inspection, I realised drivers/base/dma-contiguous.c is
 Marek's code, but the above I think is still reasonable thing to do, so
 I'd rather standardise on having “function(…)” only at the beginning of
 a function.  Just my 0.02 CHF.

Hello,

Now I realize that these changes aren't needed in this patchset, so I
will simplify this patch to just remove the redundant 'CMA' prefix. Other
things can be done after merging if we need them.

Thanks.


Re: [PATCH v2 04/10] DMA, CMA: support alignment constraint on cma region

2014-06-15 Thread Joonsoo Kim
On Thu, Jun 12, 2014 at 12:02:38PM +0200, Michal Nazarewicz wrote:
 On Thu, Jun 12 2014, Joonsoo Kim iamjoonsoo@lge.com wrote:
  ppc kvm's cma area management needs alignment constraint on
 
 I've noticed it earlier and cannot seem to get to terms with this.  It
 should IMO be PPC, KVM and CMA since those are acronyms.  But if you
 have strong feelings, it's not a big issue.

Yes, I will fix it.

 
  cma region. So support it to prepare generalization of cma area
  management functionality.
 
  Additionally, add some comments which tell us why alignment
  constraint is needed on cma region.
 
  Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
 
 Acked-by: Michal Nazarewicz min...@mina86.com
 
  diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
  index 8a44c82..bc4c171 100644
  --- a/drivers/base/dma-contiguous.c
  +++ b/drivers/base/dma-contiguous.c
  @@ -219,6 +220,7 @@ core_initcall(cma_init_reserved_areas);
* @size: Size of the reserved area (in bytes),
* @base: Base address of the reserved area optional, use 0 for any
* @limit: End address of the reserved memory (optional, 0 for any).
  + * @alignment: Alignment for the contiguous memory area, should be
  power of 2
 
 “must be power of 2 or zero”.

Okay.

* @res_cma: Pointer to store the created cma region.
* @fixed: hint about where to place the reserved area
*
  @@ -233,15 +235,15 @@ core_initcall(cma_init_reserved_areas);
*/
   static int __init __dma_contiguous_reserve_area(phys_addr_t size,
  phys_addr_t base, phys_addr_t limit,
  +   phys_addr_t alignment,
  struct cma **res_cma, bool fixed)
   {
  struct cma *cma = cma_areas[cma_area_count];
  -   phys_addr_t alignment;
  int ret = 0;
   
  -   pr_debug(%s(size %lx, base %08lx, limit %08lx)\n, __func__,
  -(unsigned long)size, (unsigned long)base,
  -(unsigned long)limit);
  +   pr_debug(%s(size %lx, base %08lx, limit %08lx align_order %08lx)\n,
  +   __func__, (unsigned long)size, (unsigned long)base,
  +   (unsigned long)limit, (unsigned long)alignment);
 
 Nit: Align with the rest of the arguments, i.e.:
 
 + pr_debug(%s(size %lx, base %08lx, limit %08lx align_order %08lx)\n,
 +  __func__, (unsigned long)size, (unsigned long)base,
 +  (unsigned long)limit, (unsigned long)alignment);

What's the difference between mine and yours?

Thanks.
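
As a side note on the wording settled above ("must be power of 2 or
zero"), the corresponding guard is a one-liner; a standalone sketch, not
necessarily the check the next revision will use:

#include <stdbool.h>

/* Alignment must be zero (pick the default) or a power of two. */
static bool cma_alignment_valid(unsigned long alignment)
{
        return alignment == 0 || (alignment & (alignment - 1)) == 0;
}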


Re: [PATCH v2 05/10] DMA, CMA: support arbitrary bitmap granularity

2014-06-15 Thread Joonsoo Kim
On Thu, Jun 12, 2014 at 12:19:54PM +0200, Michal Nazarewicz wrote:
 On Thu, Jun 12 2014, Joonsoo Kim iamjoonsoo@lge.com wrote:
  ppc kvm's cma region management requires arbitrary bitmap granularity,
  since they want to reserve very large memory and manage this region
  with bitmap that one bit for several pages to reduce management overheads.
  So support arbitrary bitmap granularity for following generalization.
 
  Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
 
 Acked-by: Michal Nazarewicz min...@mina86.com
 
  diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
  index bc4c171..9bc9340 100644
  --- a/drivers/base/dma-contiguous.c
  +++ b/drivers/base/dma-contiguous.c
  @@ -38,6 +38,7 @@ struct cma {
  unsigned long   base_pfn;
  unsigned long   count;
 
 Have you considered replacing count with maxno?

No, I haven't.
I think that count is better than maxno, since it represents the number
of pages in this region.

 
  unsigned long   *bitmap;
  +   int order_per_bit; /* Order of pages represented by one bit */
 
 I'd make it unsigned.

Will fix it.

  struct mutexlock;
   };
   
  +static void clear_cma_bitmap(struct cma *cma, unsigned long pfn, int
  count)
 
 For consistency cma_clear_bitmap would make more sense I think.  On the
 other hand, you're just moving stuff around so perhaps renaming the
 function at this point is not worth it any more.

Will fix it.

  +{
  +   unsigned long bitmapno, nr_bits;
  +
  +   bitmapno = (pfn - cma-base_pfn)  cma-order_per_bit;
  +   nr_bits = cma_bitmap_pages_to_bits(cma, count);
  +
  +   mutex_lock(cma-lock);
  +   bitmap_clear(cma-bitmap, bitmapno, nr_bits);
  +   mutex_unlock(cma-lock);
  +}
  +
   static int __init cma_activate_area(struct cma *cma)
   {
  -   int bitmap_size = BITS_TO_LONGS(cma-count) * sizeof(long);
  +   int bitmap_maxno = cma_bitmap_maxno(cma);
  +   int bitmap_size = BITS_TO_LONGS(bitmap_maxno) * sizeof(long);
  unsigned long base_pfn = cma-base_pfn, pfn = base_pfn;
  unsigned i = cma-count  pageblock_order;
  struct zone *zone;
 
 bitmap_maxno is never used again, perhaps:
 
 + int bitmap_size = BITS_TO_LONGS(cma_bitmap_maxno(cma)) * sizeof(long);
 
 instead? Up to you.

Okay!!

Thanks.
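
For reference, the granularity discussed in this patch boils down to a
small conversion: page counts are rounded up to whole (1 << order_per_bit)
chunks before being turned into bitmap bits. A standalone sketch of that
arithmetic (the helper name mirrors the patch, the values are
illustrative):

#include <assert.h>

/* One bitmap bit covers (1 << order_per_bit) pages; round up. */
static unsigned long cma_bitmap_pages_to_bits(unsigned long pages,
                                              unsigned int order_per_bit)
{
        return (pages + (1UL << order_per_bit) - 1) >> order_per_bit;
}

int main(void)
{
        /* e.g. the kvm case: one bit per 64 pages (order_per_bit == 6) */
        assert(cma_bitmap_pages_to_bits(64, 6) == 1);
        assert(cma_bitmap_pages_to_bits(65, 6) == 2);
        return 0;
}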


Re: [PATCH v2 03/10] DMA, CMA: separate core cma management codes from DMA APIs

2014-06-15 Thread Joonsoo Kim
On Thu, Jun 12, 2014 at 02:37:43PM +0900, Minchan Kim wrote:
 On Thu, Jun 12, 2014 at 12:21:40PM +0900, Joonsoo Kim wrote:
  To prepare future generalization work on cma area management code,
  we need to separate core cma management codes from DMA APIs.
  We will extend these core functions to cover requirements of
  ppc kvm's cma area management functionality in following patches.
  This separation helps us not to touch DMA APIs while extending
  core functions.
  
  Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
  
  diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
  index fb0cdce..8a44c82 100644
  --- a/drivers/base/dma-contiguous.c
  +++ b/drivers/base/dma-contiguous.c
  @@ -231,9 +231,9 @@ core_initcall(cma_init_reserved_areas);
* If @fixed is true, reserve contiguous area at exactly @base.  If false,
* reserve in range from @base to @limit.
*/
  -int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
  -  phys_addr_t limit, struct cma **res_cma,
  -  bool fixed)
  +static int __init __dma_contiguous_reserve_area(phys_addr_t size,
  +   phys_addr_t base, phys_addr_t limit,
  +   struct cma **res_cma, bool fixed)
   {
  struct cma *cma = cma_areas[cma_area_count];
  phys_addr_t alignment;
  @@ -288,16 +288,30 @@ int __init dma_contiguous_reserve_area(phys_addr_t 
  size, phys_addr_t base,
   
  pr_info(%s(): reserved %ld MiB at %08lx\n,
  __func__, (unsigned long)size / SZ_1M, (unsigned long)base);
  -
  -   /* Architecture specific contiguous memory fixup. */
  -   dma_contiguous_early_fixup(base, size);
  return 0;
  +
   err:
  pr_err(%s(): failed to reserve %ld MiB\n,
  __func__, (unsigned long)size / SZ_1M);
  return ret;
   }
   
  +int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
  +  phys_addr_t limit, struct cma **res_cma,
  +  bool fixed)
  +{
  +   int ret;
  +
  +   ret = __dma_contiguous_reserve_area(size, base, limit, res_cma, fixed);
  +   if (ret)
  +   return ret;
  +
  +   /* Architecture specific contiguous memory fixup. */
  +   dma_contiguous_early_fixup(base, size);
 
 In the old code, base and size are aligned to 'alignment' and passed into the arch fixup,
 but your patch changes that.
 I didn't look at what kind of side effects it may have, but I just wanted to confirm.

Good catch!!!
I will fix it.

  +
  +   return 0;
  +}
  +
   static void clear_cma_bitmap(struct cma *cma, unsigned long pfn, int count)
   {
  mutex_lock(cma-lock);
  @@ -316,20 +330,16 @@ static void clear_cma_bitmap(struct cma *cma, 
  unsigned long pfn, int count)
* global one. Requires architecture specific dev_get_cma_area() helper
* function.
*/
  -struct page *dma_alloc_from_contiguous(struct device *dev, int count,
  +static struct page *__dma_alloc_from_contiguous(struct cma *cma, int count,
 unsigned int align)
   {
  unsigned long mask, pfn, pageno, start = 0;
  -   struct cma *cma = dev_get_cma_area(dev);
  struct page *page = NULL;
  int ret;
   
  if (!cma || !cma-count)
  return NULL;
   
  -   if (align  CONFIG_CMA_ALIGNMENT)
  -   align = CONFIG_CMA_ALIGNMENT;
  -
  pr_debug(%s(cma %p, count %d, align %d)\n, __func__, (void *)cma,
   count, align);
   
  @@ -377,6 +387,17 @@ struct page *dma_alloc_from_contiguous(struct device 
  *dev, int count,
  return page;
   }
   
 
 Please move the description in __dma_alloc_from_contiguous to here exported 
 API.
 

Okay.

  +struct page *dma_alloc_from_contiguous(struct device *dev, int count,
  +  unsigned int align)
  +{
  +   struct cma *cma = dev_get_cma_area(dev);
  +
  +   if (align  CONFIG_CMA_ALIGNMENT)
  +   align = CONFIG_CMA_ALIGNMENT;
  +
  +   return __dma_alloc_from_contiguous(cma, count, align);
  +}
  +
   /**
* dma_release_from_contiguous() - release allocated pages
* @dev:   Pointer to device for which the pages were allocated.
  @@ -387,10 +408,9 @@ struct page *dma_alloc_from_contiguous(struct device 
  *dev, int count,
* It returns false when provided pages do not belong to contiguous area 
  and
* true otherwise.
*/
  -bool dma_release_from_contiguous(struct device *dev, struct page *pages,
  +static bool __dma_release_from_contiguous(struct cma *cma, struct page 
  *pages,
   int count)
   {
  -   struct cma *cma = dev_get_cma_area(dev);
  unsigned long pfn;
   
  if (!cma || !pages)
  @@ -410,3 +430,11 @@ bool dma_release_from_contiguous(struct device *dev, 
  struct page *pages,
   
  return true;
   }
  +
 
 Ditto.
 
Okay.

Thanks.

Re: [PATCH v2 06/10] CMA: generalize CMA reserved area management functionality

2014-06-15 Thread Joonsoo Kim
On Sat, Jun 14, 2014 at 03:46:44PM +0530, Aneesh Kumar K.V wrote:
 Joonsoo Kim iamjoonsoo@lge.com writes:
 
  Currently, there are two users on CMA functionality, one is the DMA
  subsystem and the other is the kvm on powerpc. They have their own code
  to manage CMA reserved area even if they looks really similar.
  From my guess, it is caused by some needs on bitmap management. Kvm side
  wants to maintain bitmap not for 1 page, but for more size. Eventually it
  use bitmap where one bit represents 64 pages.
 
  When I implement CMA related patches, I should change those two places
  to apply my change and it seem to be painful to me. I want to change
  this situation and reduce future code management overhead through
  this patch.
 
  This change could also help developer who want to use CMA in their
  new feature development, since they can use CMA easily without
  copying  pasting this reserved area management code.
 
  In previous patches, we have prepared some features to generalize
  CMA reserved area management and now it's time to do it. This patch
  moves core functions to mm/cma.c and change DMA APIs to use
  these functions.
 
  There is no functional change in DMA APIs.
 
  v2: There is no big change from v1 in mm/cma.c. Mostly renaming.
 
  Acked-by: Michal Nazarewicz min...@mina86.com
  Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
 
 
 .
 
  +
  +   mask = cma_bitmap_aligned_mask(cma, align);
  +   bitmap_maxno = cma_bitmap_maxno(cma);
  +   nr_bits = cma_bitmap_pages_to_bits(cma, count);
  +
  +   for (;;) {
  +   mutex_lock(cma-lock);
  +   bitmapno = bitmap_find_next_zero_area(cma-bitmap,
  +   bitmap_maxno, start, nr_bits, mask);
  +   if (bitmapno = bitmap_maxno) {
  +   mutex_unlock(cma-lock);
  +   break;
  +   }
  +   bitmap_set(cma-bitmap, bitmapno, nr_bits);
  +   /*
  +* It's safe to drop the lock here. We've marked this region for
  +* our exclusive use. If the migration fails we will take the
  +* lock again and unmark it.
  +*/
  +   mutex_unlock(cma-lock);
  +
  +   pfn = cma-base_pfn + (bitmapno  cma-order_per_bit);
  +   mutex_lock(cma_mutex);
  +   ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA);
  +   mutex_unlock(cma_mutex);
  +   if (ret == 0) {
  +   page = pfn_to_page(pfn);
  +   break;
  +   } else if (ret != -EBUSY) {
  +   clear_cma_bitmap(cma, pfn, count);
  +   break;
  +   }
  +   
 
 
 For setting bit map we do
   bitmap_set(cma-bitmap, bitmapno, nr_bits);
 alloc_contig()..
 if (error)
 clear_cma_bitmap(cma, pfn, count);
 
 Why ?
 
 why not bitmap_clear() ?
 

Unlike your pseudocode, for setting the bitmap we do:
- grab the mutex
- bitmap_set
- release the mutex

clear_cma_bitmap() handles these steps internally.

Thanks.
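
Condensed, the allocation-side pattern described above looks roughly like
this (kernel-style sketch relying on the struct cma fields and helpers
introduced by the series, not literal patch text):

/* The bitmap search and bitmap_set() share one critical section, so no
 * separate locked helper is needed on the allocation side; the error
 * path uses clear_cma_bitmap(), which takes cma->lock internally. */
static unsigned long cma_mark_free_range(struct cma *cma, unsigned long start,
                                         unsigned long nr_bits,
                                         unsigned long bitmap_maxno,
                                         unsigned long mask)
{
        unsigned long bitmapno;

        mutex_lock(&cma->lock);
        bitmapno = bitmap_find_next_zero_area(cma->bitmap, bitmap_maxno,
                                              start, nr_bits, mask);
        if (bitmapno < bitmap_maxno)
                bitmap_set(cma->bitmap, bitmapno, nr_bits); /* still locked */
        mutex_unlock(&cma->lock);

        return bitmapno;
}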



Re: [PATCH v2 07/10] PPC, KVM, CMA: use general CMA reserved area management framework

2014-06-15 Thread Joonsoo Kim
On Sat, Jun 14, 2014 at 03:35:33PM +0530, Aneesh Kumar K.V wrote:
 Joonsoo Kim iamjoonsoo@lge.com writes:
 
  Now, we have general CMA reserved area management framework,
  so use it for future maintainabilty. There is no functional change.
 
  Acked-by: Michal Nazarewicz min...@mina86.com
  Acked-by: Paolo Bonzini pbonz...@redhat.com
  Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
 
  diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
  b/arch/powerpc/kvm/book3s_hv_builtin.c
  index 7cde8a6..28ec226 100644
  --- a/arch/powerpc/kvm/book3s_hv_builtin.c
  +++ b/arch/powerpc/kvm/book3s_hv_builtin.c
  @@ -16,12 +16,14 @@
   #include linux/init.h
   #include linux/memblock.h
   #include linux/sizes.h
  +#include linux/cma.h
 
   #include asm/cputable.h
   #include asm/kvm_ppc.h
   #include asm/kvm_book3s.h
 
  -#include book3s_hv_cma.h
  +#define KVM_CMA_CHUNK_ORDER18
  +
   /*
* Hash page table alignment on newer cpus(CPU_FTR_ARCH_206)
* should be power of 2.
  @@ -43,6 +45,8 @@ static unsigned long kvm_cma_resv_ratio = 5;
   unsigned long kvm_rma_pages = (1  27)  PAGE_SHIFT; /* 128MB */
   EXPORT_SYMBOL_GPL(kvm_rma_pages);
 
  +static struct cma *kvm_cma;
  +
   /* Work out RMLS (real mode limit selector) field value for a given RMA 
  size.
  Assumes POWER7 or PPC970. */
   static inline int lpcr_rmls(unsigned long rma_size)
  @@ -97,7 +101,7 @@ struct kvm_rma_info *kvm_alloc_rma()
  ri = kmalloc(sizeof(struct kvm_rma_info), GFP_KERNEL);
  if (!ri)
  return NULL;
  -   page = kvm_alloc_cma(kvm_rma_pages, kvm_rma_pages);
  +   page = cma_alloc(kvm_cma, kvm_rma_pages, get_order(kvm_rma_pages));
  if (!page)
  goto err_out;
  atomic_set(ri-use_count, 1);
  @@ -112,7 +116,7 @@ EXPORT_SYMBOL_GPL(kvm_alloc_rma);
   void kvm_release_rma(struct kvm_rma_info *ri)
   {
  if (atomic_dec_and_test(ri-use_count)) {
  -   kvm_release_cma(pfn_to_page(ri-base_pfn), kvm_rma_pages);
  +   cma_release(kvm_cma, pfn_to_page(ri-base_pfn), kvm_rma_pages);
  kfree(ri);
  }
   }
  @@ -134,13 +138,13 @@ struct page *kvm_alloc_hpt(unsigned long nr_pages)
  /* Old CPUs require HPT aligned on a multiple of its size */
  if (!cpu_has_feature(CPU_FTR_ARCH_206))
  align_pages = nr_pages;
  -   return kvm_alloc_cma(nr_pages, align_pages);
  +   return cma_alloc(kvm_cma, nr_pages, get_order(align_pages));
   }
   EXPORT_SYMBOL_GPL(kvm_alloc_hpt);
 
   void kvm_release_hpt(struct page *page, unsigned long nr_pages)
   {
  -   kvm_release_cma(page, nr_pages);
  +   cma_release(kvm_cma, page, nr_pages);
   }
   EXPORT_SYMBOL_GPL(kvm_release_hpt);
 
  @@ -179,7 +183,8 @@ void __init kvm_cma_reserve(void)
  align_size = HPT_ALIGN_PAGES  PAGE_SHIFT;
 
  align_size = max(kvm_rma_pages  PAGE_SHIFT, align_size);
  -   kvm_cma_declare_contiguous(selected_size, align_size);
  +   cma_declare_contiguous(selected_size, 0, 0, align_size,
  +   KVM_CMA_CHUNK_ORDER - PAGE_SHIFT, kvm_cma, false);
  }
   }
 
  diff --git a/arch/powerpc/kvm/book3s_hv_cma.c 
  b/arch/powerpc/kvm/book3s_hv_cma.c
  deleted file mode 100644
  index d9d3d85..000
  --- a/arch/powerpc/kvm/book3s_hv_cma.c
  +++ /dev/null
  @@ -1,240 +0,0 @@
  -/*
  - * Contiguous Memory Allocator for ppc KVM hash pagetable  based on CMA
  - * for DMA mapping framework
  - *
  - * Copyright IBM Corporation, 2013
  - * Author Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
  - *
  - * This program is free software; you can redistribute it and/or
  - * modify it under the terms of the GNU General Public License as
  - * published by the Free Software Foundation; either version 2 of the
  - * License or (at your optional) any later version of the license.
  - *
  - */
  -#define pr_fmt(fmt) kvm_cma:  fmt
  -
  -#ifdef CONFIG_CMA_DEBUG
  -#ifndef DEBUG
  -#  define DEBUG
  -#endif
  -#endif
  -
  -#include linux/memblock.h
  -#include linux/mutex.h
  -#include linux/sizes.h
  -#include linux/slab.h
  -
  -#include book3s_hv_cma.h
  -
  -struct kvm_cma {
  -   unsigned long   base_pfn;
  -   unsigned long   count;
  -   unsigned long   *bitmap;
  -};
  -
  -static DEFINE_MUTEX(kvm_cma_mutex);
  -static struct kvm_cma kvm_cma_area;
  -
  -/**
  - * kvm_cma_declare_contiguous() - reserve area for contiguous memory 
  handling
  - *   for kvm hash pagetable
  - * @size:  Size of the reserved memory.
  - * @alignment:  Alignment for the contiguous memory area
  - *
  - * This function reserves memory for kvm cma area. It should be
  - * called by arch code when early allocator (memblock or bootmem)
  - * is still activate.
  - */
  -long __init kvm_cma_declare_contiguous(phys_addr_t size, phys_addr_t 
  alignment)
  -{
  -   long base_pfn;
  -   phys_addr_t addr;
  -   struct kvm_cma *cma = kvm_cma_area;
  -
  -   pr_debug(%s(size %lx)\n, __func__, (unsigned long)size);
  -
  -   if (!size)

Re: [PATCH v2 00/10] CMA: generalize CMA reserved area management code

2014-06-15 Thread Joonsoo Kim
On Sat, Jun 14, 2014 at 12:55:39PM +0530, Aneesh Kumar K.V wrote:
 Joonsoo Kim iamjoonsoo@lge.com writes:
 
  Currently, there are two users on CMA functionality, one is the DMA
  subsystem and the other is the kvm on powerpc. They have their own code
  to manage CMA reserved area even if they looks really similar.
  From my guess, it is caused by some needs on bitmap management. Kvm side
  wants to maintain bitmap not for 1 page, but for more size. Eventually it
  use bitmap where one bit represents 64 pages.
 
  When I implement CMA related patches, I should change those two places
  to apply my change and it seem to be painful to me. I want to change
  this situation and reduce future code management overhead through
  this patch.
 
  This change could also help developer who want to use CMA in their
  new feature development, since they can use CMA easily without
  copying  pasting this reserved area management code.
 
  v2:
Although this patchset looks very different with v1, the end result,
that is, mm/cma.c is same with v1's one. So I carry Ack to patch 6-7.
 
  Patch 1-5 prepare some features to cover ppc kvm's requirements.
  Patch 6-7 generalize CMA reserved area management code and change users
  to use it.
  Patch 8-10 clean-up minor things.
 
 
 I wanted to test the ppc changes and found that the patch series doesn't apply
 against v3.15 . Do you have a kernel tree which I can clone to test this
 series ?

This is based on linux-next (next-20140610), and my tree is at the
following link:

https://github.com/JoonsooKim/linux/tree/cma-general-v2.0-next-20140610

But, I think I'm late, because you have already added a Tested-by tag.

Thanks.
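
For anyone picking the series up from that tree, a hedged sketch of how a
new user would call the generalized API described in the cover letter
quoted above (the device name is made up; the argument order shown is the
one used at this point in the series, and a later v3 patch reorders
base/size and moves the output parameter to the end):

#include <linux/cma.h>
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/sizes.h>

static struct cma *mydev_cma;   /* hypothetical user of the generic API */

/* Called from early arch setup code while memblock is still active:
 * carve out 64 MiB anywhere (base 0, no limit), one bitmap bit per page. */
static int __init mydev_cma_reserve(void)
{
        return cma_declare_contiguous(SZ_64M, 0, 0, 0, 0, &mydev_cma, false);
}

/* At runtime: grab and later return a 16-page buffer from the region. */
static struct page *mydev_alloc(void)
{
        return cma_alloc(mydev_cma, 16, 0);
}

static void mydev_free(struct page *pages)
{
        cma_release(mydev_cma, pages, 16);
}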


Re: [PATCH v2 07/10] PPC, KVM, CMA: use general CMA reserved area management framework

2014-06-15 Thread Joonsoo Kim
On Sat, Jun 14, 2014 at 02:23:59PM +0530, Aneesh Kumar K.V wrote:
 Joonsoo Kim iamjoonsoo@lge.com writes:
 
  Now, we have general CMA reserved area management framework,
  so use it for future maintainabilty. There is no functional change.
 
  Acked-by: Michal Nazarewicz min...@mina86.com
  Acked-by: Paolo Bonzini pbonz...@redhat.com
  Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
 
 Need this. We may want to keep the VM_BUG_ON by moving
 KVM_CMA_CHUNK_ORDER around.
 
 diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
 b/arch/powerpc/kvm/book3s_64_mmu_hv.c
 index 8056107..1932e0e 100644
 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
 +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
 @@ -37,8 +37,6 @@
  #include asm/ppc-opcode.h
  #include asm/cputable.h
  
 -#include book3s_hv_cma.h
 -
  /* POWER7 has 10-bit LPIDs, PPC970 has 6-bit LPIDs */
  #define MAX_LPID_970   63
  
 @@ -64,7 +62,6 @@ long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
 }
  
 kvm-arch.hpt_cma_alloc = 0;
 -   VM_BUG_ON(order  KVM_CMA_CHUNK_ORDER);
 page = kvm_alloc_hpt(1  (order - PAGE_SHIFT));
 if (page) {
 hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page));
 
 
 
 -aneesh

Okay.
So do you also want this?

@@ -131,16 +135,18 @@ struct page *kvm_alloc_hpt(unsigned long nr_pages)
 {
unsigned long align_pages = HPT_ALIGN_PAGES;
 
+   VM_BUG_ON(get_order(nr_pages)  KVM_CMA_CHUNK_ORDER - PAGE_SHIFT);
+
/* Old CPUs require HPT aligned on a multiple of its size */
if (!cpu_has_feature(CPU_FTR_ARCH_206))
align_pages = nr_pages;
-   return kvm_alloc_cma(nr_pages, align_pages);
+   return cma_alloc(kvm_cma, nr_pages, get_order(align_pages));
 }

Thanks.


[PATCH v3 -next 8/9] mm, CMA: change cma_declare_contiguous() to obey coding convention

2014-06-15 Thread Joonsoo Kim
Conventionally, we put the output parameter at the end of the parameter list
and put 'base' ahead of 'size', but cma_declare_contiguous() doesn't follow
this convention, so change it.

Additionally, move the cma_areas reference down to the position where it is
really needed.

v3: put 'base' ahead of 'size' (Minchan)

Acked-by: Michal Nazarewicz min...@mina86.com
Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
b/arch/powerpc/kvm/book3s_hv_builtin.c
index 3960e0b..6cf498a 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -185,8 +185,8 @@ void __init kvm_cma_reserve(void)
align_size = HPT_ALIGN_PAGES  PAGE_SHIFT;
 
align_size = max(kvm_rma_pages  PAGE_SHIFT, align_size);
-   cma_declare_contiguous(selected_size, 0, 0, align_size,
-   KVM_CMA_CHUNK_ORDER - PAGE_SHIFT, kvm_cma, false);
+   cma_declare_contiguous(0, selected_size, 0, align_size,
+   KVM_CMA_CHUNK_ORDER - PAGE_SHIFT, false, kvm_cma);
}
 }
 
diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
index 0411c1c..6606abd 100644
--- a/drivers/base/dma-contiguous.c
+++ b/drivers/base/dma-contiguous.c
@@ -165,7 +165,7 @@ int __init dma_contiguous_reserve_area(phys_addr_t size, 
phys_addr_t base,
 {
int ret;
 
-   ret = cma_declare_contiguous(size, base, limit, 0, 0, res_cma, fixed);
+   ret = cma_declare_contiguous(base, size, limit, 0, 0, fixed, res_cma);
if (ret)
return ret;
 
diff --git a/include/linux/cma.h b/include/linux/cma.h
index 69d3726..32cab7a 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -15,7 +15,7 @@ extern unsigned long cma_get_size(struct cma *cma);
 extern int __init cma_declare_contiguous(phys_addr_t size,
phys_addr_t base, phys_addr_t limit,
phys_addr_t alignment, unsigned int order_per_bit,
-   struct cma **res_cma, bool fixed);
+   bool fixed, struct cma **res_cma);
 extern struct page *cma_alloc(struct cma *cma, int count, unsigned int align);
 extern bool cma_release(struct cma *cma, struct page *pages, int count);
 #endif
diff --git a/mm/cma.c b/mm/cma.c
index b442a13..9961120 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -141,13 +141,13 @@ core_initcall(cma_init_reserved_areas);
 
 /**
  * cma_declare_contiguous() - reserve custom contiguous area
- * @size: Size of the reserved area (in bytes),
  * @base: Base address of the reserved area optional, use 0 for any
+ * @size: Size of the reserved area (in bytes),
  * @limit: End address of the reserved memory (optional, 0 for any).
  * @alignment: Alignment for the CMA area, should be power of 2 or zero
  * @order_per_bit: Order of pages represented by one bit on bitmap.
- * @res_cma: Pointer to store the created cma region.
  * @fixed: hint about where to place the reserved area
+ * @res_cma: Pointer to store the created cma region.
  *
  * This function reserves memory from early allocator. It should be
  * called by arch specific code once the early allocator (memblock or bootmem)
@@ -157,12 +157,12 @@ core_initcall(cma_init_reserved_areas);
  * If @fixed is true, reserve contiguous area at exactly @base.  If false,
  * reserve in range from @base to @limit.
  */
-int __init cma_declare_contiguous(phys_addr_t size,
-   phys_addr_t base, phys_addr_t limit,
+int __init cma_declare_contiguous(phys_addr_t base,
+   phys_addr_t size, phys_addr_t limit,
phys_addr_t alignment, unsigned int order_per_bit,
-   struct cma **res_cma, bool fixed)
+   bool fixed, struct cma **res_cma)
 {
-   struct cma *cma = cma_areas[cma_area_count];
+   struct cma *cma;
int ret = 0;
 
pr_debug(%s(size %lx, base %08lx, limit %08lx alignment %08lx)\n,
@@ -218,6 +218,7 @@ int __init cma_declare_contiguous(phys_addr_t size,
 * Each reserved area must be initialised later, when more kernel
 * subsystems (like slab allocator) are available.
 */
+   cma = cma_areas[cma_area_count];
cma-base_pfn = PFN_DOWN(base);
cma-count = size  PAGE_SHIFT;
cma-order_per_bit = order_per_bit;
-- 
1.7.9.5



[PATCH v3 -next 2/9] DMA, CMA: separate core CMA management codes from DMA APIs

2014-06-15 Thread Joonsoo Kim
To prepare for future generalization of the CMA area management code,
we need to separate the core CMA management code from the DMA APIs.
We will extend these core functions to cover the requirements of
PPC KVM's CMA area management functionality in the following patches.
This separation lets us extend the core functions without touching
the DMA APIs.

v3: move descriptions to the exported APIs (Minchan)
    pass aligned base and size to dma_contiguous_early_fixup() (Minchan)

Acked-by: Michal Nazarewicz min...@mina86.com
Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
index 6467c91..9021762 100644
--- a/drivers/base/dma-contiguous.c
+++ b/drivers/base/dma-contiguous.c
@@ -213,26 +213,9 @@ static int __init cma_init_reserved_areas(void)
 }
 core_initcall(cma_init_reserved_areas);
 
-/**
- * dma_contiguous_reserve_area() - reserve custom contiguous area
- * @size: Size of the reserved area (in bytes),
- * @base: Base address of the reserved area optional, use 0 for any
- * @limit: End address of the reserved memory (optional, 0 for any).
- * @res_cma: Pointer to store the created cma region.
- * @fixed: hint about where to place the reserved area
- *
- * This function reserves memory from early allocator. It should be
- * called by arch specific code once the early allocator (memblock or bootmem)
- * has been activated and all other subsystems have already allocated/reserved
- * memory. This function allows to create custom reserved areas for specific
- * devices.
- *
- * If @fixed is true, reserve contiguous area at exactly @base.  If false,
- * reserve in range from @base to @limit.
- */
-int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
-  phys_addr_t limit, struct cma **res_cma,
-  bool fixed)
+static int __init __dma_contiguous_reserve_area(phys_addr_t size,
+   phys_addr_t base, phys_addr_t limit,
+   struct cma **res_cma, bool fixed)
 {
struct cma *cma = cma_areas[cma_area_count];
phys_addr_t alignment;
@@ -286,15 +269,47 @@ int __init dma_contiguous_reserve_area(phys_addr_t size, 
phys_addr_t base,
 
pr_info(CMA: reserved %ld MiB at %08lx\n, (unsigned long)size / SZ_1M,
(unsigned long)base);
-
-   /* Architecture specific contiguous memory fixup. */
-   dma_contiguous_early_fixup(base, size);
return 0;
+
 err:
pr_err(CMA: failed to reserve %ld MiB\n, (unsigned long)size / SZ_1M);
return ret;
 }
 
+/**
+ * dma_contiguous_reserve_area() - reserve custom contiguous area
+ * @size: Size of the reserved area (in bytes),
+ * @base: Base address of the reserved area optional, use 0 for any
+ * @limit: End address of the reserved memory (optional, 0 for any).
+ * @res_cma: Pointer to store the created cma region.
+ * @fixed: hint about where to place the reserved area
+ *
+ * This function reserves memory from early allocator. It should be
+ * called by arch specific code once the early allocator (memblock or bootmem)
+ * has been activated and all other subsystems have already allocated/reserved
+ * memory. This function allows to create custom reserved areas for specific
+ * devices.
+ *
+ * If @fixed is true, reserve contiguous area at exactly @base.  If false,
+ * reserve in range from @base to @limit.
+ */
+int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
+  phys_addr_t limit, struct cma **res_cma,
+  bool fixed)
+{
+   int ret;
+
+   ret = __dma_contiguous_reserve_area(size, base, limit, res_cma, fixed);
+   if (ret)
+   return ret;
+
+   /* Architecture specific contiguous memory fixup. */
+   dma_contiguous_early_fixup(PFN_PHYS((*res_cma)-base_pfn),
+   (*res_cma)-count  PAGE_SHIFT);
+
+   return 0;
+}
+
 static void clear_cma_bitmap(struct cma *cma, unsigned long pfn, int count)
 {
mutex_lock(cma-lock);
@@ -302,31 +317,16 @@ static void clear_cma_bitmap(struct cma *cma, unsigned 
long pfn, int count)
mutex_unlock(cma-lock);
 }
 
-/**
- * dma_alloc_from_contiguous() - allocate pages from contiguous area
- * @dev:   Pointer to device for which the allocation is performed.
- * @count: Requested number of pages.
- * @align: Requested alignment of pages (in PAGE_SIZE order).
- *
- * This function allocates memory buffer for specified device. It uses
- * device specific contiguous memory area if available or the default
- * global one. Requires architecture specific dev_get_cma_area() helper
- * function.
- */
-struct page *dma_alloc_from_contiguous(struct device *dev, int count,
+static struct page *__dma_alloc_from_contiguous(struct cma *cma, int count,
   

[PATCH v3 -next 6/9] PPC, KVM, CMA: use general CMA reserved area management framework

2014-06-15 Thread Joonsoo Kim
Now that we have a general CMA reserved area management framework,
use it for future maintainability. There is no functional change.

v3: add zeroing to CMA region (Aneesh)
fix compile error (Aneesh)
move VM_BUG_ON() to kvm_alloc_hpt() in book3s_hv_builtin.c (Aneesh)

Acked-by: Michal Nazarewicz min...@mina86.com
Acked-by: Paolo Bonzini pbonz...@redhat.com
Tested-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 8056107..a41e625 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -37,8 +37,6 @@
 #include asm/ppc-opcode.h
 #include asm/cputable.h
 
-#include book3s_hv_cma.h
-
 /* POWER7 has 10-bit LPIDs, PPC970 has 6-bit LPIDs */
 #define MAX_LPID_970   63
 
@@ -64,10 +62,10 @@ long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
}
 
kvm-arch.hpt_cma_alloc = 0;
-   VM_BUG_ON(order  KVM_CMA_CHUNK_ORDER);
page = kvm_alloc_hpt(1  (order - PAGE_SHIFT));
if (page) {
hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page));
+   memset((void *)hpt, 0, (1  order));
kvm-arch.hpt_cma_alloc = 1;
}
 
diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
b/arch/powerpc/kvm/book3s_hv_builtin.c
index 7cde8a6..3960e0b 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -16,12 +16,14 @@
 #include linux/init.h
 #include linux/memblock.h
 #include linux/sizes.h
+#include linux/cma.h
 
 #include asm/cputable.h
 #include asm/kvm_ppc.h
 #include asm/kvm_book3s.h
 
-#include book3s_hv_cma.h
+#define KVM_CMA_CHUNK_ORDER18
+
 /*
  * Hash page table alignment on newer cpus(CPU_FTR_ARCH_206)
  * should be power of 2.
@@ -43,6 +45,8 @@ static unsigned long kvm_cma_resv_ratio = 5;
 unsigned long kvm_rma_pages = (1  27)  PAGE_SHIFT; /* 128MB */
 EXPORT_SYMBOL_GPL(kvm_rma_pages);
 
+static struct cma *kvm_cma;
+
 /* Work out RMLS (real mode limit selector) field value for a given RMA size.
Assumes POWER7 or PPC970. */
 static inline int lpcr_rmls(unsigned long rma_size)
@@ -97,7 +101,7 @@ struct kvm_rma_info *kvm_alloc_rma()
ri = kmalloc(sizeof(struct kvm_rma_info), GFP_KERNEL);
if (!ri)
return NULL;
-   page = kvm_alloc_cma(kvm_rma_pages, kvm_rma_pages);
+   page = cma_alloc(kvm_cma, kvm_rma_pages, get_order(kvm_rma_pages));
if (!page)
goto err_out;
atomic_set(ri-use_count, 1);
@@ -112,7 +116,7 @@ EXPORT_SYMBOL_GPL(kvm_alloc_rma);
 void kvm_release_rma(struct kvm_rma_info *ri)
 {
if (atomic_dec_and_test(ri-use_count)) {
-   kvm_release_cma(pfn_to_page(ri-base_pfn), kvm_rma_pages);
+   cma_release(kvm_cma, pfn_to_page(ri-base_pfn), kvm_rma_pages);
kfree(ri);
}
 }
@@ -131,16 +135,18 @@ struct page *kvm_alloc_hpt(unsigned long nr_pages)
 {
unsigned long align_pages = HPT_ALIGN_PAGES;
 
+   VM_BUG_ON(get_order(nr_pages)  KVM_CMA_CHUNK_ORDER - PAGE_SHIFT);
+
/* Old CPUs require HPT aligned on a multiple of its size */
if (!cpu_has_feature(CPU_FTR_ARCH_206))
align_pages = nr_pages;
-   return kvm_alloc_cma(nr_pages, align_pages);
+   return cma_alloc(kvm_cma, nr_pages, get_order(align_pages));
 }
 EXPORT_SYMBOL_GPL(kvm_alloc_hpt);
 
 void kvm_release_hpt(struct page *page, unsigned long nr_pages)
 {
-   kvm_release_cma(page, nr_pages);
+   cma_release(kvm_cma, page, nr_pages);
 }
 EXPORT_SYMBOL_GPL(kvm_release_hpt);
 
@@ -179,7 +185,8 @@ void __init kvm_cma_reserve(void)
align_size = HPT_ALIGN_PAGES  PAGE_SHIFT;
 
align_size = max(kvm_rma_pages  PAGE_SHIFT, align_size);
-   kvm_cma_declare_contiguous(selected_size, align_size);
+   cma_declare_contiguous(selected_size, 0, 0, align_size,
+   KVM_CMA_CHUNK_ORDER - PAGE_SHIFT, kvm_cma, false);
}
 }
 
diff --git a/arch/powerpc/kvm/book3s_hv_cma.c b/arch/powerpc/kvm/book3s_hv_cma.c
deleted file mode 100644
index d9d3d85..000
--- a/arch/powerpc/kvm/book3s_hv_cma.c
+++ /dev/null
@@ -1,240 +0,0 @@
-/*
- * Contiguous Memory Allocator for ppc KVM hash pagetable  based on CMA
- * for DMA mapping framework
- *
- * Copyright IBM Corporation, 2013
- * Author Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License as
- * published by the Free Software Foundation; either version 2 of the
- * License or (at your optional) any later version of the license.
- *
- */
-#define pr_fmt(fmt) kvm_cma:  fmt
-
-#ifdef CONFIG_CMA_DEBUG
-#ifndef DEBUG
-#  define DEBUG
-#endif
-#endif
-
-#include linux/memblock.h
-#include linux/mutex.h
-#include linux/sizes.h
-#include linux/slab.h
-

[PATCH v3 -next 7/9] mm, CMA: clean-up CMA allocation error path

2014-06-15 Thread Joonsoo Kim
We can remove one call site of cma_clear_bitmap() if we call it
before checking the error code.

Acked-by: Minchan Kim minc...@kernel.org
Reviewed-by: Michal Nazarewicz min...@mina86.com
Reviewed-by: Zhang Yanfei zhangyan...@cn.fujitsu.com
Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/cma.c b/mm/cma.c
index 0cf50da..b442a13 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -285,11 +285,12 @@ struct page *cma_alloc(struct cma *cma, int count, 
unsigned int align)
if (ret == 0) {
page = pfn_to_page(pfn);
break;
-   } else if (ret != -EBUSY) {
-   cma_clear_bitmap(cma, pfn, count);
-   break;
}
+
cma_clear_bitmap(cma, pfn, count);
+   if (ret != -EBUSY)
+   break;
+
pr_debug(%s(): memory range at %p is busy, retrying\n,
 __func__, pfn_to_page(pfn));
/* try again with a bit different memory target */
-- 
1.7.9.5



[PATCH v3 -next 5/9] CMA: generalize CMA reserved area management functionality

2014-06-15 Thread Joonsoo Kim
Currently, there are two users of the CMA functionality: one is the DMA
subsystem and the other is KVM on powerpc. They have their own code to
manage the CMA reserved area even though the two look really similar.
My guess is that this was caused by different needs in bitmap management:
the KVM side wants to maintain a bitmap where one bit covers more than a
single page; eventually it uses a bitmap where one bit represents 64 pages.

When I implemented CMA-related patches, I had to change both places to
apply my change, which was painful. I want to change this situation and
reduce future code management overhead through this patch.

This change could also help developers who want to use CMA in their
new feature development, since they can use CMA easily without
copying & pasting this reserved area management code.

In previous patches, we prepared some features to generalize CMA reserved
area management, and now it's time to do it. This patch moves the core
functions to mm/cma.c and changes the DMA APIs to use them.

There is no functional change in the DMA APIs.

v2: There is no big change from v1 in mm/cma.c. Mostly renaming.
v3: remove log2.h in dma-contiguous.c (Minchan)
add some accessor functions to pass aligned base and size to
dma_contiguous_early_fixup() function
move MAX_CMA_AREAS to cma.h

Acked-by: Michal Nazarewicz min...@mina86.com
Acked-by: Zhang Yanfei zhangyan...@cn.fujitsu.com
Acked-by: Minchan Kim minc...@kernel.org
Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 4c88935..3116880 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -26,6 +26,7 @@
 #include linux/io.h
 #include linux/vmalloc.h
 #include linux/sizes.h
+#include linux/cma.h
 
 #include asm/memory.h
 #include asm/highmem.h
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 00e13ce..4eac559 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -283,16 +283,6 @@ config CMA_ALIGNMENT
 
  If unsure, leave the default value 8.
 
-config CMA_AREAS
-   int Maximum count of the CMA device-private areas
-   default 7
-   help
- CMA allows to create CMA areas for particular devices. This parameter
- sets the maximum number of such device private CMA areas in the
- system.
-
- If unsure, leave the default value 7.
-
 endif
 
 endmenu
diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
index c6eeb2c..0411c1c 100644
--- a/drivers/base/dma-contiguous.c
+++ b/drivers/base/dma-contiguous.c
@@ -24,25 +24,9 @@
 
 #include linux/memblock.h
 #include linux/err.h
-#include linux/mm.h
-#include linux/mutex.h
-#include linux/page-isolation.h
 #include linux/sizes.h
-#include linux/slab.h
-#include linux/swap.h
-#include linux/mm_types.h
 #include linux/dma-contiguous.h
-#include linux/log2.h
-
-struct cma {
-   unsigned long   base_pfn;
-   unsigned long   count;
-   unsigned long   *bitmap;
-   unsigned int order_per_bit; /* Order of pages represented by one bit */
-   struct mutexlock;
-};
-
-struct cma *dma_contiguous_default_area;
+#include linux/cma.h
 
 #ifdef CONFIG_CMA_SIZE_MBYTES
 #define CMA_SIZE_MBYTES CONFIG_CMA_SIZE_MBYTES
@@ -50,6 +34,8 @@ struct cma *dma_contiguous_default_area;
 #define CMA_SIZE_MBYTES 0
 #endif
 
+struct cma *dma_contiguous_default_area;
+
 /*
  * Default global CMA area size can be defined in kernel's .config.
  * This is useful mainly for distro maintainers to create a kernel
@@ -156,169 +142,6 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
}
 }
 
-static DEFINE_MUTEX(cma_mutex);
-
-static unsigned long cma_bitmap_aligned_mask(struct cma *cma, int align_order)
-{
-   return (1  (align_order  cma-order_per_bit)) - 1;
-}
-
-static unsigned long cma_bitmap_maxno(struct cma *cma)
-{
-   return cma-count  cma-order_per_bit;
-}
-
-static unsigned long cma_bitmap_pages_to_bits(struct cma *cma,
-   unsigned long pages)
-{
-   return ALIGN(pages, 1  cma-order_per_bit)  cma-order_per_bit;
-}
-
-static void cma_clear_bitmap(struct cma *cma, unsigned long pfn, int count)
-{
-   unsigned long bitmap_no, bitmap_count;
-
-   bitmap_no = (pfn - cma-base_pfn)  cma-order_per_bit;
-   bitmap_count = cma_bitmap_pages_to_bits(cma, count);
-
-   mutex_lock(cma-lock);
-   bitmap_clear(cma-bitmap, bitmap_no, bitmap_count);
-   mutex_unlock(cma-lock);
-}
-
-static int __init cma_activate_area(struct cma *cma)
-{
-   int bitmap_size = BITS_TO_LONGS(cma_bitmap_maxno(cma)) * sizeof(long);
-   unsigned long base_pfn = cma-base_pfn, pfn = base_pfn;
-   unsigned i = cma-count  pageblock_order;
-   struct zone *zone;
-
-   cma-bitmap = kzalloc(bitmap_size, GFP_KERNEL);
-
-   if (!cma-bitmap)
-   return -ENOMEM;
-
-   

[PATCH v3 -next 1/9] DMA, CMA: fix possible memory leak

2014-06-15 Thread Joonsoo Kim
We should free the bitmap memory when we find a zone mismatch,
otherwise this memory will leak.

Additionally, I copied the code comment from PPC KVM's CMA code to explain
why we need to check for a zone mismatch.

* Note
Minchan suggested adding a tag for stable, but I don't do it, because
I found this possibility during code review and, IMO, this patch isn't
suitable for the stable tree.

Acked-by: Zhang Yanfei zhangyan...@cn.fujitsu.com
Reviewed-by: Michal Nazarewicz min...@mina86.com
Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
index 83969f8..6467c91 100644
--- a/drivers/base/dma-contiguous.c
+++ b/drivers/base/dma-contiguous.c
@@ -176,14 +176,24 @@ static int __init cma_activate_area(struct cma *cma)
base_pfn = pfn;
for (j = pageblock_nr_pages; j; --j, pfn++) {
WARN_ON_ONCE(!pfn_valid(pfn));
+   /*
+* alloc_contig_range requires the pfn range
+* specified to be in the same zone. Make this
+* simple by forcing the entire CMA resv range
+* to be in the same zone.
+*/
if (page_zone(pfn_to_page(pfn)) != zone)
-   return -EINVAL;
+   goto err;
}
init_cma_reserved_pageblock(pfn_to_page(base_pfn));
} while (--i);
 
mutex_init(cma-lock);
return 0;
+
+err:
+   kfree(cma-bitmap);
+   return -EINVAL;
 }
 
 static struct cma cma_areas[MAX_CMA_AREAS];
-- 
1.7.9.5



[PATCH v3 -next 0/9] CMA: generalize CMA reserved area management code

2014-06-15 Thread Joonsoo Kim
Currently, there are two users of the CMA functionality: one is the DMA
subsystem and the other is KVM on powerpc. They have their own code to
manage the CMA reserved area even though the two look really similar.
My guess is that this was caused by different needs in bitmap management:
the KVM side wants to maintain a bitmap where one bit covers more than a
single page; eventually it uses a bitmap where one bit represents 64 pages.

When I implemented CMA-related patches, I had to change both places to
apply my change, which was painful. I want to change this situation and
reduce future code management overhead through this patch.

This change could also help developers who want to use CMA in their
new feature development, since they can use CMA easily without
copying & pasting this reserved area management code.
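
For a would-be third user, the end result of the series boils down to three
calls. A minimal sketch, assuming a hypothetical driver (all my_* names are
made up; the parameter order of cma_declare_contiguous() follows patch 8):

#include <linux/cma.h>
#include <linux/init.h>
#include <linux/mm.h>

static struct cma *my_cma;

/* Called from early arch code, while memblock is still active. */
static int __init my_drv_reserve(phys_addr_t size)
{
        /* base 0, limit 0 (anywhere), default alignment, 1 page per bit */
        return cma_declare_contiguous(0, size, 0, 0, 0, false, &my_cma);
}

/* Later, at driver runtime. */
static struct page *my_drv_alloc(int nr_pages)
{
        return cma_alloc(my_cma, nr_pages, 0);
}

static void my_drv_free(struct page *pages, int nr_pages)
{
        cma_release(my_cma, pages, nr_pages);
}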

v3:
  - Simplify old patch 1 (log format fix) and move it to the end of the patchset.
  - Patch 2: Pass aligned base and size to dma_contiguous_early_fixup()
  - Patch 5: Add some accessor functions to pass aligned base and size to
  dma_contiguous_early_fixup() function
  - Patch 5: Move MAX_CMA_AREAS definition to cma.h
  - Patch 6: Add CMA region zeroing to PPC KVM's CMA alloc function
  - Patch 8: put 'base' ahead of 'size' in cma_declare_contiguous()
  - Remaining minor fixes are noted in commit description of each one

v2:
  - Although this patchset looks very different from v1, the end result,
  that is, mm/cma.c, is the same as v1's. So I carry the Acks to patches 6-7.

This patchset is based on linux-next 20140610.

Patches 1-4 prepare some features to cover PPC KVM's requirements.
Patches 5-6 generalize the CMA reserved area management code and change
users to use it.
Patches 7-9 clean up minor things.

Joonsoo Kim (9):
  DMA, CMA: fix possible memory leak
  DMA, CMA: separate core CMA management codes from DMA APIs
  DMA, CMA: support alignment constraint on CMA region
  DMA, CMA: support arbitrary bitmap granularity
  CMA: generalize CMA reserved area management functionality
  PPC, KVM, CMA: use general CMA reserved area management framework
  mm, CMA: clean-up CMA allocation error path
  mm, CMA: change cma_declare_contiguous() to obey coding convention
  mm, CMA: clean-up log message

 arch/arm/mm/dma-mapping.c|1 +
 arch/powerpc/kvm/book3s_64_mmu_hv.c  |4 +-
 arch/powerpc/kvm/book3s_hv_builtin.c |   19 +-
 arch/powerpc/kvm/book3s_hv_cma.c |  240 
 arch/powerpc/kvm/book3s_hv_cma.h |   27 ---
 drivers/base/Kconfig |   10 -
 drivers/base/dma-contiguous.c|  210 ++---
 include/linux/cma.h  |   21 +++
 include/linux/dma-contiguous.h   |   11 +-
 mm/Kconfig   |   11 ++
 mm/Makefile  |1 +
 mm/cma.c |  335 ++
 12 files changed, 397 insertions(+), 493 deletions(-)
 delete mode 100644 arch/powerpc/kvm/book3s_hv_cma.c
 delete mode 100644 arch/powerpc/kvm/book3s_hv_cma.h
 create mode 100644 include/linux/cma.h
 create mode 100644 mm/cma.c

-- 
1.7.9.5



[PATCH v3 -next 9/9] mm, CMA: clean-up log message

2014-06-15 Thread Joonsoo Kim
We don't need an explicit 'CMA:' prefix, since we already define the
'cma:' prefix via pr_fmt. So remove it.
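
For reference, a minimal sketch of the mechanism: pr_fmt() is prepended to
every pr_*() format string in the file, so with the 'cma:' definition already
in mm/cma.c the old messages came out double-prefixed. The example() function
below is made up purely for illustration:

#define pr_fmt(fmt) "cma: " fmt         /* already present in mm/cma.c */

#include <linux/printk.h>

static void example(unsigned long size_mib, unsigned long base)
{
        /* Before this patch: "cma: CMA: reserved 16 MiB at 80000000"
         * After this patch:  "cma: Reserved 16 MiB at 80000000"      */
        pr_info("Reserved %ld MiB at %08lx\n", size_mib, base);
}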

Acked-by: Michal Nazarewicz min...@mina86.com
Reviewed-by: Zhang Yanfei zhangyan...@cn.fujitsu.com
Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/cma.c b/mm/cma.c
index 9961120..4b251b0 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -225,12 +225,12 @@ int __init cma_declare_contiguous(phys_addr_t base,
*res_cma = cma;
cma_area_count++;
 
-   pr_info(CMA: reserved %ld MiB at %08lx\n, (unsigned long)size / SZ_1M,
+   pr_info(Reserved %ld MiB at %08lx\n, (unsigned long)size / SZ_1M,
(unsigned long)base);
return 0;
 
 err:
-   pr_err(CMA: failed to reserve %ld MiB\n, (unsigned long)size / SZ_1M);
+   pr_err(Failed to reserve %ld MiB\n, (unsigned long)size / SZ_1M);
return ret;
 }
 
-- 
1.7.9.5



[PATCH v3 -next 4/9] DMA, CMA: support arbitrary bitmap granularity

2014-06-15 Thread Joonsoo Kim
PPC KVM's CMA area management requires arbitrary bitmap granularity,
since it wants to reserve a very large memory region and manage it with
a bitmap in which one bit covers several pages, to reduce management
overhead. So support arbitrary bitmap granularity for the following
generalization.
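
To make the granularity concrete, here is a small userspace model of the
pages-to-bits arithmetic that the cma_bitmap_pages_to_bits() helper below
implements (illustrative numbers only; 64 pages per bit corresponds to
order_per_bit = 6):

#include <stdio.h>

/* Userspace model of cma_bitmap_pages_to_bits(): round the page count up
 * to a multiple of 2^order_per_bit, then count chunks of that size. */
static unsigned long pages_to_bits(unsigned long pages, unsigned int order_per_bit)
{
        unsigned long chunk = 1UL << order_per_bit;

        return (pages + chunk - 1) >> order_per_bit;
}

int main(void)
{
        /* 1 GiB of 4K pages, one bit per 64 pages: only 4096 bits needed */
        printf("%lu\n", pages_to_bits(262144, 6));      /* -> 4096 */
        /* a partial chunk still needs a full bit */
        printf("%lu\n", pages_to_bits(100, 6));         /* -> 2 */
        return 0;
}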

v3: use consistent local variable name (Minchan)
use unsigned int for order_per_bit (Michal)
change clear_cma_bitmap to cma_clear_bitmap for consistency (Michal)
remove un-needed local variable, bitmap_maxno (Michal)

Acked-by: Michal Nazarewicz min...@mina86.com
Acked-by: Zhang Yanfei zhangyan...@cn.fujitsu.com
Acked-by: Minchan Kim minc...@kernel.org
Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
index 5f62c28..c6eeb2c 100644
--- a/drivers/base/dma-contiguous.c
+++ b/drivers/base/dma-contiguous.c
@@ -38,6 +38,7 @@ struct cma {
unsigned long   base_pfn;
unsigned long   count;
unsigned long   *bitmap;
+   unsigned int order_per_bit; /* Order of pages represented by one bit */
struct mutexlock;
 };
 
@@ -157,9 +158,37 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
 
 static DEFINE_MUTEX(cma_mutex);
 
+static unsigned long cma_bitmap_aligned_mask(struct cma *cma, int align_order)
+{
+   return (1  (align_order  cma-order_per_bit)) - 1;
+}
+
+static unsigned long cma_bitmap_maxno(struct cma *cma)
+{
+   return cma-count  cma-order_per_bit;
+}
+
+static unsigned long cma_bitmap_pages_to_bits(struct cma *cma,
+   unsigned long pages)
+{
+   return ALIGN(pages, 1  cma-order_per_bit)  cma-order_per_bit;
+}
+
+static void cma_clear_bitmap(struct cma *cma, unsigned long pfn, int count)
+{
+   unsigned long bitmap_no, bitmap_count;
+
+   bitmap_no = (pfn - cma-base_pfn)  cma-order_per_bit;
+   bitmap_count = cma_bitmap_pages_to_bits(cma, count);
+
+   mutex_lock(cma-lock);
+   bitmap_clear(cma-bitmap, bitmap_no, bitmap_count);
+   mutex_unlock(cma-lock);
+}
+
 static int __init cma_activate_area(struct cma *cma)
 {
-   int bitmap_size = BITS_TO_LONGS(cma-count) * sizeof(long);
+   int bitmap_size = BITS_TO_LONGS(cma_bitmap_maxno(cma)) * sizeof(long);
unsigned long base_pfn = cma-base_pfn, pfn = base_pfn;
unsigned i = cma-count  pageblock_order;
struct zone *zone;
@@ -215,9 +244,9 @@ static int __init cma_init_reserved_areas(void)
 core_initcall(cma_init_reserved_areas);
 
 static int __init __dma_contiguous_reserve_area(phys_addr_t size,
-   phys_addr_t base, phys_addr_t limit,
-   phys_addr_t alignment,
-   struct cma **res_cma, bool fixed)
+   phys_addr_t base, phys_addr_t limit,
+   phys_addr_t alignment, unsigned int order_per_bit,
+   struct cma **res_cma, bool fixed)
 {
struct cma *cma = cma_areas[cma_area_count];
int ret = 0;
@@ -249,6 +278,10 @@ static int __init 
__dma_contiguous_reserve_area(phys_addr_t size,
size = ALIGN(size, alignment);
limit = ~(alignment - 1);
 
+   /* size should be aligned with order_per_bit */
+   if (!IS_ALIGNED(size  PAGE_SHIFT, 1  order_per_bit))
+   return -EINVAL;
+
/* Reserve memory */
if (base  fixed) {
if (memblock_is_region_reserved(base, size) ||
@@ -273,6 +306,7 @@ static int __init __dma_contiguous_reserve_area(phys_addr_t 
size,
 */
cma-base_pfn = PFN_DOWN(base);
cma-count = size  PAGE_SHIFT;
+   cma-order_per_bit = order_per_bit;
*res_cma = cma;
cma_area_count++;
 
@@ -308,7 +342,7 @@ int __init dma_contiguous_reserve_area(phys_addr_t size, 
phys_addr_t base,
 {
int ret;
 
-   ret = __dma_contiguous_reserve_area(size, base, limit, 0,
+   ret = __dma_contiguous_reserve_area(size, base, limit, 0, 0,
res_cma, fixed);
if (ret)
return ret;
@@ -320,17 +354,11 @@ int __init dma_contiguous_reserve_area(phys_addr_t size, 
phys_addr_t base,
return 0;
 }
 
-static void clear_cma_bitmap(struct cma *cma, unsigned long pfn, int count)
-{
-   mutex_lock(cma-lock);
-   bitmap_clear(cma-bitmap, pfn - cma-base_pfn, count);
-   mutex_unlock(cma-lock);
-}
-
 static struct page *__dma_alloc_from_contiguous(struct cma *cma, int count,
   unsigned int align)
 {
-   unsigned long mask, pfn, pageno, start = 0;
+   unsigned long mask, pfn, start = 0;
+   unsigned long bitmap_maxno, bitmap_no, bitmap_count;
struct page *page = NULL;
int ret;
 
@@ -343,18 +371,19 @@ static struct page *__dma_alloc_from_contiguous(struct 
cma *cma, int count,
if (!count)

[PATCH V2] KVM: PPC: BOOK3S: HV: Use base page size when comparing against slb value

2014-06-15 Thread Aneesh Kumar K.V
With guests supporting Multiple Page Size per Segment (MPSS),
hpte_page_size() returns the actual page size used. Add a new function
that returns the base page size and use it to compare against the page
size calculated from the SLB. Without this patch an HPTE lookup can fail,
since we are comparing the wrong page size in kvmppc_hv_find_lock_hpte().
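
A tiny userspace model of the mismatch being fixed, with illustrative sizes
only (assuming an MPSS segment with a 4K base page size backed by 64K actual
pages):

#include <stdio.h>

int main(void)
{
        /* The SLB only encodes the segment's *base* page size. */
        unsigned long slb_size         = 1UL << 12;     /* 4K  */
        /* Under MPSS the HPTE can use a larger *actual* page size. */
        unsigned long hpte_actual_size = 1UL << 16;     /* 64K */
        unsigned long hpte_base_size   = 1UL << 12;     /* 4K  */

        /* Old check (hpte_page_size): actual vs SLB -> spurious miss. */
        printf("actual == slb: %d\n", hpte_actual_size == slb_size);
        /* New check (hpte_base_page_size): base vs SLB -> match. */
        printf("base   == slb: %d\n", hpte_base_size == slb_size);
        return 0;
}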

Signed-off-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
---
Changes from V1:
* Remove obsolete comment from the code
* Update commit message

 arch/powerpc/include/asm/kvm_book3s_64.h | 19 +--
 arch/powerpc/kvm/book3s_64_mmu_hv.c  |  2 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c  |  7 ++-
 3 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 34422be566ce..3d0f3fb9c6b6 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -202,8 +202,10 @@ static inline unsigned long compute_tlbie_rb(unsigned long 
v, unsigned long r,
return rb;
 }
 
-static inline unsigned long hpte_page_size(unsigned long h, unsigned long l)
+static inline unsigned long __hpte_page_size(unsigned long h, unsigned long l,
+bool is_base_size)
 {
+
int size, a_psize;
/* Look at the 8 bit LP value */
unsigned int lp = (l  LP_SHIFT)  ((1  LP_BITS) - 1);
@@ -218,14 +220,27 @@ static inline unsigned long hpte_page_size(unsigned long 
h, unsigned long l)
continue;
 
a_psize = __hpte_actual_psize(lp, size);
-   if (a_psize != -1)
+   if (a_psize != -1) {
+   if (is_base_size)
+   return 1ul  
mmu_psize_defs[size].shift;
return 1ul  mmu_psize_defs[a_psize].shift;
+   }
}
 
}
return 0;
 }
 
+static inline unsigned long hpte_page_size(unsigned long h, unsigned long l)
+{
+   return __hpte_page_size(h, l, 0);
+}
+
+static inline unsigned long hpte_base_page_size(unsigned long h, unsigned long 
l)
+{
+   return __hpte_page_size(h, l, 1);
+}
+
 static inline unsigned long hpte_rpn(unsigned long ptel, unsigned long psize)
 {
return ((ptel  HPTE_R_RPN)  ~(psize - 1))  PAGE_SHIFT;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index f53cf2eae36a..7ff45ed27c65 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -1567,7 +1567,7 @@ static ssize_t kvm_htab_write(struct file *file, const 
char __user *buf,
goto out;
}
if (!rma_setup  is_vrma_hpte(v)) {
-   unsigned long psize = hpte_page_size(v, r);
+   unsigned long psize = hpte_base_page_size(v, r);
unsigned long senc = slb_pgsize_encoding(psize);
unsigned long lpcr;
 
diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c 
b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
index 87624ab5ba82..d86356bfc970 100644
--- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c
+++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c
@@ -833,13 +833,10 @@ long kvmppc_hv_find_lock_hpte(struct kvm *kvm, gva_t 
eaddr, unsigned long slb_v,
r = be64_to_cpu(hpte[i+1]);
 
/*
-* Check the HPTE again, including large page size
-* Since we don't currently allow any MPSS (mixed
-* page-size segment) page sizes, it is sufficient
-* to check against the actual page size.
+* Check the HPTE again, including base page size
 */
if ((v  valid)  (v  mask) == val 
-   hpte_page_size(v, r) == (1ul  pshift))
+   hpte_base_page_size(v, r) == (1ul  pshift))
/* Return with the HPTE still locked */
return (hash  3) + (i  1);
 
-- 
1.9.1



Re: [PATCH 0/6] KVM: PPC: Book3S HV: Enable on little endian hosts

2014-06-15 Thread Aneesh Kumar K.V
Alexander Graf ag...@suse.de writes:

 So far we've been able to successfully run HV KVM on big endian hosts, but
 once you dive into little endian land things start to fall apart.

 This patch set enables HV KVM for little endian hosts. This should be the
 final piece left missing to get little endian systems fully on par with big
 endian ones in the KVM world.

Can we also use types like __be64 where we need them? That would also
let tools like sparse catch endianness errors.
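
For illustration, a kernel-style sketch of the kind of annotation being
suggested; the structure and field names here are hypothetical:

#include <linux/types.h>
#include <asm/byteorder.h>

/* Hypothetical guest-visible structure stored in big-endian format. */
struct hv_example {
        __be64 hpte_v;
};

static unsigned long read_hpte_v(struct hv_example *p)
{
        /* OK: explicit conversion from big endian to CPU endianness. */
        return be64_to_cpu(p->hpte_v);

        /* A plain "return p->hpte_v;" would make sparse warn about a
         * restricted __be64 degrading to integer, catching the bug on
         * little endian hosts. */
}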

-aneesh



Re: [PATCH v2 01/10] DMA, CMA: clean-up log message

2014-06-15 Thread Joonsoo Kim
On Thu, Jun 12, 2014 at 11:53:16AM +0200, Michal Nazarewicz wrote:
 On Thu, Jun 12 2014, Michal Nazarewicz min...@mina86.com wrote:
  I used “function(arg1, arg2, …)” at the *beginning* of functions when
  the arguments passed to the function were included in the message.  In
  all other cases I left it at just “function:” (or just no additional
  prefix).  IMO that's a reasonable strategy.
 
 At closer inspection, I realised drivers/base/dma-contiguous.c is
 Marek's code, but the above I think is still reasonable thing to do, so
 I'd rather standardise on having “function(…)” only at the beginning of
 a function.  Just my 0.02 CHF.

Hello,

Now I realize that these changes aren't needed in this patchset, so I
simplified this patch to just remove the redundant 'CMA:' prefix. Other
things can be done after merging if we need them.

Thanks.


Re: [PATCH v2 04/10] DMA, CMA: support alignment constraint on cma region

2014-06-15 Thread Joonsoo Kim
On Thu, Jun 12, 2014 at 12:02:38PM +0200, Michal Nazarewicz wrote:
 On Thu, Jun 12 2014, Joonsoo Kim iamjoonsoo@lge.com wrote:
  ppc kvm's cma area management needs alignment constraint on
 
 I've noticed it earlier and cannot seem to get to terms with this.  It
 should IMO be PPC, KVM and CMA since those are acronyms.  But if you
 have strong feelings, it's not a big issue.

Yes, I will fix it.

 
  cma region. So support it to prepare generalization of cma area
  management functionality.
 
  Additionally, add some comments which tell us why alignment
  constraint is needed on cma region.
 
  Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
 
 Acked-by: Michal Nazarewicz min...@mina86.com
 
  diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
  index 8a44c82..bc4c171 100644
  --- a/drivers/base/dma-contiguous.c
  +++ b/drivers/base/dma-contiguous.c
  @@ -219,6 +220,7 @@ core_initcall(cma_init_reserved_areas);
* @size: Size of the reserved area (in bytes),
* @base: Base address of the reserved area optional, use 0 for any
* @limit: End address of the reserved memory (optional, 0 for any).
  + * @alignment: Alignment for the contiguous memory area, should be
  power of 2
 
 “must be power of 2 or zero”.

Okay.

* @res_cma: Pointer to store the created cma region.
* @fixed: hint about where to place the reserved area
*
  @@ -233,15 +235,15 @@ core_initcall(cma_init_reserved_areas);
*/
   static int __init __dma_contiguous_reserve_area(phys_addr_t size,
  phys_addr_t base, phys_addr_t limit,
  +   phys_addr_t alignment,
  struct cma **res_cma, bool fixed)
   {
  struct cma *cma = cma_areas[cma_area_count];
  -   phys_addr_t alignment;
  int ret = 0;
   
  -   pr_debug(%s(size %lx, base %08lx, limit %08lx)\n, __func__,
  -(unsigned long)size, (unsigned long)base,
  -(unsigned long)limit);
  +   pr_debug(%s(size %lx, base %08lx, limit %08lx align_order %08lx)\n,
  +   __func__, (unsigned long)size, (unsigned long)base,
  +   (unsigned long)limit, (unsigned long)alignment);
 
 Nit: Align with the rest of the arguments, i.e.:
 
 + pr_debug(%s(size %lx, base %08lx, limit %08lx align_order %08lx)\n,
 +  __func__, (unsigned long)size, (unsigned long)base,
 +  (unsigned long)limit, (unsigned long)alignment);

What's the difference between mine and yours?

Thanks.


Re: [PATCH v2 05/10] DMA, CMA: support arbitrary bitmap granularity

2014-06-15 Thread Joonsoo Kim
On Thu, Jun 12, 2014 at 12:19:54PM +0200, Michal Nazarewicz wrote:
 On Thu, Jun 12 2014, Joonsoo Kim iamjoonsoo@lge.com wrote:
  ppc kvm's cma region management requires arbitrary bitmap granularity,
  since they want to reserve very large memory and manage this region
  with bitmap that one bit for several pages to reduce management overheads.
  So support arbitrary bitmap granularity for following generalization.
 
  Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
 
 Acked-by: Michal Nazarewicz min...@mina86.com
 
  diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
  index bc4c171..9bc9340 100644
  --- a/drivers/base/dma-contiguous.c
  +++ b/drivers/base/dma-contiguous.c
  @@ -38,6 +38,7 @@ struct cma {
  unsigned long   base_pfn;
  unsigned long   count;
 
 Have you considered replacing count with maxno?

No, I haven't.
I think that count is better than maxno, since it represents the number
of pages in this region.

 
  unsigned long   *bitmap;
  +   int order_per_bit; /* Order of pages represented by one bit */
 
 I'd make it unsigned.

Will fix it.

  struct mutexlock;
   };
   
  +static void clear_cma_bitmap(struct cma *cma, unsigned long pfn, int
  count)
 
 For consistency cma_clear_bitmap would make more sense I think.  On the
 other hand, you're just moving stuff around so perhaps renaming the
 function at this point is not worth it any more.

Will fix it.

  +{
  +   unsigned long bitmapno, nr_bits;
  +
  +   bitmapno = (pfn - cma-base_pfn)  cma-order_per_bit;
  +   nr_bits = cma_bitmap_pages_to_bits(cma, count);
  +
  +   mutex_lock(cma-lock);
  +   bitmap_clear(cma-bitmap, bitmapno, nr_bits);
  +   mutex_unlock(cma-lock);
  +}
  +
   static int __init cma_activate_area(struct cma *cma)
   {
  -   int bitmap_size = BITS_TO_LONGS(cma-count) * sizeof(long);
  +   int bitmap_maxno = cma_bitmap_maxno(cma);
  +   int bitmap_size = BITS_TO_LONGS(bitmap_maxno) * sizeof(long);
  unsigned long base_pfn = cma-base_pfn, pfn = base_pfn;
  unsigned i = cma-count  pageblock_order;
  struct zone *zone;
 
 bitmap_maxno is never used again, perhaps:
 
 + int bitmap_size = BITS_TO_LONGS(cma_bitmap_maxno(cma)) * sizeof(long);
 
 instead? Up to you.

Okay!!

Thanks.


Re: [PATCH v2 03/10] DMA, CMA: separate core cma management codes from DMA APIs

2014-06-15 Thread Joonsoo Kim
On Thu, Jun 12, 2014 at 02:37:43PM +0900, Minchan Kim wrote:
 On Thu, Jun 12, 2014 at 12:21:40PM +0900, Joonsoo Kim wrote:
  To prepare future generalization work on cma area management code,
  we need to separate core cma management codes from DMA APIs.
  We will extend these core functions to cover requirements of
  ppc kvm's cma area management functionality in following patches.
  This separation helps us not to touch DMA APIs while extending
  core functions.
  
  Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
  
  diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
  index fb0cdce..8a44c82 100644
  --- a/drivers/base/dma-contiguous.c
  +++ b/drivers/base/dma-contiguous.c
  @@ -231,9 +231,9 @@ core_initcall(cma_init_reserved_areas);
* If @fixed is true, reserve contiguous area at exactly @base.  If false,
* reserve in range from @base to @limit.
*/
  -int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
  -  phys_addr_t limit, struct cma **res_cma,
  -  bool fixed)
  +static int __init __dma_contiguous_reserve_area(phys_addr_t size,
  +   phys_addr_t base, phys_addr_t limit,
  +   struct cma **res_cma, bool fixed)
   {
  struct cma *cma = cma_areas[cma_area_count];
  phys_addr_t alignment;
  @@ -288,16 +288,30 @@ int __init dma_contiguous_reserve_area(phys_addr_t 
  size, phys_addr_t base,
   
  pr_info(%s(): reserved %ld MiB at %08lx\n,
  __func__, (unsigned long)size / SZ_1M, (unsigned long)base);
  -
  -   /* Architecture specific contiguous memory fixup. */
  -   dma_contiguous_early_fixup(base, size);
  return 0;
  +
   err:
  pr_err(%s(): failed to reserve %ld MiB\n,
  __func__, (unsigned long)size / SZ_1M);
  return ret;
   }
   
  +int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
  +  phys_addr_t limit, struct cma **res_cma,
  +  bool fixed)
  +{
  +   int ret;
  +
  +   ret = __dma_contiguous_reserve_area(size, base, limit, res_cma, fixed);
  +   if (ret)
  +   return ret;
  +
  +   /* Architecture specific contiguous memory fixup. */
  +   dma_contiguous_early_fixup(base, size);
 
 In old, base and size are aligned with alignment and passed into arch fixup
 but your patch is changing it.
 I didn't look at what kinds of side effect it makes but just want to confirm.

Good catch!!!
I will fix it.

  +
  +   return 0;
  +}
  +
   static void clear_cma_bitmap(struct cma *cma, unsigned long pfn, int count)
   {
  mutex_lock(cma-lock);
  @@ -316,20 +330,16 @@ static void clear_cma_bitmap(struct cma *cma, 
  unsigned long pfn, int count)
* global one. Requires architecture specific dev_get_cma_area() helper
* function.
*/
  -struct page *dma_alloc_from_contiguous(struct device *dev, int count,
  +static struct page *__dma_alloc_from_contiguous(struct cma *cma, int count,
 unsigned int align)
   {
  unsigned long mask, pfn, pageno, start = 0;
  -   struct cma *cma = dev_get_cma_area(dev);
  struct page *page = NULL;
  int ret;
   
  if (!cma || !cma-count)
  return NULL;
   
  -   if (align  CONFIG_CMA_ALIGNMENT)
  -   align = CONFIG_CMA_ALIGNMENT;
  -
  pr_debug(%s(cma %p, count %d, align %d)\n, __func__, (void *)cma,
   count, align);
   
  @@ -377,6 +387,17 @@ struct page *dma_alloc_from_contiguous(struct device 
  *dev, int count,
  return page;
   }
   
 
 Please move the description in __dma_alloc_from_contiguous to here exported 
 API.
 

Okay.

  +struct page *dma_alloc_from_contiguous(struct device *dev, int count,
  +  unsigned int align)
  +{
  +   struct cma *cma = dev_get_cma_area(dev);
  +
  +   if (align  CONFIG_CMA_ALIGNMENT)
  +   align = CONFIG_CMA_ALIGNMENT;
  +
  +   return __dma_alloc_from_contiguous(cma, count, align);
  +}
  +
   /**
* dma_release_from_contiguous() - release allocated pages
* @dev:   Pointer to device for which the pages were allocated.
  @@ -387,10 +408,9 @@ struct page *dma_alloc_from_contiguous(struct device 
  *dev, int count,
* It returns false when provided pages do not belong to contiguous area 
  and
* true otherwise.
*/
  -bool dma_release_from_contiguous(struct device *dev, struct page *pages,
  +static bool __dma_release_from_contiguous(struct cma *cma, struct page 
  *pages,
   int count)
   {
  -   struct cma *cma = dev_get_cma_area(dev);
  unsigned long pfn;
   
  if (!cma || !pages)
  @@ -410,3 +430,11 @@ bool dma_release_from_contiguous(struct device *dev, 
  struct page *pages,
   
  return true;
   }
  +
 
 Ditto.
 
Okay.

Thanks.

Re: [PATCH v2 06/10] CMA: generalize CMA reserved area management functionality

2014-06-15 Thread Joonsoo Kim
On Sat, Jun 14, 2014 at 03:46:44PM +0530, Aneesh Kumar K.V wrote:
 Joonsoo Kim iamjoonsoo@lge.com writes:
 
  Currently, there are two users on CMA functionality, one is the DMA
  subsystem and the other is the kvm on powerpc. They have their own code
  to manage CMA reserved area even if they looks really similar.
  From my guess, it is caused by some needs on bitmap management. Kvm side
  wants to maintain bitmap not for 1 page, but for more size. Eventually it
  use bitmap where one bit represents 64 pages.
 
  When I implement CMA related patches, I should change those two places
  to apply my change and it seem to be painful to me. I want to change
  this situation and reduce future code management overhead through
  this patch.
 
  This change could also help developer who want to use CMA in their
  new feature development, since they can use CMA easily without
  copying & pasting this reserved area management code.
 
  In previous patches, we have prepared some features to generalize
  CMA reserved area management and now it's time to do it. This patch
  moves core functions to mm/cma.c and change DMA APIs to use
  these functions.
 
  There is no functional change in DMA APIs.
 
  v2: There is no big change from v1 in mm/cma.c. Mostly renaming.
 
  Acked-by: Michal Nazarewicz min...@mina86.com
  Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
 
 
 .
 
  +
  +   mask = cma_bitmap_aligned_mask(cma, align);
  +   bitmap_maxno = cma_bitmap_maxno(cma);
  +   nr_bits = cma_bitmap_pages_to_bits(cma, count);
  +
  +   for (;;) {
  +   mutex_lock(cma-lock);
  +   bitmapno = bitmap_find_next_zero_area(cma-bitmap,
  +   bitmap_maxno, start, nr_bits, mask);
  +   if (bitmapno = bitmap_maxno) {
  +   mutex_unlock(cma-lock);
  +   break;
  +   }
  +   bitmap_set(cma-bitmap, bitmapno, nr_bits);
  +   /*
  +* It's safe to drop the lock here. We've marked this region for
  +* our exclusive use. If the migration fails we will take the
  +* lock again and unmark it.
  +*/
  +   mutex_unlock(cma-lock);
  +
  +   pfn = cma-base_pfn + (bitmapno  cma-order_per_bit);
  +   mutex_lock(cma_mutex);
  +   ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA);
  +   mutex_unlock(cma_mutex);
  +   if (ret == 0) {
  +   page = pfn_to_page(pfn);
  +   break;
  +   } else if (ret != -EBUSY) {
  +   clear_cma_bitmap(cma, pfn, count);
  +   break;
  +   }
  +   
 
 
 For setting bit map we do
   bitmap_set(cma-bitmap, bitmapno, nr_bits);
 alloc_contig()..
 if (error)
 clear_cma_bitmap(cma, pfn, count);
 
 Why ?
 
 why not bitmap_clear() ?
 

Unlike your pseudo-code, for setting the bitmap we do:
- grab the mutex
- bitmap_set()
- release the mutex

clear_cma_bitmap() handles the same locking for the clear side.
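
In other words, both sides touch the bitmap under cma->lock; the clear side
is simply wrapped in a helper. A rough sketch of the pattern, with the set
side pulled into a made-up set_cma_bitmap() wrapper (in the real code it is
inline in the allocation loop) and the order_per_bit granularity omitted:

#include <linux/bitmap.h>
#include <linux/mutex.h>

struct cma {                            /* trimmed to the fields used here */
        unsigned long base_pfn;
        unsigned long *bitmap;
        struct mutex lock;
};

static void set_cma_bitmap(struct cma *cma, unsigned long start, int count)
{
        mutex_lock(&cma->lock);
        bitmap_set(cma->bitmap, start, count);
        mutex_unlock(&cma->lock);
}

static void clear_cma_bitmap(struct cma *cma, unsigned long pfn, int count)
{
        mutex_lock(&cma->lock);
        bitmap_clear(cma->bitmap, pfn - cma->base_pfn, count);
        mutex_unlock(&cma->lock);
}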

Thanks.



Re: [PATCH v2 07/10] PPC, KVM, CMA: use general CMA reserved area management framework

2014-06-15 Thread Joonsoo Kim
On Sat, Jun 14, 2014 at 03:35:33PM +0530, Aneesh Kumar K.V wrote:
 Joonsoo Kim iamjoonsoo@lge.com writes:
 
  Now, we have general CMA reserved area management framework,
  so use it for future maintainability. There is no functional change.
 
  Acked-by: Michal Nazarewicz min...@mina86.com
  Acked-by: Paolo Bonzini pbonz...@redhat.com
  Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
 
  diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
  b/arch/powerpc/kvm/book3s_hv_builtin.c
  index 7cde8a6..28ec226 100644
  --- a/arch/powerpc/kvm/book3s_hv_builtin.c
  +++ b/arch/powerpc/kvm/book3s_hv_builtin.c
  @@ -16,12 +16,14 @@
   #include linux/init.h
   #include linux/memblock.h
   #include linux/sizes.h
  +#include linux/cma.h
 
   #include asm/cputable.h
   #include asm/kvm_ppc.h
   #include asm/kvm_book3s.h
 
  -#include book3s_hv_cma.h
  +#define KVM_CMA_CHUNK_ORDER18
  +
   /*
* Hash page table alignment on newer cpus(CPU_FTR_ARCH_206)
* should be power of 2.
  @@ -43,6 +45,8 @@ static unsigned long kvm_cma_resv_ratio = 5;
   unsigned long kvm_rma_pages = (1  27)  PAGE_SHIFT; /* 128MB */
   EXPORT_SYMBOL_GPL(kvm_rma_pages);
 
  +static struct cma *kvm_cma;
  +
   /* Work out RMLS (real mode limit selector) field value for a given RMA 
  size.
  Assumes POWER7 or PPC970. */
   static inline int lpcr_rmls(unsigned long rma_size)
  @@ -97,7 +101,7 @@ struct kvm_rma_info *kvm_alloc_rma()
  ri = kmalloc(sizeof(struct kvm_rma_info), GFP_KERNEL);
  if (!ri)
  return NULL;
  -   page = kvm_alloc_cma(kvm_rma_pages, kvm_rma_pages);
  +   page = cma_alloc(kvm_cma, kvm_rma_pages, get_order(kvm_rma_pages));
  if (!page)
  goto err_out;
  atomic_set(ri-use_count, 1);
  @@ -112,7 +116,7 @@ EXPORT_SYMBOL_GPL(kvm_alloc_rma);
   void kvm_release_rma(struct kvm_rma_info *ri)
   {
  if (atomic_dec_and_test(ri-use_count)) {
  -   kvm_release_cma(pfn_to_page(ri-base_pfn), kvm_rma_pages);
  +   cma_release(kvm_cma, pfn_to_page(ri-base_pfn), kvm_rma_pages);
  kfree(ri);
  }
   }
  @@ -134,13 +138,13 @@ struct page *kvm_alloc_hpt(unsigned long nr_pages)
  /* Old CPUs require HPT aligned on a multiple of its size */
  if (!cpu_has_feature(CPU_FTR_ARCH_206))
  align_pages = nr_pages;
  -   return kvm_alloc_cma(nr_pages, align_pages);
  +   return cma_alloc(kvm_cma, nr_pages, get_order(align_pages));
   }
   EXPORT_SYMBOL_GPL(kvm_alloc_hpt);
 
   void kvm_release_hpt(struct page *page, unsigned long nr_pages)
   {
  -   kvm_release_cma(page, nr_pages);
  +   cma_release(kvm_cma, page, nr_pages);
   }
   EXPORT_SYMBOL_GPL(kvm_release_hpt);
 
  @@ -179,7 +183,8 @@ void __init kvm_cma_reserve(void)
  align_size = HPT_ALIGN_PAGES  PAGE_SHIFT;
 
  align_size = max(kvm_rma_pages  PAGE_SHIFT, align_size);
  -   kvm_cma_declare_contiguous(selected_size, align_size);
  +   cma_declare_contiguous(selected_size, 0, 0, align_size,
  +   KVM_CMA_CHUNK_ORDER - PAGE_SHIFT, kvm_cma, false);
  }
   }
 
  diff --git a/arch/powerpc/kvm/book3s_hv_cma.c 
  b/arch/powerpc/kvm/book3s_hv_cma.c
  deleted file mode 100644
  index d9d3d85..000
  --- a/arch/powerpc/kvm/book3s_hv_cma.c
  +++ /dev/null
  @@ -1,240 +0,0 @@
  -/*
  - * Contiguous Memory Allocator for ppc KVM hash pagetable  based on CMA
  - * for DMA mapping framework
  - *
  - * Copyright IBM Corporation, 2013
  - * Author Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
  - *
  - * This program is free software; you can redistribute it and/or
  - * modify it under the terms of the GNU General Public License as
  - * published by the Free Software Foundation; either version 2 of the
  - * License or (at your optional) any later version of the license.
  - *
  - */
  -#define pr_fmt(fmt) kvm_cma:  fmt
  -
  -#ifdef CONFIG_CMA_DEBUG
  -#ifndef DEBUG
  -#  define DEBUG
  -#endif
  -#endif
  -
  -#include linux/memblock.h
  -#include linux/mutex.h
  -#include linux/sizes.h
  -#include linux/slab.h
  -
  -#include book3s_hv_cma.h
  -
  -struct kvm_cma {
  -   unsigned long   base_pfn;
  -   unsigned long   count;
  -   unsigned long   *bitmap;
  -};
  -
  -static DEFINE_MUTEX(kvm_cma_mutex);
  -static struct kvm_cma kvm_cma_area;
  -
  -/**
  - * kvm_cma_declare_contiguous() - reserve area for contiguous memory 
  handling
  - *   for kvm hash pagetable
  - * @size:  Size of the reserved memory.
  - * @alignment:  Alignment for the contiguous memory area
  - *
  - * This function reserves memory for kvm cma area. It should be
  - * called by arch code when early allocator (memblock or bootmem)
  - * is still activate.
  - */
  -long __init kvm_cma_declare_contiguous(phys_addr_t size, phys_addr_t 
  alignment)
  -{
  -   long base_pfn;
  -   phys_addr_t addr;
  -   struct kvm_cma *cma = kvm_cma_area;
  -
  -   pr_debug(%s(size %lx)\n, __func__, (unsigned long)size);
  -
  -   if (!size)

Re: [PATCH v2 00/10] CMA: generalize CMA reserved area management code

2014-06-15 Thread Joonsoo Kim
On Sat, Jun 14, 2014 at 12:55:39PM +0530, Aneesh Kumar K.V wrote:
 Joonsoo Kim iamjoonsoo@lge.com writes:
 
  Currently, there are two users on CMA functionality, one is the DMA
  subsystem and the other is the kvm on powerpc. They have their own code
  to manage CMA reserved area even if they looks really similar.
  From my guess, it is caused by some needs on bitmap management. Kvm side
  wants to maintain bitmap not for 1 page, but for more size. Eventually it
  use bitmap where one bit represents 64 pages.
 
  When I implement CMA related patches, I should change those two places
  to apply my change and it seem to be painful to me. I want to change
  this situation and reduce future code management overhead through
  this patch.
 
  This change could also help developer who want to use CMA in their
  new feature development, since they can use CMA easily without
  copying  pasting this reserved area management code.
 
  v2:
Although this patchset looks very different with v1, the end result,
that is, mm/cma.c is same with v1's one. So I carry Ack to patch 6-7.
 
  Patch 1-5 prepare some features to cover ppc kvm's requirements.
  Patch 6-7 generalize CMA reserved area management code and change users
  to use it.
  Patch 8-10 clean-up minor things.
 
 
 I wanted to test the ppc changes and found that the patch series doesn't apply
 against v3.15 . Do you have a kernel tree which I can clone to test this
 series ?

This is based on linux-next -next-20140610.
And my tree is on following link.

https://github.com/JoonsooKim/linux/tree/cma-general-v2.0-next-20140610

But, I think I'm late, because you have already added a Tested-by tag.

Thanks.
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 07/10] PPC, KVM, CMA: use general CMA reserved area management framework

2014-06-15 Thread Joonsoo Kim
On Sat, Jun 14, 2014 at 02:23:59PM +0530, Aneesh Kumar K.V wrote:
 Joonsoo Kim iamjoonsoo@lge.com writes:
 
  Now, we have general CMA reserved area management framework,
  so use it for future maintainabilty. There is no functional change.
 
  Acked-by: Michal Nazarewicz min...@mina86.com
  Acked-by: Paolo Bonzini pbonz...@redhat.com
  Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com
 
 Need this. We may want to keep the VM_BUG_ON by moving
 KVM_CMA_CHUNK_ORDER around.
 
 diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
 b/arch/powerpc/kvm/book3s_64_mmu_hv.c
 index 8056107..1932e0e 100644
 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
 +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
 @@ -37,8 +37,6 @@
  #include <asm/ppc-opcode.h>
  #include <asm/cputable.h>
  
 -#include "book3s_hv_cma.h"
 -
  /* POWER7 has 10-bit LPIDs, PPC970 has 6-bit LPIDs */
  #define MAX_LPID_970   63
  
 @@ -64,7 +62,6 @@ long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
 }
  
 kvm->arch.hpt_cma_alloc = 0;
 -   VM_BUG_ON(order < KVM_CMA_CHUNK_ORDER);
 page = kvm_alloc_hpt(1 << (order - PAGE_SHIFT));
 if (page) {
 hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page));
 
 
 
 -aneesh

Okay.
So do you also want this?

@@ -131,16 +135,18 @@ struct page *kvm_alloc_hpt(unsigned long nr_pages)
 {
unsigned long align_pages = HPT_ALIGN_PAGES;
 
+   VM_BUG_ON(get_order(nr_pages) < KVM_CMA_CHUNK_ORDER - PAGE_SHIFT);
+
/* Old CPUs require HPT aligned on a multiple of its size */
if (!cpu_has_feature(CPU_FTR_ARCH_206))
align_pages = nr_pages;
-   return kvm_alloc_cma(nr_pages, align_pages);
+   return cma_alloc(kvm_cma, nr_pages, get_order(align_pages));
 }

Thanks.
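
For reference, a rough sketch of the arithmetic behind moving that check (assuming 4 KiB pages; this is commentary, not part of the patch):

/*
 * KVM_CMA_CHUNK_ORDER = 18               -> one bitmap bit covers 256 KiB
 * KVM_CMA_CHUNK_ORDER - PAGE_SHIFT = 6   -> i.e. 64 pages per bit
 *
 * Any allocation routed through the kvm CMA area must therefore be at
 * least 64 pages, so the assertion is most useful in kvm_alloc_hpt(),
 * where the page count is actually known.
 */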


[PATCH v3 -next 1/9] DMA, CMA: fix possible memory leak

2014-06-15 Thread Joonsoo Kim
We should free the memory for the bitmap when we find a zone mismatch;
otherwise this memory will leak.

Additionally, I copied the code comment from PPC KVM's CMA code to explain
why we need to check for a zone mismatch.

* Note
Minchan suggested adding a tag for stable, but I don't do that here,
because I found this possibility during code review and, IMO,
this patch isn't suitable for the stable tree.

Acked-by: Zhang Yanfei zhangyan...@cn.fujitsu.com
Reviewed-by: Michal Nazarewicz min...@mina86.com
Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
index 83969f8..6467c91 100644
--- a/drivers/base/dma-contiguous.c
+++ b/drivers/base/dma-contiguous.c
@@ -176,14 +176,24 @@ static int __init cma_activate_area(struct cma *cma)
base_pfn = pfn;
for (j = pageblock_nr_pages; j; --j, pfn++) {
WARN_ON_ONCE(!pfn_valid(pfn));
+   /*
+* alloc_contig_range requires the pfn range
+* specified to be in the same zone. Make this
+* simple by forcing the entire CMA resv range
+* to be in the same zone.
+*/
if (page_zone(pfn_to_page(pfn)) != zone)
-   return -EINVAL;
+   goto err;
}
init_cma_reserved_pageblock(pfn_to_page(base_pfn));
} while (--i);
 
mutex_init(&cma->lock);
return 0;
+
+err:
+   kfree(cma->bitmap);
+   return -EINVAL;
 }
 
 static struct cma cma_areas[MAX_CMA_AREAS];
-- 
1.7.9.5
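
The shape of the fix, as a stand-alone illustration (the names demo_area/demo_activate are made up; this is not the kernel code):

#include <stdlib.h>

struct demo_area {
	unsigned long *bitmap;
};

static int demo_activate(struct demo_area *a, int zones_match)
{
	a->bitmap = calloc(64, sizeof(*a->bitmap));
	if (!a->bitmap)
		return -1;

	if (!zones_match)
		goto err;	/* a bare "return -1" here would leak a->bitmap */

	return 0;

err:
	free(a->bitmap);	/* the cleanup the patch adds */
	a->bitmap = NULL;
	return -1;
}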



[PATCH v3 -next 6/9] PPC, KVM, CMA: use general CMA reserved area management framework

2014-06-15 Thread Joonsoo Kim
Now that we have a general CMA reserved area management framework,
use it for future maintainability. There is no functional change.

v3: add zeroing to CMA region (Aneesh)
fix compile error (Aneesh)
move VM_BUG_ON() to kvm_alloc_hpt() in book3s_hv_builtin.c (Aneesh)

Acked-by: Michal Nazarewicz min...@mina86.com
Acked-by: Paolo Bonzini pbonz...@redhat.com
Tested-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 8056107..a41e625 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -37,8 +37,6 @@
 #include <asm/ppc-opcode.h>
 #include <asm/cputable.h>
 
-#include "book3s_hv_cma.h"
-
 /* POWER7 has 10-bit LPIDs, PPC970 has 6-bit LPIDs */
 #define MAX_LPID_970   63
 
@@ -64,10 +62,10 @@ long kvmppc_alloc_hpt(struct kvm *kvm, u32 *htab_orderp)
}
 
kvm->arch.hpt_cma_alloc = 0;
-   VM_BUG_ON(order < KVM_CMA_CHUNK_ORDER);
page = kvm_alloc_hpt(1 << (order - PAGE_SHIFT));
if (page) {
hpt = (unsigned long)pfn_to_kaddr(page_to_pfn(page));
+   memset((void *)hpt, 0, (1 << order));
kvm->arch.hpt_cma_alloc = 1;
}
 
diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
b/arch/powerpc/kvm/book3s_hv_builtin.c
index 7cde8a6..3960e0b 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -16,12 +16,14 @@
 #include <linux/init.h>
 #include <linux/memblock.h>
 #include <linux/sizes.h>
+#include <linux/cma.h>
 
 #include <asm/cputable.h>
 #include <asm/kvm_ppc.h>
 #include <asm/kvm_book3s.h>
 
-#include "book3s_hv_cma.h"
+#define KVM_CMA_CHUNK_ORDER  18
+
 /*
  * Hash page table alignment on newer cpus(CPU_FTR_ARCH_206)
  * should be power of 2.
@@ -43,6 +45,8 @@ static unsigned long kvm_cma_resv_ratio = 5;
 unsigned long kvm_rma_pages = (1 << 27) >> PAGE_SHIFT; /* 128MB */
 EXPORT_SYMBOL_GPL(kvm_rma_pages);
 
+static struct cma *kvm_cma;
+
 /* Work out RMLS (real mode limit selector) field value for a given RMA size.
Assumes POWER7 or PPC970. */
 static inline int lpcr_rmls(unsigned long rma_size)
@@ -97,7 +101,7 @@ struct kvm_rma_info *kvm_alloc_rma()
ri = kmalloc(sizeof(struct kvm_rma_info), GFP_KERNEL);
if (!ri)
return NULL;
-   page = kvm_alloc_cma(kvm_rma_pages, kvm_rma_pages);
+   page = cma_alloc(kvm_cma, kvm_rma_pages, get_order(kvm_rma_pages));
if (!page)
goto err_out;
atomic_set(&ri->use_count, 1);
@@ -112,7 +116,7 @@ EXPORT_SYMBOL_GPL(kvm_alloc_rma);
 void kvm_release_rma(struct kvm_rma_info *ri)
 {
if (atomic_dec_and_test(&ri->use_count)) {
-   kvm_release_cma(pfn_to_page(ri->base_pfn), kvm_rma_pages);
+   cma_release(kvm_cma, pfn_to_page(ri->base_pfn), kvm_rma_pages);
kfree(ri);
}
 }
@@ -131,16 +135,18 @@ struct page *kvm_alloc_hpt(unsigned long nr_pages)
 {
unsigned long align_pages = HPT_ALIGN_PAGES;
 
+   VM_BUG_ON(get_order(nr_pages) < KVM_CMA_CHUNK_ORDER - PAGE_SHIFT);
+
/* Old CPUs require HPT aligned on a multiple of its size */
if (!cpu_has_feature(CPU_FTR_ARCH_206))
align_pages = nr_pages;
-   return kvm_alloc_cma(nr_pages, align_pages);
+   return cma_alloc(kvm_cma, nr_pages, get_order(align_pages));
 }
 EXPORT_SYMBOL_GPL(kvm_alloc_hpt);
 
 void kvm_release_hpt(struct page *page, unsigned long nr_pages)
 {
-   kvm_release_cma(page, nr_pages);
+   cma_release(kvm_cma, page, nr_pages);
 }
 EXPORT_SYMBOL_GPL(kvm_release_hpt);
 
@@ -179,7 +185,8 @@ void __init kvm_cma_reserve(void)
align_size = HPT_ALIGN_PAGES << PAGE_SHIFT;
 
align_size = max(kvm_rma_pages << PAGE_SHIFT, align_size);
-   kvm_cma_declare_contiguous(selected_size, align_size);
+   cma_declare_contiguous(selected_size, 0, 0, align_size,
+   KVM_CMA_CHUNK_ORDER - PAGE_SHIFT, &kvm_cma, false);
}
 }
 
diff --git a/arch/powerpc/kvm/book3s_hv_cma.c b/arch/powerpc/kvm/book3s_hv_cma.c
deleted file mode 100644
index d9d3d85..000
--- a/arch/powerpc/kvm/book3s_hv_cma.c
+++ /dev/null
@@ -1,240 +0,0 @@
-/*
- * Contiguous Memory Allocator for ppc KVM hash pagetable  based on CMA
- * for DMA mapping framework
- *
- * Copyright IBM Corporation, 2013
- * Author Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License as
- * published by the Free Software Foundation; either version 2 of the
- * License or (at your optional) any later version of the license.
- *
- */
-#define pr_fmt(fmt) "kvm_cma: " fmt
-
-#ifdef CONFIG_CMA_DEBUG
-#ifndef DEBUG
-#  define DEBUG
-#endif
-#endif
-
-#include <linux/memblock.h>
-#include <linux/mutex.h>
-#include <linux/sizes.h>
-#include <linux/slab.h>
-

[PATCH v3 -next 4/9] DMA, CMA: support arbitrary bitmap granularity

2014-06-15 Thread Joonsoo Kim
PPC KVM's CMA area management requires arbitrary bitmap granularity,
since it wants to reserve very large memory and manage this region
with a bitmap where one bit covers several pages, to reduce management
overhead. So support arbitrary bitmap granularity for the following
generalization.

v3: use consistent local variable name (Minchan)
use unsigned int for order_per_bit (Michal)
change clear_cma_bitmap to cma_clear_bitmap for consistency (Michal)
remove un-needed local variable, bitmap_maxno (Michal)

Acked-by: Michal Nazarewicz min...@mina86.com
Acked-by: Zhang Yanfei zhangyan...@cn.fujitsu.com
Acked-by: Minchan Kim minc...@kernel.org
Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
index 5f62c28..c6eeb2c 100644
--- a/drivers/base/dma-contiguous.c
+++ b/drivers/base/dma-contiguous.c
@@ -38,6 +38,7 @@ struct cma {
unsigned long   base_pfn;
unsigned long   count;
unsigned long   *bitmap;
+   unsigned int order_per_bit; /* Order of pages represented by one bit */
struct mutex    lock;
 };
 
@@ -157,9 +158,37 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
 
 static DEFINE_MUTEX(cma_mutex);
 
+static unsigned long cma_bitmap_aligned_mask(struct cma *cma, int align_order)
+{
+   return (1 << (align_order >> cma->order_per_bit)) - 1;
+}
+
+static unsigned long cma_bitmap_maxno(struct cma *cma)
+{
+   return cma->count >> cma->order_per_bit;
+}
+
+static unsigned long cma_bitmap_pages_to_bits(struct cma *cma,
+   unsigned long pages)
+{
+   return ALIGN(pages, 1 << cma->order_per_bit) >> cma->order_per_bit;
+}
+
+static void cma_clear_bitmap(struct cma *cma, unsigned long pfn, int count)
+{
+   unsigned long bitmap_no, bitmap_count;
+
+   bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
+   bitmap_count = cma_bitmap_pages_to_bits(cma, count);
+
+   mutex_lock(&cma->lock);
+   bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
+   mutex_unlock(&cma->lock);
+}
+
 static int __init cma_activate_area(struct cma *cma)
 {
-   int bitmap_size = BITS_TO_LONGS(cma->count) * sizeof(long);
+   int bitmap_size = BITS_TO_LONGS(cma_bitmap_maxno(cma)) * sizeof(long);
unsigned long base_pfn = cma->base_pfn, pfn = base_pfn;
unsigned i = cma->count >> pageblock_order;
struct zone *zone;
@@ -215,9 +244,9 @@ static int __init cma_init_reserved_areas(void)
 core_initcall(cma_init_reserved_areas);
 
 static int __init __dma_contiguous_reserve_area(phys_addr_t size,
-   phys_addr_t base, phys_addr_t limit,
-   phys_addr_t alignment,
-   struct cma **res_cma, bool fixed)
+   phys_addr_t base, phys_addr_t limit,
+   phys_addr_t alignment, unsigned int order_per_bit,
+   struct cma **res_cma, bool fixed)
 {
struct cma *cma = &cma_areas[cma_area_count];
int ret = 0;
@@ -249,6 +278,10 @@ static int __init 
__dma_contiguous_reserve_area(phys_addr_t size,
size = ALIGN(size, alignment);
limit &= ~(alignment - 1);
 
+   /* size should be aligned with order_per_bit */
+   if (!IS_ALIGNED(size >> PAGE_SHIFT, 1 << order_per_bit))
+   return -EINVAL;
+
/* Reserve memory */
if (base && fixed) {
if (memblock_is_region_reserved(base, size) ||
@@ -273,6 +306,7 @@ static int __init __dma_contiguous_reserve_area(phys_addr_t 
size,
 */
cma->base_pfn = PFN_DOWN(base);
cma->count = size >> PAGE_SHIFT;
+   cma->order_per_bit = order_per_bit;
*res_cma = cma;
cma_area_count++;
 
@@ -308,7 +342,7 @@ int __init dma_contiguous_reserve_area(phys_addr_t size, 
phys_addr_t base,
 {
int ret;
 
-   ret = __dma_contiguous_reserve_area(size, base, limit, 0,
+   ret = __dma_contiguous_reserve_area(size, base, limit, 0, 0,
res_cma, fixed);
if (ret)
return ret;
@@ -320,17 +354,11 @@ int __init dma_contiguous_reserve_area(phys_addr_t size, 
phys_addr_t base,
return 0;
 }
 
-static void clear_cma_bitmap(struct cma *cma, unsigned long pfn, int count)
-{
-   mutex_lock(&cma->lock);
-   bitmap_clear(cma->bitmap, pfn - cma->base_pfn, count);
-   mutex_unlock(&cma->lock);
-}
-
 static struct page *__dma_alloc_from_contiguous(struct cma *cma, int count,
   unsigned int align)
 {
-   unsigned long mask, pfn, pageno, start = 0;
+   unsigned long mask, pfn, start = 0;
+   unsigned long bitmap_maxno, bitmap_no, bitmap_count;
struct page *page = NULL;
int ret;
 
@@ -343,18 +371,19 @@ static struct page *__dma_alloc_from_contiguous(struct 
cma *cma, int count,
if (!count)
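
A worked example of the pages-to-bits conversion the helpers above introduce (stand-alone sketch; order_per_bit = 6 is chosen to match the kvm case):

#include <stdio.h>

#define ORDER_PER_BIT 6	/* 64 pages per bitmap bit */

static unsigned long pages_to_bits(unsigned long pages)
{
	unsigned long chunk = 1UL << ORDER_PER_BIT;

	/* round up to a whole number of chunks, then convert to bits */
	return (pages + chunk - 1) >> ORDER_PER_BIT;
}

int main(void)
{
	printf("64 pages  -> %lu bit(s)\n", pages_to_bits(64));	/* 1 */
	printf("65 pages  -> %lu bit(s)\n", pages_to_bits(65));	/* 2 */
	printf("128 pages -> %lu bit(s)\n", pages_to_bits(128));	/* 2 */
	return 0;
}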

[PATCH v3 -next 2/9] DMA, CMA: separate core CMA management codes from DMA APIs

2014-06-15 Thread Joonsoo Kim
To prepare for future generalization work on the CMA area management code,
we need to separate the core CMA management code from the DMA APIs.
We will extend these core functions to cover the requirements of
PPC KVM's CMA area management functionality in the following patches.
This separation helps us avoid touching the DMA APIs while extending
the core functions.

v3: move descriptions to the exported APIs (Minchan)
pass aligned base and size to dma_contiguous_early_fixup() (Minchan)

Acked-by: Michal Nazarewicz min...@mina86.com
Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
index 6467c91..9021762 100644
--- a/drivers/base/dma-contiguous.c
+++ b/drivers/base/dma-contiguous.c
@@ -213,26 +213,9 @@ static int __init cma_init_reserved_areas(void)
 }
 core_initcall(cma_init_reserved_areas);
 
-/**
- * dma_contiguous_reserve_area() - reserve custom contiguous area
- * @size: Size of the reserved area (in bytes),
- * @base: Base address of the reserved area optional, use 0 for any
- * @limit: End address of the reserved memory (optional, 0 for any).
- * @res_cma: Pointer to store the created cma region.
- * @fixed: hint about where to place the reserved area
- *
- * This function reserves memory from early allocator. It should be
- * called by arch specific code once the early allocator (memblock or bootmem)
- * has been activated and all other subsystems have already allocated/reserved
- * memory. This function allows to create custom reserved areas for specific
- * devices.
- *
- * If @fixed is true, reserve contiguous area at exactly @base.  If false,
- * reserve in range from @base to @limit.
- */
-int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
-  phys_addr_t limit, struct cma **res_cma,
-  bool fixed)
+static int __init __dma_contiguous_reserve_area(phys_addr_t size,
+   phys_addr_t base, phys_addr_t limit,
+   struct cma **res_cma, bool fixed)
 {
struct cma *cma = &cma_areas[cma_area_count];
phys_addr_t alignment;
@@ -286,15 +269,47 @@ int __init dma_contiguous_reserve_area(phys_addr_t size, 
phys_addr_t base,
 
pr_info("CMA: reserved %ld MiB at %08lx\n", (unsigned long)size / SZ_1M,
(unsigned long)base);
-
-   /* Architecture specific contiguous memory fixup. */
-   dma_contiguous_early_fixup(base, size);
return 0;
+
 err:
pr_err("CMA: failed to reserve %ld MiB\n", (unsigned long)size / SZ_1M);
return ret;
 }
 
+/**
+ * dma_contiguous_reserve_area() - reserve custom contiguous area
+ * @size: Size of the reserved area (in bytes),
+ * @base: Base address of the reserved area optional, use 0 for any
+ * @limit: End address of the reserved memory (optional, 0 for any).
+ * @res_cma: Pointer to store the created cma region.
+ * @fixed: hint about where to place the reserved area
+ *
+ * This function reserves memory from early allocator. It should be
+ * called by arch specific code once the early allocator (memblock or bootmem)
+ * has been activated and all other subsystems have already allocated/reserved
+ * memory. This function allows to create custom reserved areas for specific
+ * devices.
+ *
+ * If @fixed is true, reserve contiguous area at exactly @base.  If false,
+ * reserve in range from @base to @limit.
+ */
+int __init dma_contiguous_reserve_area(phys_addr_t size, phys_addr_t base,
+  phys_addr_t limit, struct cma **res_cma,
+  bool fixed)
+{
+   int ret;
+
+   ret = __dma_contiguous_reserve_area(size, base, limit, res_cma, fixed);
+   if (ret)
+   return ret;
+
+   /* Architecture specific contiguous memory fixup. */
+   dma_contiguous_early_fixup(PFN_PHYS((*res_cma)->base_pfn),
+   (*res_cma)->count << PAGE_SHIFT);
+
+   return 0;
+}
+
 static void clear_cma_bitmap(struct cma *cma, unsigned long pfn, int count)
 {
mutex_lock(&cma->lock);
@@ -302,31 +317,16 @@ static void clear_cma_bitmap(struct cma *cma, unsigned 
long pfn, int count)
mutex_unlock(&cma->lock);
 }
 
-/**
- * dma_alloc_from_contiguous() - allocate pages from contiguous area
- * @dev:   Pointer to device for which the allocation is performed.
- * @count: Requested number of pages.
- * @align: Requested alignment of pages (in PAGE_SIZE order).
- *
- * This function allocates memory buffer for specified device. It uses
- * device specific contiguous memory area if available or the default
- * global one. Requires architecture specific dev_get_cma_area() helper
- * function.
- */
-struct page *dma_alloc_from_contiguous(struct device *dev, int count,
+static struct page *__dma_alloc_from_contiguous(struct cma *cma, int count,
   

[PATCH v3 -next 8/9] mm, CMA: change cma_declare_contiguous() to obey coding convention

2014-06-15 Thread Joonsoo Kim
Conventionally, we put the output parameter at the end of the parameter
list and put 'base' ahead of 'size', but cma_declare_contiguous()
doesn't follow that convention, so change it.

Additionally, move the cma_areas reference down to the position
where it is really needed.

v3: put 'base' ahead of 'size' (Minchan)

Acked-by: Michal Nazarewicz min...@mina86.com
Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
b/arch/powerpc/kvm/book3s_hv_builtin.c
index 3960e0b..6cf498a 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -185,8 +185,8 @@ void __init kvm_cma_reserve(void)
align_size = HPT_ALIGN_PAGES << PAGE_SHIFT;
 
align_size = max(kvm_rma_pages << PAGE_SHIFT, align_size);
-   cma_declare_contiguous(selected_size, 0, 0, align_size,
-   KVM_CMA_CHUNK_ORDER - PAGE_SHIFT, &kvm_cma, false);
+   cma_declare_contiguous(0, selected_size, 0, align_size,
+   KVM_CMA_CHUNK_ORDER - PAGE_SHIFT, false, &kvm_cma);
}
 }
 
diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
index 0411c1c..6606abd 100644
--- a/drivers/base/dma-contiguous.c
+++ b/drivers/base/dma-contiguous.c
@@ -165,7 +165,7 @@ int __init dma_contiguous_reserve_area(phys_addr_t size, 
phys_addr_t base,
 {
int ret;
 
-   ret = cma_declare_contiguous(size, base, limit, 0, 0, res_cma, fixed);
+   ret = cma_declare_contiguous(base, size, limit, 0, 0, fixed, res_cma);
if (ret)
return ret;
 
diff --git a/include/linux/cma.h b/include/linux/cma.h
index 69d3726..32cab7a 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -15,7 +15,7 @@ extern unsigned long cma_get_size(struct cma *cma);
 extern int __init cma_declare_contiguous(phys_addr_t size,
phys_addr_t base, phys_addr_t limit,
phys_addr_t alignment, unsigned int order_per_bit,
-   struct cma **res_cma, bool fixed);
+   bool fixed, struct cma **res_cma);
 extern struct page *cma_alloc(struct cma *cma, int count, unsigned int align);
 extern bool cma_release(struct cma *cma, struct page *pages, int count);
 #endif
diff --git a/mm/cma.c b/mm/cma.c
index b442a13..9961120 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -141,13 +141,13 @@ core_initcall(cma_init_reserved_areas);
 
 /**
  * cma_declare_contiguous() - reserve custom contiguous area
- * @size: Size of the reserved area (in bytes),
  * @base: Base address of the reserved area optional, use 0 for any
+ * @size: Size of the reserved area (in bytes),
  * @limit: End address of the reserved memory (optional, 0 for any).
  * @alignment: Alignment for the CMA area, should be power of 2 or zero
  * @order_per_bit: Order of pages represented by one bit on bitmap.
- * @res_cma: Pointer to store the created cma region.
  * @fixed: hint about where to place the reserved area
+ * @res_cma: Pointer to store the created cma region.
  *
  * This function reserves memory from early allocator. It should be
  * called by arch specific code once the early allocator (memblock or bootmem)
@@ -157,12 +157,12 @@ core_initcall(cma_init_reserved_areas);
  * If @fixed is true, reserve contiguous area at exactly @base.  If false,
  * reserve in range from @base to @limit.
  */
-int __init cma_declare_contiguous(phys_addr_t size,
-   phys_addr_t base, phys_addr_t limit,
+int __init cma_declare_contiguous(phys_addr_t base,
+   phys_addr_t size, phys_addr_t limit,
phys_addr_t alignment, unsigned int order_per_bit,
-   struct cma **res_cma, bool fixed)
+   bool fixed, struct cma **res_cma)
 {
-   struct cma *cma = &cma_areas[cma_area_count];
+   struct cma *cma;
int ret = 0;
 
pr_debug("%s(size %lx, base %08lx, limit %08lx alignment %08lx)\n",
@@ -218,6 +218,7 @@ int __init cma_declare_contiguous(phys_addr_t size,
 * Each reserved area must be initialised later, when more kernel
 * subsystems (like slab allocator) are available.
 */
+   cma = &cma_areas[cma_area_count];
cma->base_pfn = PFN_DOWN(base);
cma->count = size >> PAGE_SHIFT;
cma->order_per_bit = order_per_bit;
-- 
1.7.9.5
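
A hypothetical call site, only to show the reordered signature after this patch (my_cma and my_reserve are made-up names for the example):

static struct cma *my_cma;

static void __init my_reserve(void)
{
	/* base = 0 (anywhere), 64 MiB, no limit, default alignment,
	 * one page per bitmap bit, not fixed, output parameter last */
	if (cma_declare_contiguous(0, 64 << 20, 0, 0, 0, false, &my_cma))
		pr_warn("my_cma: reservation failed\n");
}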



[PATCH v3 -next 5/9] CMA: generalize CMA reserved area management functionality

2014-06-15 Thread Joonsoo Kim
Currently, there are two users of CMA functionality: the DMA
subsystem and KVM on powerpc. They have their own code to manage the
CMA reserved area even though it looks really similar. My guess is that
this is caused by differing needs in bitmap management: the KVM side
wants to maintain the bitmap not for 1 page, but for a larger unit.
Eventually it uses a bitmap where one bit represents 64 pages.

When I implemented CMA-related patches, I had to change both places,
which was painful. I want to change this situation and reduce future
code management overhead through this patch.

This change could also help developers who want to use CMA in their
new feature development, since they can use CMA easily without
copying & pasting this reserved area management code.

In previous patches, we have prepared some features to generalize
CMA reserved area management and now it's time to do it. This patch
moves the core functions to mm/cma.c and changes the DMA APIs to use
these functions.

There is no functional change in the DMA APIs.

v2: There is no big change from v1 in mm/cma.c. Mostly renaming.
v3: remove log2.h in dma-contiguous.c (Minchan)
add some accessor functions to pass aligned base and size to
dma_contiguous_early_fixup() function
move MAX_CMA_AREAS to cma.h

Acked-by: Michal Nazarewicz min...@mina86.com
Acked-by: Zhang Yanfei zhangyan...@cn.fujitsu.com
Acked-by: Minchan Kim minc...@kernel.org
Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 4c88935..3116880 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -26,6 +26,7 @@
 #include <linux/io.h>
 #include <linux/vmalloc.h>
 #include <linux/sizes.h>
+#include <linux/cma.h>
 
 #include <asm/memory.h>
 #include <asm/highmem.h>
diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 00e13ce..4eac559 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -283,16 +283,6 @@ config CMA_ALIGNMENT
 
  If unsure, leave the default value 8.
 
-config CMA_AREAS
-   int "Maximum count of the CMA device-private areas"
-   default 7
-   help
- CMA allows to create CMA areas for particular devices. This parameter
- sets the maximum number of such device private CMA areas in the
- system.
-
- If unsure, leave the default value 7.
-
 endif
 
 endmenu
diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
index c6eeb2c..0411c1c 100644
--- a/drivers/base/dma-contiguous.c
+++ b/drivers/base/dma-contiguous.c
@@ -24,25 +24,9 @@
 
 #include <linux/memblock.h>
 #include <linux/err.h>
-#include <linux/mm.h>
-#include <linux/mutex.h>
-#include <linux/page-isolation.h>
 #include <linux/sizes.h>
-#include <linux/slab.h>
-#include <linux/swap.h>
-#include <linux/mm_types.h>
 #include <linux/dma-contiguous.h>
-#include <linux/log2.h>
-
-struct cma {
-   unsigned long   base_pfn;
-   unsigned long   count;
-   unsigned long   *bitmap;
-   unsigned int order_per_bit; /* Order of pages represented by one bit */
-   struct mutex    lock;
-};
-
-struct cma *dma_contiguous_default_area;
+#include <linux/cma.h>
 
 #ifdef CONFIG_CMA_SIZE_MBYTES
 #define CMA_SIZE_MBYTES CONFIG_CMA_SIZE_MBYTES
@@ -50,6 +34,8 @@ struct cma *dma_contiguous_default_area;
 #define CMA_SIZE_MBYTES 0
 #endif
 
+struct cma *dma_contiguous_default_area;
+
 /*
  * Default global CMA area size can be defined in kernel's .config.
  * This is useful mainly for distro maintainers to create a kernel
@@ -156,169 +142,6 @@ void __init dma_contiguous_reserve(phys_addr_t limit)
}
 }
 
-static DEFINE_MUTEX(cma_mutex);
-
-static unsigned long cma_bitmap_aligned_mask(struct cma *cma, int align_order)
-{
-   return (1 << (align_order >> cma->order_per_bit)) - 1;
-}
-
-static unsigned long cma_bitmap_maxno(struct cma *cma)
-{
-   return cma->count >> cma->order_per_bit;
-}
-
-static unsigned long cma_bitmap_pages_to_bits(struct cma *cma,
-   unsigned long pages)
-{
-   return ALIGN(pages, 1 << cma->order_per_bit) >> cma->order_per_bit;
-}
-
-static void cma_clear_bitmap(struct cma *cma, unsigned long pfn, int count)
-{
-   unsigned long bitmap_no, bitmap_count;
-
-   bitmap_no = (pfn - cma->base_pfn) >> cma->order_per_bit;
-   bitmap_count = cma_bitmap_pages_to_bits(cma, count);
-
-   mutex_lock(&cma->lock);
-   bitmap_clear(cma->bitmap, bitmap_no, bitmap_count);
-   mutex_unlock(&cma->lock);
-}
-
-static int __init cma_activate_area(struct cma *cma)
-{
-   int bitmap_size = BITS_TO_LONGS(cma_bitmap_maxno(cma)) * sizeof(long);
-   unsigned long base_pfn = cma->base_pfn, pfn = base_pfn;
-   unsigned i = cma->count >> pageblock_order;
-   struct zone *zone;
-
-   cma->bitmap = kzalloc(bitmap_size, GFP_KERNEL);
-
-   if (!cma->bitmap)
-   return -ENOMEM;
-
-   
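
For a new user of the generalized code, the runtime side then reduces to cma_alloc()/cma_release() against the declared area. A hypothetical sketch, using the signatures introduced by this series:

/* allocate nr_pages page-aligned pages from a previously declared area */
static struct page *grab_buffer(struct cma *area, int nr_pages)
{
	return cma_alloc(area, nr_pages, 0);
}

static void drop_buffer(struct cma *area, struct page *page, int nr_pages)
{
	if (!cma_release(area, page, nr_pages))
		pr_warn("page range was not part of the CMA area\n");
}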

[PATCH v3 -next 9/9] mm, CMA: clean-up log message

2014-06-15 Thread Joonsoo Kim
We don't need the explicit 'CMA:' prefix, since we already define the
prefix 'cma:' in pr_fmt. So remove it.

Acked-by: Michal Nazarewicz min...@mina86.com
Reviewed-by: Zhang Yanfei zhangyan...@cn.fujitsu.com
Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/mm/cma.c b/mm/cma.c
index 9961120..4b251b0 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -225,12 +225,12 @@ int __init cma_declare_contiguous(phys_addr_t base,
*res_cma = cma;
cma_area_count++;
 
-   pr_info("CMA: reserved %ld MiB at %08lx\n", (unsigned long)size / SZ_1M,
+   pr_info("Reserved %ld MiB at %08lx\n", (unsigned long)size / SZ_1M,
(unsigned long)base);
return 0;
 
 err:
-   pr_err("CMA: failed to reserve %ld MiB\n", (unsigned long)size / SZ_1M);
+   pr_err("Failed to reserve %ld MiB\n", (unsigned long)size / SZ_1M);
return ret;
 }
 
-- 
1.7.9.5
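
For context, a minimal sketch of why the literal prefix is redundant (assuming the pr_fmt string in mm/cma.c is "cma: ", as the changelog says):

#define pr_fmt(fmt) "cma: " fmt	/* must come before printk.h is pulled in */
#include <linux/printk.h>

static void example(void)
{
	/* printed as: "cma: Reserved 16 MiB at 01000000" */
	pr_info("Reserved %ld MiB at %08lx\n", 16L, 0x1000000UL);
}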



[PATCH v3 -next 3/9] DMA, CMA: support alignment constraint on CMA region

2014-06-15 Thread Joonsoo Kim
PPC KVM's CMA area management needs an alignment constraint on the
CMA region. So support it to prepare for the generalization of CMA area
management functionality.

Additionally, add some comments which explain why the alignment
constraint is needed on the CMA region.

v3: fix wrongly spelled word, align_order->alignment (Minchan)
clarify the code documentation per Minchan's comment (Minchan)

Acked-by: Michal Nazarewicz min...@mina86.com
Reviewed-by: Aneesh Kumar K.V aneesh.ku...@linux.vnet.ibm.com
Signed-off-by: Joonsoo Kim iamjoonsoo@lge.com

diff --git a/drivers/base/dma-contiguous.c b/drivers/base/dma-contiguous.c
index 9021762..5f62c28 100644
--- a/drivers/base/dma-contiguous.c
+++ b/drivers/base/dma-contiguous.c
@@ -32,6 +32,7 @@
 #include <linux/swap.h>
 #include <linux/mm_types.h>
 #include <linux/dma-contiguous.h>
+#include <linux/log2.h>
 
 struct cma {
unsigned long   base_pfn;
@@ -215,17 +216,16 @@ core_initcall(cma_init_reserved_areas);
 
 static int __init __dma_contiguous_reserve_area(phys_addr_t size,
phys_addr_t base, phys_addr_t limit,
+   phys_addr_t alignment,
struct cma **res_cma, bool fixed)
 {
struct cma *cma = &cma_areas[cma_area_count];
-   phys_addr_t alignment;
int ret = 0;
 
-   pr_debug("%s(size %lx, base %08lx, limit %08lx)\n", __func__,
-(unsigned long)size, (unsigned long)base,
-(unsigned long)limit);
+   pr_debug("%s(size %lx, base %08lx, limit %08lx alignment %08lx)\n",
+   __func__, (unsigned long)size, (unsigned long)base,
+   (unsigned long)limit, (unsigned long)alignment);
 
-   /* Sanity checks */
if (cma_area_count == ARRAY_SIZE(cma_areas)) {
pr_err("Not enough slots for CMA reserved regions!\n");
return -ENOSPC;
@@ -234,8 +234,17 @@ static int __init 
__dma_contiguous_reserve_area(phys_addr_t size,
if (!size)
return -EINVAL;
 
-   /* Sanitise input arguments */
-   alignment = PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order);
+   if (alignment && !is_power_of_2(alignment))
+   return -EINVAL;
+
+   /*
+* Sanitise input arguments.
+* Pages both ends in CMA area could be merged into adjacent unmovable
+* migratetype page by page allocator's buddy algorithm. In the case,
+* you couldn't get a contiguous memory, which is not what we want.
+*/
+   alignment = max(alignment,
+   (phys_addr_t)PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order));
base = ALIGN(base, alignment);
size = ALIGN(size, alignment);
limit &= ~(alignment - 1);
@@ -299,7 +308,8 @@ int __init dma_contiguous_reserve_area(phys_addr_t size, 
phys_addr_t base,
 {
int ret;
 
-   ret = __dma_contiguous_reserve_area(size, base, limit, res_cma, fixed);
+   ret = __dma_contiguous_reserve_area(size, base, limit, 0,
+   res_cma, fixed);
if (ret)
return ret;
 
-- 
1.7.9.5
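
Rough numbers behind the forced minimum alignment (illustration only; assumes 4 KiB pages with MAX_ORDER = 11 and pageblock_order = 10, as on common configurations):

/*
 *   PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order)
 * = 4 KiB    << max(10, 10)
 * = 4 MiB
 *
 * A caller-supplied alignment below this floor is raised to it; a larger
 * power-of-two alignment is honoured as given.
 */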
