Re: [PATCH] SLUB use cmpxchg_local

2007-09-04 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote: > Measurements on IA64 slub w/per cpu vs slub w/per cpu/cmpxchg_local > emulation. Results are not good: > Hi Christoph, I tried to come up with a patch set implementing the basics of a new critical section: local_enter(flags) and local_exit(flags).

Re: [PATCH] SLUB use cmpxchg_local

2007-08-28 Thread Peter Zijlstra
On Tue, 2007-08-28 at 12:36 -0700, Christoph Lameter wrote: > On Tue, 28 Aug 2007, Peter Zijlstra wrote: > > > On Mon, 2007-08-27 at 15:15 -0700, Christoph Lameter wrote: > > > H. One wild idea would be to use a priority futex for the slab lock? > > > That would make the slow paths

Re: [PATCH] SLUB use cmpxchg_local

2007-08-28 Thread Christoph Lameter
On Tue, 28 Aug 2007, Mathieu Desnoyers wrote: > Ok, I just had a look at ia64 instruction set, and I fear that cmpxchg > must always come with the acquire or release semantic. Is there any > cmpxchg equivalent on ia64 that would be acquire and release semantic > free ? This implicit memory

Re: [PATCH] SLUB use cmpxchg_local

2007-08-28 Thread Christoph Lameter
On Tue, 28 Aug 2007, Peter Zijlstra wrote: > On Mon, 2007-08-27 at 15:15 -0700, Christoph Lameter wrote: > > H. One wild idea would be to use a priority futex for the slab lock? > > That would make the slow paths interrupt safe without requiring interrupt > > disable? Does a futex fit into

Re: [PATCH] SLUB use cmpxchg_local

2007-08-28 Thread Mathieu Desnoyers
Ok, I just had a look at ia64 instruction set, and I fear that cmpxchg must always come with the acquire or release semantic. Is there any cmpxchg equivalent on ia64 that would be acquire and release semantic free ? This implicit memory ordering in the instruction seems to be responsible for the

Re: [PATCH] SLUB use cmpxchg_local

2007-08-28 Thread Peter Zijlstra
On Mon, 2007-08-27 at 15:15 -0700, Christoph Lameter wrote: > H. One wild idea would be to use a priority futex for the slab lock? > That would make the slow paths interrupt safe without requiring interrupt > disable? Does a futex fit into the page struct? Very much puzzled at what you

Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Christoph Lameter
Measurements on IA64 slub w/per cpu vs slub w/per cpu/cmpxchg_local emulation. Results are not good: slub/per cpu 1 times kmalloc(8)/kfree -> 105 cycles 1 times kmalloc(16)/kfree -> 104 cycles 1 times kmalloc(32)/kfree -> 105 cycles 1 times kmalloc(64)/kfree -> 104 cycles 1

Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Christoph Lameter
On Mon, 27 Aug 2007, Mathieu Desnoyers wrote: > Hrm, I just want to certify one thing: A lot of code paths seems to go > to the slow path without requiring cmpxchg_local to execute at all. So > is the slow path more likely to be triggered by the (!object), > (!node_match) tests or by these same

Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote: > On Mon, 27 Aug 2007, Mathieu Desnoyers wrote: > > > > The slow path would require disable preemption and two interrupt disables. > > If the slow path have to call new_slab, then yes. But it seems that not > > every slow path must call it, so for

Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Christoph Lameter
H. One wild idea would be to use a priority futex for the slab lock? That would make the slow paths interrupt safe without requiring interrupt disable? Does a futex fit into the page struct? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message

Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Christoph Lameter
On Mon, 27 Aug 2007, Mathieu Desnoyers wrote: > > The slow path would require disable preemption and two interrupt disables. > If the slow path have to call new_slab, then yes. But it seems that not > every slow path must call it, so for the other slow paths, only one > interrupt disable would be

Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote: > On Mon, 27 Aug 2007, Mathieu Desnoyers wrote: > > > > a clean solution source code wise. It also minimizes the interrupt > > > holdoff > > > for the non-cmpxchg_local arches. However, it means that we will have to > > > disable interrupts twice

Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Christoph Lameter
On Mon, 27 Aug 2007, Mathieu Desnoyers wrote: > > a clean solution source code wise. It also minimizes the interrupt holdoff > > for the non-cmpxchg_local arches. However, it means that we will have to > > disable interrupts twice for the slow path. If that is too expensive then > > we need a

Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote: > I think the simplest solution may be to leave slub as done in the patch > that we developed last week. The arch must provide a cmpxchg_local that is > performance wise the fastest possible. On x86 this is going to be the > cmpxchg_local on others

Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Christoph Lameter
I think the simplest solution may be to leave slub as done in the patch that we developed last week. The arch must provide a cmpxchg_local that is performance wise the fastest possible. On x86 this is going to be the cmpxchg_local on others where cmpxchg is slower than interrupt disable/enable

Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote: > On Mon, 27 Aug 2007, Mathieu Desnoyers wrote: > > > * Christoph Lameter ([EMAIL PROTECTED]) wrote: > > > On Mon, 27 Aug 2007, Peter Zijlstra wrote: > > > > > > > So, if the fast path can be done with a preempt off, it might be doable > > > > to

Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Christoph Lameter
On Mon, 27 Aug 2007, Mathieu Desnoyers wrote: > * Christoph Lameter ([EMAIL PROTECTED]) wrote: > > On Mon, 27 Aug 2007, Peter Zijlstra wrote: > > > > > So, if the fast path can be done with a preempt off, it might be doable > > > to suffer the slow path with a per cpu lock like that. > > > >

Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote: > On Mon, 27 Aug 2007, Peter Zijlstra wrote: > > > So, if the fast path can be done with a preempt off, it might be doable > > to suffer the slow path with a per cpu lock like that. > > Sadly the cmpxchg_local requires local per cpu data access.

Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Christoph Lameter
On Mon, 27 Aug 2007, Peter Zijlstra wrote: > So, if the fast path can be done with a preempt off, it might be doable > to suffer the slow path with a per cpu lock like that. Sadly the cmpxchg_local requires local per cpu data access. Isn't there some way to make this less expensive on RT?

Re: [PATCH] SLUB use cmpxchg_local

2007-08-27 Thread Peter Zijlstra
On Tue, 2007-08-21 at 16:14 -0700, Christoph Lameter wrote: > On Tue, 21 Aug 2007, Mathieu Desnoyers wrote: > > > - Changed smp_rmb() for barrier(). We are not interested in read order > > across cpus, what we want is to be ordered wrt local interrupts only. > > barrier() is much cheaper than

Re: [PATCH] SLUB use cmpxchg_local

2007-08-22 Thread Christoph Lameter
Ok so we need this. Fix up preempt checks. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- mm/slub.c |4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) Index: linux-2.6/mm/slub.c === ---

Re: [PATCH] SLUB use cmpxchg_local

2007-08-22 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote: > On Wed, 22 Aug 2007, Mathieu Desnoyers wrote: > > > * Christoph Lameter ([EMAIL PROTECTED]) wrote: > > > void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags) > > > @@ -1577,7 +1590,10 @@ static void __slab_free(struct kmem_cach > > > { >

Re: [PATCH] SLUB use cmpxchg_local

2007-08-22 Thread Christoph Lameter
On Wed, 22 Aug 2007, Mathieu Desnoyers wrote: > > Then the thread could be preempted and rescheduled on a different cpu > > between put_cpu and local_irq_save() which means that we lose the > > state information of the kmem_cache_cpu structure. > > > > Maybe am I misunderstanding something,

Re: [PATCH] SLUB use cmpxchg_local

2007-08-22 Thread Christoph Lameter
On Wed, 22 Aug 2007, Mathieu Desnoyers wrote: > * Christoph Lameter ([EMAIL PROTECTED]) wrote: > > void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags) > > @@ -1577,7 +1590,10 @@ static void __slab_free(struct kmem_cach > > { > > void *prior; > > void **object = (void *)x; > > +

Re: [PATCH] SLUB use cmpxchg_local

2007-08-22 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote: > void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags) > @@ -1577,7 +1590,10 @@ static void __slab_free(struct kmem_cach > { > void *prior; > void **object = (void *)x; > + unsigned long flags; > > +

Re: [PATCH] SLUB use cmpxchg_local

2007-08-22 Thread Christoph Lameter
Here is the current cmpxchg_local version that I used for testing. Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]> --- include/linux/slub_def.h | 10 +++--- mm/slub.c| 74 --- 2 files changed, 56 insertions(+), 28 deletions(-)

Re: [PATCH] SLUB use cmpxchg_local

2007-08-22 Thread Christoph Lameter
I can confirm Mathieu's measurement now: Athlon64: regular NUMA/discontig 1. Kmalloc: Repeatedly allocate then free test 1 times kmalloc(8) -> 79 cycles kfree -> 92 cycles 1 times kmalloc(16) -> 79 cycles kfree -> 93 cycles 1 times kmalloc(32) -> 88 cycles kfree -> 95 cycles 1

Re: [PATCH] SLUB use cmpxchg_local

2007-08-22 Thread Mathieu Desnoyers
Measurements on a AMD64 2.0 GHz dual-core In this test, we seem to remove 10 cycles from the kmalloc fast path. On small allocations, it gives a 14% performance increase. kfree fast path also seems to have a 10 cycles improvement. 1. Kmalloc: Repeatedly allocate then free test * cmpxchg_local

Re: [PATCH] SLUB use cmpxchg_local

2007-08-22 Thread Andi Kleen
On Wed, Aug 22, 2007 at 09:45:33AM -0400, Mathieu Desnoyers wrote: > Measurements on a AMD64 2.0 GHz dual-core > > In this test, we seem to remove 10 cycles from the kmalloc fast path. > On small allocations, it gives a 14% performance increase. kfree fast > path also seems to have a 10 cycles

Re: [PATCH] SLUB use cmpxchg_local

2007-08-22 Thread Andi Kleen
On Tue, Aug 21, 2007 at 06:06:19PM -0700, Christoph Lameter wrote: > Ok. Measurements vs. simple cmpxchg on a Intel(R) Pentium(R) 4 CPU 3.20GHz Note the P4 is an extreme case in that "unusual" instructions are quite slow (basically anything that falls out of the trace cache). Core2 tends to be

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote: > On Tue, 21 Aug 2007, Mathieu Desnoyers wrote: > > > As I am going back through the initial cmpxchg_local implementation, it > > seems like it was executing __slab_alloc() with preemption disabled, > > which is wrong. new_slab() is not designed for

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote: > Ok. Measurements vs. simple cmpxchg on a Intel(R) Pentium(R) 4 CPU 3.20GHz > (hyperthreading enabled). Test run with your module show only minor > performance improvements and lots of regressions. So we must have > cmpxchg_local to see any

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
Ok. Measurements vs. simple cmpxchg on a Intel(R) Pentium(R) 4 CPU 3.20GHz (hyperthreading enabled). Test run with your module show only minor performance improvements and lots of regressions. So we must have cmpxchg_local to see any improvements? Some kind of a recent optimization of cmpxchg

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Andi Kleen ([EMAIL PROTECTED]) wrote: > Mathieu Desnoyers <[EMAIL PROTECTED]> writes: > > > > The measurements I get (in cycles): > > enable interrupts (STI) disable interrupts (CLI) local > > CMPXCHG > > IA32 (P4)11282 26 >

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote: > As I am going back through the initial cmpxchg_local implementation, it > seems like it was executing __slab_alloc() with preemption disabled, > which is wrong. new_slab() is not designed for that. The version I send you did not use preemption. We

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Andi Kleen
Mathieu Desnoyers <[EMAIL PROTECTED]> writes: > > The measurements I get (in cycles): > enable interrupts (STI) disable interrupts (CLI) local > CMPXCHG > IA32 (P4)11282 26 > x86_64 AMD64 125 102

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote: > On Tue, 21 Aug 2007, Mathieu Desnoyers wrote: > > > - Rounding error.. you seem to round at 0.1ms, but I keep the values in > > cycles. The times that you get (1.1ms) seems strangely higher than > > mine, which are under 1000 cycles on a 3GHz

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote: > - Rounding error.. you seem to round at 0.1ms, but I keep the values in > cycles. The times that you get (1.1ms) seems strangely higher than > mine, which are under 1000 cycles on a 3GHz system (less than 333ns). > I guess there is both a ms -

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote: > On Tue, 21 Aug 2007, Mathieu Desnoyers wrote: > > > Are you running a UP or SMP kernel ? If you run a UP kernel, the > > cmpxchg_local and cmpxchg are identical. > > UP. > > > Oh, and if you run your tests at boot time, the alternatives code may

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote: > Are you running a UP or SMP kernel ? If you run a UP kernel, the > cmpxchg_local and cmpxchg are identical. UP. > Oh, and if you run your tests at boot time, the alternatives code may > have removed the lock prefix, therefore making cmpxchg and

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote: > On Tue, 21 Aug 2007, Mathieu Desnoyers wrote: > > > Using cmpxchg_local vs cmpxchg has a clear impact on the fast paths, as > > shown below: it saves about 60 to 70 cycles for kmalloc and 200 cycles > > for the kmalloc/kfree pair (test 2). > >

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Mathieu Desnoyers ([EMAIL PROTECTED]) wrote: > * Christoph Lameter ([EMAIL PROTECTED]) wrote: > > On Tue, 21 Aug 2007, Mathieu Desnoyers wrote: > > > > > - Changed smp_rmb() for barrier(). We are not interested in read order > > > across cpus, what we want is to be ordered wrt local

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote: > Using cmpxchg_local vs cmpxchg has a clear impact on the fast paths, as > shown below: it saves about 60 to 70 cycles for kmalloc and 200 cycles > for the kmalloc/kfree pair (test 2). H.. I wonder if the AMD processors simply do the same in

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote: > kmalloc(8)/kfree = 112 cycles > kmalloc(16)/kfree = 103 cycles > kmalloc(32)/kfree = 103 cycles > kmalloc(64)/kfree = 103 cycles > kmalloc(128)/kfree = 112 cycles > kmalloc(256)/kfree = 111 cycles > kmalloc(512)/kfree = 111 cycles >

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote: > On Tue, 21 Aug 2007, Mathieu Desnoyers wrote: > > > SLUB Use cmpxchg() everywhere. > > > > It applies to "SLUB: Single atomic instruction alloc/free using > > cmpxchg". > > > +++ slab/mm/slub.c 2007-08-20 18:42:28.0 -0400 > > @@ -1682,7

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote: > * cmpxchg_local Slub test > kmalloc(8) = 83 cycleskfree = 363 cycles > kmalloc(16) = 85 cycles kfree = 372 cycles > kmalloc(32) = 92 cycles kfree = 377 cycles > kmalloc(64) = 115 cycleskfree = 397 cycles >

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote: > On Tue, 21 Aug 2007, Mathieu Desnoyers wrote: > > > - Changed smp_rmb() for barrier(). We are not interested in read order > > across cpus, what we want is to be ordered wrt local interrupts only. > > barrier() is much cheaper than a rmb(). >

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
Reformatting... * Mathieu Desnoyers ([EMAIL PROTECTED]) wrote: > Hi Christoph, > > If you are interested in the raw numbers: > > The (very basic) test module follows. Make sure you change get_cycles() > for get_cycles_sync() if you plan to run this on x86_64. > > (tests taken on a 3GHz Pentium

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote: > SLUB Use cmpxchg() everywhere. > > It applies to "SLUB: Single atomic instruction alloc/free using > cmpxchg". > +++ slab/mm/slub.c2007-08-20 18:42:28.0 -0400 > @@ -1682,7 +1682,7 @@ redo: > > object[c->offset] = freelist; >

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote: > - Changed smp_rmb() for barrier(). We are not interested in read order > across cpus, what we want is to be ordered wrt local interrupts only. > barrier() is much cheaper than a rmb(). But this means a preempt disable is required. RT users do

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Christoph Lameter ([EMAIL PROTECTED]) wrote: > On Tue, 21 Aug 2007, Mathieu Desnoyers wrote: > > > - Fixed an erroneous test in slab_free() (logic was flipped from the > > original code when testing for slow path. It explains the wrong > > numbers you have with big free). > > If you look

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote: > Therefore, in the test where we have separate passes for slub allocation > and free, we hit mostly the slow path. Any particular reason for that ? Maybe on SMP you are schedule to run on a different processor? Note that I ran my tests at early

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote: > If you are interested in the raw numbers: > > The (very basic) test module follows. Make sure you change get_cycles() > for get_cycles_sync() if you plan to run this on x86_64. Which test is which? Would you be able to format this in a way that we

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Christoph Lameter
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote: > - Fixed an erroneous test in slab_free() (logic was flipped from the > original code when testing for slow path. It explains the wrong > numbers you have with big free). If you look at the numbers that I posted earlier then you will see that

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
* Mathieu Desnoyers ([EMAIL PROTECTED]) wrote: > Ok, I played with your patch a bit, and the results are quite > interesting: > ... > Summary: > > (tests repeated 1 times on a 3GHz Pentium 4) > (kernel DEBUG menuconfig options are turned off) > results are in cycles per iteration > I did 2

Re: [PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
Hi Christoph, If you are interested in the raw numbers: The (very basic) test module follows. Make sure you change get_cycles() for get_cycles_sync() if you plan to run this on x86_64. (tests taken on a 3GHz Pentium 4) * slub HEAD, test 1 [ 99.774699] SLUB Performance testing [ 99.785431]

[PATCH] SLUB use cmpxchg_local

2007-08-21 Thread Mathieu Desnoyers
Ok, I played with your patch a bit, and the results are quite interesting: SLUB use cmpxchg_local my changes: - Fixed an erroneous test in slab_free() (logic was flipped from the original code when testing for slow path. It explains the wrong numbers you have with big free). - Use
