* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> Measurements on IA64 slub w/per cpu vs slub w/per cpu/cmpxchg_local
> emulation. Results are not good:
>
Hi Christoph,
I tried to come up with a patch set implementing the basics of a new
critical section: local_enter(flags) and local_exit(flags).
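The patch set itself is not reproduced in this archive. As a rough orientation only, here is a minimal sketch of what such local_enter()/local_exit() primitives could look like, assuming the intent is to keep the section cheap (preemption off) on architectures with a fast cmpxchg_local and to fall back to disabling interrupts elsewhere; the config symbol below is hypothetical, not from the thread.

/*
 * Hedged sketch, not Mathieu's actual patch.
 */
#ifdef CONFIG_HAVE_FAST_CMPXCHG_LOCAL		/* hypothetical symbol */
#define local_enter(flags)	do { (void)(flags); preempt_disable(); } while (0)
#define local_exit(flags)	do { (void)(flags); preempt_enable(); } while (0)
#else
#define local_enter(flags)	local_irq_save(flags)
#define local_exit(flags)	local_irq_restore(flags)
#endif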
On Tue, 2007-08-28 at 12:36 -0700, Christoph Lameter wrote:
> On Tue, 28 Aug 2007, Peter Zijlstra wrote:
>
> > On Mon, 2007-08-27 at 15:15 -0700, Christoph Lameter wrote:
> > > Hmmm. One wild idea would be to use a priority futex for the slab lock?
> > > That would make the slow paths
On Tue, 28 Aug 2007, Mathieu Desnoyers wrote:
> Ok, I just had a look at the ia64 instruction set, and I fear that cmpxchg
> must always come with the acquire or release semantic. Is there any
> cmpxchg equivalent on ia64 that would be acquire and release semantic
> free? This implicit memory
On Tue, 28 Aug 2007, Peter Zijlstra wrote:
> On Mon, 2007-08-27 at 15:15 -0700, Christoph Lameter wrote:
> > Hmmm. One wild idea would be to use a priority futex for the slab lock?
> > That would make the slow paths interrupt safe without requiring interrupt
> > disable? Does a futex fit into
Ok, I just had a look at the ia64 instruction set, and I fear that cmpxchg
must always come with the acquire or release semantic. Is there any
cmpxchg equivalent on ia64 that would be acquire and release semantic
free? This implicit memory ordering in the instruction seems to be
responsible for the
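As background for the concern above: the ia64 ISA offers cmpxchg only with an .acq or .rel completer, so even a purely CPU-local compare-and-exchange pays for ordering. Below is a hand-written illustration of the 4-byte acquire form, a sketch for reference modeled on the usual ia64 inline assembly, not code quoted from this thread.

/*
 * Illustration only: ia64 cmpxchg always carries acquire (.acq) or
 * release (.rel) ordering; there is no fully relaxed form that a cheap
 * cmpxchg_local could map to.
 */
static inline unsigned int
ia64_cmpxchg4_acq_sketch(volatile unsigned int *ptr,
			 unsigned int old, unsigned int new)
{
	unsigned int ret;

	/* ar.ccv holds the value the memory operand is compared against */
	asm volatile ("mov ar.ccv=%0;;" :: "rO" (old));
	asm volatile ("cmpxchg4.acq %0=[%1],%2,ar.ccv"
		      : "=r" (ret)
		      : "r" (ptr), "r" (new)
		      : "memory");
	return ret;
}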
On Mon, 2007-08-27 at 15:15 -0700, Christoph Lameter wrote:
> Hmmm. One wild idea would be to use a priority futex for the slab lock?
> That would make the slow paths interrupt safe without requiring interrupt
> disable? Does a futex fit into the page struct?
Very much puzzled at what you
Measurements on IA64 slub w/per cpu vs slub w/per cpu/cmpxchg_local
emulation. Results are not good:
slub/per cpu
1 times kmalloc(8)/kfree -> 105 cycles
1 times kmalloc(16)/kfree -> 104 cycles
1 times kmalloc(32)/kfree -> 105 cycles
1 times kmalloc(64)/kfree -> 104 cycles
1
On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:
> Hrm, I just want to certify one thing: A lot of code paths seem to go
> to the slow path without requiring cmpxchg_local to execute at all. So
> is the slow path more likely to be triggered by the (!object),
> (!node_match) tests or by these same
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:
>
> > > The slow path would require disable preemption and two interrupt disables.
> > If the slow path has to call new_slab, then yes. But it seems that not
> > every slow path must call it, so for
Hmmm. One wild idea would be to use a priority futex for the slab lock?
That would make the slow paths interrupt safe without requiring interrupt
disable? Does a futex fit into the page struct?
On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:
> > The slow path would require disable preemption and two interrupt disables.
> If the slow path has to call new_slab, then yes. But it seems that not
> every slow path must call it, so for the other slow paths, only one
> interrupt disable would be
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:
>
> > > a clean solution source code wise. It also minimizes the interrupt
> > > holdoff
> > > for the non-cmpxchg_local arches. However, it means that we will have to
> > > disable interrupts twice
On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:
> > a clean solution source code wise. It also minimizes the interrupt holdoff
> > for the non-cmpxchg_local arches. However, it means that we will have to
> > disable interrupts twice for the slow path. If that is too expensive then
> > we need a
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> I think the simplest solution may be to leave slub as done in the patch
> that we developed last week. The arch must provide a cmpxchg_local that is
> performance wise the fastest possible. On x86 this is going to be the
> cmpxchg_local, on others
I think the simplest solution may be to leave slub as done in the patch
that we developed last week. The arch must provide a cmpxchg_local that is
performance wise the fastest possible. On x86 this is going to be the
cmpxchg_local, on others where cmpxchg is slower than interrupt
disable/enable
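On architectures where cmpxchg is slower than an interrupt disable/enable pair, the emulation referred to here is essentially a compare-and-store done with local interrupts off. A minimal, hedged sketch of that fallback (a generic illustration, not any specific patch from this thread):

#include <linux/irqflags.h>

/*
 * Generic fallback sketch: a CPU-local cmpxchg only has to be atomic
 * with respect to this CPU's own interrupt handlers, so disabling
 * local interrupts around a plain compare-and-store is enough.
 * Returns the previous value; the caller treats (prev == old) as success.
 */
static inline unsigned long cmpxchg_local_emulated(unsigned long *ptr,
						   unsigned long old,
						   unsigned long new)
{
	unsigned long flags, prev;

	local_irq_save(flags);
	prev = *ptr;
	if (prev == old)
		*ptr = new;
	local_irq_restore(flags);
	return prev;
}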
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:
>
> > * Christoph Lameter ([EMAIL PROTECTED]) wrote:
> > > On Mon, 27 Aug 2007, Peter Zijlstra wrote:
> > >
> > > > So, if the fast path can be done with a preempt off, it might be doable
> > > > to
On Mon, 27 Aug 2007, Mathieu Desnoyers wrote:
> * Christoph Lameter ([EMAIL PROTECTED]) wrote:
> > On Mon, 27 Aug 2007, Peter Zijlstra wrote:
> >
> > > So, if the fast path can be done with a preempt off, it might be doable
> > > to suffer the slow path with a per cpu lock like that.
> >
> >
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Mon, 27 Aug 2007, Peter Zijlstra wrote:
>
> > So, if the fast path can be done with a preempt off, it might be doable
> > to suffer the slow path with a per cpu lock like that.
>
> Sadly the cmpxchg_local requires local per cpu data access.
On Mon, 27 Aug 2007, Peter Zijlstra wrote:
> So, if the fast path can be done with a preempt off, it might be doable
> to suffer the slow path with a per cpu lock like that.
Sadly the cmpxchg_local requires local per cpu data access. Isn't there
some way to make this less expensive on RT?
On Tue, 2007-08-21 at 16:14 -0700, Christoph Lameter wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
>
> > - Changed smp_rmb() for barrier(). We are not interested in read order
> > across cpus, what we want is to be ordered wrt local interrupts only.
> > barrier() is much cheaper than
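The point being argued: smp_rmb() orders reads as observed by other CPUs, while the slub fast path only races with the local CPU's own interrupt handlers, for which a compiler barrier is enough. A generic, hedged illustration follows (not the slub code; the per-cpu counter and helper are made up), assuming the caller already has preemption disabled:

#include <linux/percpu.h>

static DEFINE_PER_CPU(unsigned long, local_events);	/* hypothetical: bumped
							   by this CPU's IRQ handler */

/* Caller is assumed to run with preemption disabled, so the 2007-era
 * __get_cpu_var() accessor is safe to use directly. */
static unsigned long local_events_delta(void)
{
	unsigned long before, after;

	before = __get_cpu_var(local_events);
	barrier();	/* keep the two loads in program order wrt our own IRQs;
			   no smp_rmb() needed, nothing here is cross-CPU */
	after = __get_cpu_var(local_events);

	return after - before;	/* interrupts taken between the two reads */
}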
Ok so we need this.
Fix up preempt checks.
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
mm/slub.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
Index: linux-2.6/mm/slub.c
===
--- linux-2.6.orig/mm/slub.c
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Wed, 22 Aug 2007, Mathieu Desnoyers wrote:
>
> > * Christoph Lameter ([EMAIL PROTECTED]) wrote:
> > > void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
> > > @@ -1577,7 +1590,10 @@ static void __slab_free(struct kmem_cach
> > > {
>
On Wed, 22 Aug 2007, Mathieu Desnoyers wrote:
> > Then the thread could be preempted and rescheduled on a different cpu
> > between put_cpu and local_irq_save() which means that we lose the
> > state information of the kmem_cache_cpu structure.
> >
>
> Maybe I am misunderstanding something,
On Wed, 22 Aug 2007, Mathieu Desnoyers wrote:
> * Christoph Lameter ([EMAIL PROTECTED]) wrote:
> > void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
> > @@ -1577,7 +1590,10 @@ static void __slab_free(struct kmem_cach
> > {
> > void *prior;
> > void **object = (void *)x;
> > +
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
> @@ -1577,7 +1590,10 @@ static void __slab_free(struct kmem_cach
> {
> void *prior;
> void **object = (void *)x;
> + unsigned long flags;
>
> + local_irq_save(flags);
Here is the current cmpxchg_local version that I used for testing.
Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
---
include/linux/slub_def.h | 10 +++---
mm/slub.c | 74 ---
2 files changed, 56 insertions(+), 28 deletions(-)
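The shape of the allocation fast path in that version, paraphrased as a hedged sketch rather than the patch verbatim (kmem_cache_cpu, c->freelist, c->offset, node_match() and __slab_alloc() are the slub names discussed in this thread): pop the first object off the per-cpu freelist with cmpxchg_local() and retry if a local interrupt changed the list underneath us; anything else falls through to the existing __slab_alloc() slow path.

/*
 * Hedged sketch of the allocation fast path, not the patch itself.
 */
static void *slab_alloc_fastpath_sketch(struct kmem_cache *s, gfp_t gfpflags,
					int node, void *addr,
					struct kmem_cache_cpu *c)
{
	void **object;

redo:
	object = c->freelist;
	if (unlikely(!object || !node_match(c, node)))
		return __slab_alloc(s, gfpflags, node, addr, c);

	/* Only this CPU (possibly from interrupt context) touches
	 * c->freelist, so a LOCK-less, CPU-local cmpxchg is enough. */
	if (cmpxchg_local(&c->freelist, object, object[c->offset]) != object)
		goto redo;		/* a local interrupt changed the list: retry */

	return object;
}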
I can confirm Mathieu's measurements now:
Athlon64:
regular NUMA/discontig
1. Kmalloc: Repeatedly allocate then free test
1 times kmalloc(8) -> 79 cycles kfree -> 92 cycles
1 times kmalloc(16) -> 79 cycles kfree -> 93 cycles
1 times kmalloc(32) -> 88 cycles kfree -> 95 cycles
1
Measurements on an AMD64 2.0 GHz dual-core
In this test, we seem to remove 10 cycles from the kmalloc fast path.
On small allocations, it gives a 14% performance increase. kfree fast
path also seems to have a 10 cycles improvement.
1. Kmalloc: Repeatedly allocate then free test
* cmpxchg_local
On Wed, Aug 22, 2007 at 09:45:33AM -0400, Mathieu Desnoyers wrote:
> Measurements on an AMD64 2.0 GHz dual-core
>
> In this test, we seem to remove 10 cycles from the kmalloc fast path.
> On small allocations, it gives a 14% performance increase. kfree fast
> path also seems to have a 10 cycles
On Tue, Aug 21, 2007 at 06:06:19PM -0700, Christoph Lameter wrote:
> Ok. Measurements vs. simple cmpxchg on an Intel(R) Pentium(R) 4 CPU 3.20GHz
Note the P4 is an extreme case in that "unusual" instructions are
quite slow (basically anything that falls out of the trace cache). Core2
tends to be
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
>
> > As I am going back through the initial cmpxchg_local implementation, it
> > seems like it was executing __slab_alloc() with preemption disabled,
> > which is wrong. new_slab() is not designed for
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> Ok. Measurements vs. simple cmpxchg on an Intel(R) Pentium(R) 4 CPU 3.20GHz
> (hyperthreading enabled). Test run with your module show only minor
> performance improvements and lots of regressions. So we must have
> cmpxchg_local to see any
Ok. Measurements vs. simple cmpxchg on an Intel(R) Pentium(R) 4 CPU 3.20GHz
(hyperthreading enabled). Test run with your module show only minor
performance improvements and lots of regressions. So we must have
cmpxchg_local to see any improvements? Some kind of a recent optimization
of cmpxchg
* Andi Kleen ([EMAIL PROTECTED]) wrote:
> Mathieu Desnoyers <[EMAIL PROTECTED]> writes:
> >
> > The measurements I get (in cycles):
> >              enable interrupts (STI)   disable interrupts (CLI)   local CMPXCHG
> > IA32 (P4)    112                       82                         26
>
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> As I am going back through the initial cmpxchg_local implementation, it
> seems like it was executing __slab_alloc() with preemption disabled,
> which is wrong. new_slab() is not designed for that.
The version I sent you did not use preemption.
We
Mathieu Desnoyers <[EMAIL PROTECTED]> writes:
>
> The measurements I get (in cycles):
>              enable interrupts (STI)   disable interrupts (CLI)   local CMPXCHG
> IA32 (P4)    112                       82                         26
> x86_64 AMD64 125                       102
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
>
> > - Rounding error.. you seem to round at 0.1ms, but I keep the values in
> > cycles. The times that you get (1.1ms) seems strangely higher than
> > mine, which are under 1000 cycles on a 3GHz
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> - Rounding error.. you seem to round at 0.1ms, but I keep the values in
> cycles. The times that you get (1.1ms) seems strangely higher than
> mine, which are under 1000 cycles on a 3GHz system (less than 333ns).
> I guess there is both a ms -
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
>
> > Are you running a UP or SMP kernel ? If you run a UP kernel, the
> > cmpxchg_local and cmpxchg are identical.
>
> UP.
>
> > Oh, and if you run your tests at boot time, the alternatives code may
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> Are you running a UP or SMP kernel ? If you run a UP kernel, the
> cmpxchg_local and cmpxchg are identical.
UP.
> Oh, and if you run your tests at boot time, the alternatives code may
> have removed the lock prefix, therefore making cmpxchg and
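For context on the UP observation: on x86 the only difference between cmpxchg() and cmpxchg_local() is the LOCK prefix, so a UP kernel (or boot-time alternatives patching that strips LOCK) makes the two indistinguishable. Below is a hedged sketch of a 32-bit local variant, modeled on the classic i386 inline assembly form:

/*
 * Sketch of a CPU-local 32-bit cmpxchg on x86: identical to the SMP
 * version except that the "lock" prefix is omitted, which is exactly
 * why UP kernels (and LOCK-stripping alternatives) hide the cost
 * difference being measured in this thread.
 */
static inline unsigned int cmpxchg32_local_sketch(volatile unsigned int *ptr,
						  unsigned int old,
						  unsigned int new)
{
	unsigned int prev;

	asm volatile("cmpxchgl %1,%2"		/* no lock prefix: not SMP-safe */
		     : "=a" (prev)
		     : "r" (new), "m" (*ptr), "0" (old)
		     : "memory");
	return prev;
}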
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
>
> > Using cmpxchg_local vs cmpxchg has a clear impact on the fast paths, as
> > shown below: it saves about 60 to 70 cycles for kmalloc and 200 cycles
> > for the kmalloc/kfree pair (test 2).
>
>
* Mathieu Desnoyers ([EMAIL PROTECTED]) wrote:
> * Christoph Lameter ([EMAIL PROTECTED]) wrote:
> > On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> >
> > > - Changed smp_rmb() for barrier(). We are not interested in read order
> > > across cpus, what we want is to be ordered wrt local
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> Using cmpxchg_local vs cmpxchg has a clear impact on the fast paths, as
> shown below: it saves about 60 to 70 cycles for kmalloc and 200 cycles
> for the kmalloc/kfree pair (test 2).
Hmmm.. I wonder if the AMD processors simply do the same in
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> kmalloc(8)/kfree = 112 cycles
> kmalloc(16)/kfree = 103 cycles
> kmalloc(32)/kfree = 103 cycles
> kmalloc(64)/kfree = 103 cycles
> kmalloc(128)/kfree = 112 cycles
> kmalloc(256)/kfree = 111 cycles
> kmalloc(512)/kfree = 111 cycles
>
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
>
> > SLUB Use cmpxchg() everywhere.
> >
> > It applies to "SLUB: Single atomic instruction alloc/free using
> > cmpxchg".
>
> > +++ slab/mm/slub.c 2007-08-20 18:42:28.0 -0400
> > @@ -1682,7
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> * cmpxchg_local Slub test
> kmalloc(8) = 83 cycles     kfree = 363 cycles
> kmalloc(16) = 85 cycles    kfree = 372 cycles
> kmalloc(32) = 92 cycles    kfree = 377 cycles
> kmalloc(64) = 115 cycles   kfree = 397 cycles
>
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
>
> > - Changed smp_rmb() for barrier(). We are not interested in read order
> > across cpus, what we want is to be ordered wrt local interrupts only.
> > barrier() is much cheaper than a rmb().
>
Reformatting...
* Mathieu Desnoyers ([EMAIL PROTECTED]) wrote:
> Hi Christoph,
>
> If you are interested in the raw numbers:
>
> The (very basic) test module follows. Make sure you change get_cycles()
> for get_cycles_sync() if you plan to run this on x86_64.
>
> (tests taken on a 3GHz Pentium
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> SLUB Use cmpxchg() everywhere.
>
> It applies to "SLUB: Single atomic instruction alloc/free using
> cmpxchg".
> +++ slab/mm/slub.c 2007-08-20 18:42:28.0 -0400
> @@ -1682,7 +1682,7 @@ redo:
>
> object[c->offset] = freelist;
>
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> - Changed smp_rmb() for barrier(). We are not interested in read order
> across cpus, what we want is to be ordered wrt local interrupts only.
> barrier() is much cheaper than a rmb().
But this means a preempt disable is required. RT users do
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
>
> > - Fixed an erroneous test in slab_free() (logic was flipped from the
> > original code when testing for slow path. It explains the wrong
> > numbers you have with big free).
>
> If you look
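For orientation, the structure that test sits in: slab_free() first tries a CPU-local push onto the per-cpu freelist and only falls back to __slab_free() (the lock-taking, interrupt-disabling slow path) when the object does not belong to the currently active per-cpu slab. A hedged sketch of that split follows; it is not the exact patch, and preemption/migration handling is elided.

/*
 * Hedged sketch of the free-side split, not the exact patch.
 * Preemption/migration handling is elided; see the put_cpu() vs
 * local_irq_save() discussion elsewhere in this thread.
 */
static void slab_free_sketch(struct kmem_cache *s, struct page *page,
			     void *x, void *addr)
{
	void **object = (void *)x;
	void **freelist;
	struct kmem_cache_cpu *c;

redo:
	c = get_cpu_slab(s, raw_smp_processor_id());

	if (likely(page == c->page)) {
		/* Fast path: the object belongs to this CPU's active slab. */
		freelist = c->freelist;
		object[c->offset] = freelist;	/* chain object in front of the list */
		if (cmpxchg_local(&c->freelist, freelist, object) != freelist)
			goto redo;		/* raced with a local interrupt: retry */
		return;
	}

	/* Slow path: different slab or debug checks -- takes the slab lock. */
	__slab_free(s, page, x, addr);
}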
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> Therefore, in the test where we have separate passes for slub allocation
> and free, we hit mostly the slow path. Any particular reason for that ?
Maybe on SMP you are scheduled to run on a different processor? Note that
I ran my tests at early
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> If you are interested in the raw numbers:
>
> The (very basic) test module follows. Make sure you change get_cycles()
> for get_cycles_sync() if you plan to run this on x86_64.
Which test is which? Would you be able to format this in a way that we
On Tue, 21 Aug 2007, Mathieu Desnoyers wrote:
> - Fixed an erroneous test in slab_free() (logic was flipped from the
> original code when testing for slow path. It explains the wrong
> numbers you have with big free).
If you look at the numbers that I posted earlier then you will see that
* Mathieu Desnoyers ([EMAIL PROTECTED]) wrote:
> Ok, I played with your patch a bit, and the results are quite
> interesting:
>
...
> Summary:
>
> (tests repeated 1 times on a 3GHz Pentium 4)
> (kernel DEBUG menuconfig options are turned off)
> results are in cycles per iteration
> I did 2
Hi Christoph,
If you are interested in the raw numbers:
The (very basic) test module follows. Make sure you change get_cycles()
for get_cycles_sync() if you plan to run this on x86_64.
(tests taken on a 3GHz Pentium 4)
* slub HEAD, test 1
[ 99.774699] SLUB Performance testing
[ 99.785431]
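The raw numbers in this thread come from a small in-kernel timing module that is not reproduced in this archive. Below is a hedged sketch of the same kind of measurement loop, timing kmalloc()/kfree() pairs with get_cycles(); as noted above, x86_64 wants the serializing get_cycles_sync() instead. The iteration count and allocation size are arbitrary choices for this sketch.

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <asm/timex.h>

#define ITERATIONS	10000	/* arbitrary choice for this sketch */

static int __init slub_cycles_init(void)
{
	cycles_t t1, t2;
	void *p;
	int i;

	t1 = get_cycles();	/* use get_cycles_sync() on x86_64, per the thread */
	for (i = 0; i < ITERATIONS; i++) {
		p = kmalloc(8, GFP_KERNEL);
		kfree(p);
	}
	t2 = get_cycles();

	printk(KERN_INFO "kmalloc(8)/kfree: %llu cycles per iteration\n",
	       (unsigned long long)(t2 - t1) / ITERATIONS);

	return -EAGAIN;		/* fail the load on purpose: we only wanted the printk */
}
module_init(slub_cycles_init);

MODULE_LICENSE("GPL");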
Ok, I played with your patch a bit, and the results are quite
interesting:
SLUB use cmpxchg_local
my changes:
- Fixed an erroneous test in slab_free() (logic was flipped from the
original code when testing for slow path. It explains the wrong
numbers you have with big free).
- Use