Re: [patch] x86: improved memory barrier implementation

2007-09-30 Thread Nick Piggin
On Sat, Sep 29, 2007 at 09:07:30AM -0700, Linus Torvalds wrote:
> 
> 
> On Sat, 29 Sep 2007, Nick Piggin wrote:
> > > 
> > > The non-temporal stores should be basically considered to be "IO", not
> > > any normal memory operation.
> > 
> > Maybe you're thinking of uncached / WC? Non-temporal stores to cacheable
> > RAM apparently can go out of order too, and they are being used in the
> > kernel for some things.
> 
> I'm really saying that, to a first approximation, people should think "NT
> is an IO (DMA) thing". Whether cached or not. Exactly because they do not
> honor the normal memory ordering.
> 
> It may be worth noting that "clflush" falls under that heading too - even 
> if all the actual *writes* were done with totally normal writes, if 
> anybody does a clflush instruction, that breaks the ordering, and that 
> turns it to "DMA ordering" again - ie we're not talking about the normal 
> SMP ordering rules at all.
> 
> So all the spinlocks and all the smp_*mb() barriers have never really done 
> *anything* for those things (in particular, "smp_wmb()" has *always* 
> ignored them on i386!)

Yes, that's true. I'm just noting that in some places we do non-temporal
stores to cacheable RAM, and in those cases we have to remember to be
careful about memory ordering (especially in library functions like
copy_user_nocache, where we never know exactly what it will be used for,
even if it seems safe).

Speaking of which, would it be a good idea to put an sfence before and
after this function? It seems like the safe thing to do...
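
Roughly what I have in mind, as a userspace-style, x86-64-flavoured sketch
only (the helper name and interface are made up, not the real
copy_user_nocache):

	#include <stddef.h>
	#include <stdint.h>

	/* Copy 8-byte words with non-temporal stores, fenced on both sides
	 * so the NT stores cannot be reordered against the surrounding
	 * ordinary stores. */
	static void nt_copy_fenced(uint64_t *dst, const uint64_t *src, size_t n)
	{
		size_t i;

		asm volatile("sfence" ::: "memory");	/* order vs earlier stores (defensive) */
		for (i = 0; i < n; i++)
			asm volatile("movnti %1, %0"
				     : "=m" (dst[i]) : "r" (src[i]));
		asm volatile("sfence" ::: "memory");	/* drain NT stores before later ones */
	}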


> > Likewise for rep stos, apparently.
> 
> No. As far as I can tell, the fast string operations are unordered
> *within themselves*, but not wrt the operations around them.
> 
> In other words, you cannot depend on the ordering of stores *in* the 
> memcpy() or memset() when it is implemented by "rep movs/stos" - but that 
> is 100% equivalent to the fact that you cannot depend on the ordering even 
> when it isn't - since the "memcpy()" library routine might be copying 
> memory backwards for all you know!
> 
> The Intel memory ordering paper doesn't talk about the fast string 
> instructions (except to say that the rules it *does* speak about do not 
> hold), but the regular IA manuals do say (for example):
> 
>    "Code dependent upon sequential store ordering should not use the
> string operations for the entire data structure to be stored. Data and 
> semaphores should be separated. Order dependent code should use a 
> discrete semaphore uniquely stored to after any string operations to 
> allow correctly ordered data to be seen by all processors."
> 
> and note how it says you should just store to the semaphore. If you think
> about it, that semaphore involves all the memory ordering requirements
> that we *already* depend on, so if a semaphore is sufficient to order the
> fast string instructions, then by definition using a spinlock around them
> must be the same thing!
> 
> In other words, by Intel's architecture manual, fast string instructions
> cannot escape a "semaphore" - but that means that they cannot escape a
> spinlock either, since the two are exactly the same wrt memory ordering
> rules! (In other words, whenever the Intel docs say "semaphore", think
> "mutual exclusion lock", not necessarily the kernel kind of "sleeping
> semaphore".)
> 
> But it might be good to have that explicitly mentioned in the IA memory 
> ordering thing, so I'll ask the Intel people about that. However, I'd say 
> that given our *current* documentation, string instructions may be 
> *internally* out-of-order, but they would not escape a lock.

OK, it's good to hear that from you :) The one tiny doubt I have left
is that, IIRC, Intel sometimes talks about lock-prefixed operations as
"semaphore operations"? I don't suppose that would be the case here? Anyway,
if you can get them to put something in writing, that would be grand.
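
For concreteness, the pattern I read the manual as blessing (illustrative
only, names made up):

	memset(shared_buf, 0, BUF_SIZE);  /* may be rep stos internally */
	barrier();                        /* compiler: keep the flag store last */
	shared_ready = 1;                 /* the "semaphore": one ordinary store,
	                                   * which the string op's stores cannot
	                                   * pass, by the reading above */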

 
> > But this means they are already at odds with spin_unlock, unless they 
> > are enclosed with mfences everywhere they are used (of which I think 
> > most are not). So this is an existing bug in the kernel.
> 
> See above. I do not believe that it's an existing bug, but the basic point 
> that the change to "smp_rmb()" doesn't change our existing rules is true.

OK, so all that's left is to restore the locked smp_rmb case for broken
ppro and resend the patch?
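
I.e. something along these lines -- a sketch, and I've invented the config
symbol name here:

	#ifdef CONFIG_X86_PPRO_FENCE	/* made-up name for the broken-ppro option */
	/* PPro errata: use a locked op so reads are ordered (including,
	 * per Alan, vs. stores) on the affected parts. */
	#define smp_rmb()	asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
	#else
	#define smp_rmb()	barrier()
	#endif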


Re: [patch] x86: improved memory barrier implementation

2007-09-29 Thread Dave Jones
On Fri, Sep 28, 2007 at 05:07:19PM +0100, Alan Cox wrote:
 
 > > Winchip: can any of these CPUs with ooostores do SMP? If not, then smp_wmb
 > > can also be a simple barrier on i386 too.
 > 
 > The IDT Winchip can do SMP apparently.

From the WinChip 3 (which was the final WinChip) specs:

"The IDT WinChip 3 processor also omits the software interface
 to the Intel-proprietary symmetric multiprocessing support: APIC.
 This bus function is omitted since the target market for the
 IDT WinChip 3 processor is typical desktop systems (which
 do not support APIC multiprocessing)."

It didn't offer any alternative DIY-SMP either (or at least
none that's documented).

Centaur only became SMP-capable with some of the C3 Nehemiahs
a year or two back.

Dave

-- 
http://www.codemonkey.org.uk


Re: [patch] x86: improved memory barrier implementation

2007-09-29 Thread Linus Torvalds


On Sat, 29 Sep 2007, Nick Piggin wrote:
> > 
> > The non-temporal stores should be basically considered to be "IO", not any 
> > normal memory operation.
> 
> Maybe you're thinking of uncached / WC? Non-temporal stores to cacheable
> RAM apparently can go out of order too, and they are being used in the kernel
> for some things.

I'm really saying that, to a first approximation, people should think "NT is
an IO (DMA) thing". Whether cached or not. Exactly because they do not
honor the normal memory ordering.

It may be worth noting that "clflush" falls under that heading too - even 
if all the actual *writes* were done with totally normal writes, if 
anybody does a clflush instruction, that breaks the ordering, and that 
turns it to "DMA ordering" again - ie we're not talking about the normal 
SMP ordering rules at all.

So all the spinlocks and all the smp_*mb() barriers have never really done 
*anything* for those things (in particular, "smp_wmb()" has *always* 
ignored them on i386!)

> Likewise for rep stos, apparently.

No. As far as I can tell, the fast string operations are unordered
*within themselves*, but not wrt the operations around them.

In other words, you cannot depend on the ordering of stores *in* the 
memcpy() or memset() when it is implemented by "rep movs/stos" - but that 
is 100% equivalent to the fact that you cannot depend on the ordering even 
when it isn't - since the "memcpy()" library routine might be copying 
memory backwards for all you know!

The Intel memory ordering paper doesn't talk about the fast string 
instructions (except to say that the rules it *does* speak about do not 
hold), but the regular IA manuals do say (for example):

   "Code dependent upon sequential store ordering should not use the 
string operations for the entire data structure to be stored. Data and 
semaphores should be separated. Order dependent code should use a 
discrete semaphore uniquely stored to after any string operations to 
allow correctly ordered data to be seen by all processors."

and note how it says you should just store to the semaphore. If you think
about it, that semaphore involves all the memory ordering requirements
that we *already* depend on, so if a semaphore is sufficient to order the
fast string instructions, then by definition using a spinlock around them
must be the same thing!

In other words, by Intel's architecture manual, fast string instructions
cannot escape a "semaphore" - but that means that they cannot escape a
spinlock either, since the two are exactly the same wrt memory ordering
rules! (In other words, whenever the Intel docs say "semaphore", think
"mutual exclusion lock", not necessarily the kernel kind of "sleeping
semaphore".)

But it might be good to have that explicitly mentioned in the IA memory 
ordering thing, so I'll ask the Intel people about that. However, I'd say 
that given our *current* documentation, string instructions may be 
*internally* out-of-order, but they would not escape a lock.

> But this means they are already at odds with spin_unlock, unless they 
> are enclosed with mfences everywhere they are used (of which I think 
> most are not). So this is an existing bug in the kernel.

See above. I do not believe that it's an existing bug, but the basic point 
that the change to "smp_rmb()" doesn't change our existing rules is true.

Linus


Re: [patch] x86: improved memory barrier implementation

2007-09-29 Thread Nick Piggin
On Fri, Sep 28, 2007 at 06:18:31PM +0100, Alan Cox wrote:
> > on the broken ppro stores config option if you just tell me what should
> > be there (again, remember that my patch isn't actually changing anything
> > already there except for smp_rmb side).
> 
> The PPro needs rmb to ensure a store doesn't go for a walk on the wild
> side and pass the read especially when we are dealing with device space.

smp_rmb (i.e. the non-device-space version) needs to order stores too?
OK, that's definitely something we'll need a locked op for. I can't see how
it could possibly close all holes, though: the level at which memory-ordering
dependencies operate in lockless code is way above anything the hardware can
know about.

For any two given code sequences:

	store 1 -> X
	store 2 -> Y

and

	load Y
	load X

There may be a requirement to have wmb and rmb respectively for
correctness, or it may be perfectly correct without ordering. Unless the
issue is *purely* that stores and/or loads can erroneously execute out
of order, adding magic to the barrier primitives cannot fix all cases
(and the errata certainly looked spookier than simple out-of-order
problems).
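
For reference, the textbook case where both barriers *are* required is the
usual message-passing pattern:

	/* CPU 0 */                     /* CPU 1 */
	data = 42;                      while (!flag)
	smp_wmb();                              ;
	flag = 1;                       smp_rmb();
	                                assert(data == 42);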


> The rest of the stuff is a little vague which makes me nervous when
> considering relaxing the PPro case for smp_rmb. With smp_rmb as it is at
> the moment the PPro is effectively treated as an out of order cpu and so
> shouldn't hit anything occasionally that a PPC wouldn't hit every time.

Well, it's all a little vague to me ;) (maybe you were privy to info that you
aren't allowed to talk about...). IIRC, the errata I could find mentioned
concurrent stores to a single variable as being problematic, and nothing
about memory ordering, or loads, at all.

One difference from a normal weakly ordered machine is that the normal
one wouldn't require rmb to order loads vs stores, for example.

Anyway, after all that blowing of smoke (I'm just genuinely curious), I have
no problem with any fixups for ppro, and defer to your greater understanding
of the issues. However, all that can go under the config variable we have
for that purpose. Let me know what you need. Is it just a matter of putting
the dummy lock'ed op back in smp_rmb? Are you sure you don't also need
smp_wmb to order stores? I can cook up a patch to do either.


> > > - and for modern processors it's still not remotely clear your patch is
> > > correct because of NT stores.
> > 
> > No. We already _have_ to rely on this model for barriers 
> 
> Well Linus has dealt with the question of NT stores for us now...

He actually didn't, quite ;) (see other post)


> Given this isn't an issue on 64bit I'm inclined to argue that only 64bit
> behaviour should be changed at this point.

That's the easy way out, but I think it creates more problems down the
line. It diverges the code base and the testing base. I think it is pretty
simple (and serves as good documentation) to extend the use of the ppro
ifdef to do what we need here, rather than implicitly relying on the
barriers being stronger than required for non-broken implementations.



Re: [patch] x86: improved memory barrier implementation

2007-09-29 Thread Nick Piggin
On Fri, Sep 28, 2007 at 09:15:06AM -0700, Linus Torvalds wrote:
> 
> 
> On Fri, 28 Sep 2007, Alan Cox wrote:
> > 
> > However 
> > - You've not shown the patch has any performance gain
> 
> It would be nice to see this.

Actually, in a userspace test I have (which does enough work to trigger
out-of-order operations on my core2 but is otherwise pretty trivial),
lfence takes 13 cycles and sfence takes 40 (neither of which actually
solves the problem of load vs store ordering, but at least they might be
operating on a slightly utilised memory subsystem rather than the
stupidest possible microbenchmark).

The dummy lock op takes around 75 cycles (of course, the core2 would
always use the fences, but older CPUs will not and will be worse at the
lock op too, probably).

I suppose these could take significantly longer if there are uncached
memory operations and such (I wasn't doing any significant amount
of IO) -- I can't be sure, though.

So it isn't much, but it could be helpful. If the code is important enough
to go without locks and instead use complex barriers, it might easily be
worth saving these cycles.
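
For reference, the skeleton of the measurement is nothing fancy (my real
test interleaves more work between the fences; this is just the minimal
rdtsc loop):

	#include <stdint.h>
	#include <stdio.h>

	static inline uint64_t rdtsc(void)
	{
		uint32_t lo, hi;
		asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
		return ((uint64_t)hi << 32) | lo;
	}

	int main(void)
	{
		uint64_t t0, t1, best = ~0ULL;
		int i;

		for (i = 0; i < 1000000; i++) {
			t0 = rdtsc();
			asm volatile("sfence" ::: "memory");
			t1 = rdtsc();
			if (t1 - t0 < best)	/* keep the minimum over many runs */
				best = t1 - t0;
		}
		printf("sfence: ~%llu cycles (incl. rdtsc overhead)\n",
		       (unsigned long long)best);
		return 0;
	}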


> > - You've probably broken Pentium Pro
> 
> Probably not a big deal, but yeah, we should have that broken-ppro option.

Will add; I'll ask Alan to specify what he'd like to see there.


> > - and for modern processors it's still not remotely clear your patch is
> > correct because of NT stores.
> 
> This one I disagree with. The *old* code doesn't honor NT stores *either*. 
> 
> The fact is, if you use NT stores and then depend on ordering, it has 
> nothing whatsoever to do with spinlocks or smp_[rw]mb. You need to use
> the DMA barriers (ie the non-smp ones).
> 
> The non-temporal stores should be basically considered to be "IO", not any 
> normal memory operation.

Maybe you're thinking of uncached / WC? Non-temporal stores to cacheable
RAM apparently can go out of order too, and they are being used in the kernel
for some things. Likewise for rep stos, apparently. But this means they are
already at odds with spin_unlock, unless they are enclosed in mfences
everywhere they are used (and I think most are not). So this is an
existing bug in the kernel.

So again the question comes up -- do we promote these kinds of stores
to be regular x86 citizens, keep the strong memory barriers as they are, and
eat 40 cycles with an sfence before each spin_unlock store; or do we fix the
few users of non-temporal stores and continue with the model we've always
had where stores are in-order? Or I guess the implicit option is to do nothing
until some poor bastard has the pleasure of having to debug some problem.
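
To spell out the hazard with our plain-store spin_unlock (illustrative
pseudo-kernel code, names made up):

	spin_lock(&lock);
	asm volatile("movnti %1, %0" : "=m" (shared->val) : "r" (x));
	/* no sfence here ... */
	spin_unlock(&lock);	/* plain store: does not drain the NT store */
	/* the next lock holder can observe the unlock before shared->val */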

Anyway, just keep in mind that this patch does not introduce any breakage
that isn't already fundamentally there. Sure, it might happen to cause
more individual cases to break in practice, but if they had just been
using real locking instead of explicit barriers, they would be broken
anyway, right? (IOW, any new breakage is already conceptually broken,
even if OK in practice due to the overstrictness of our current
barriers.)


Re: [patch] x86: improved memory barrier implementation

2007-09-28 Thread Alan Cox
> on the broken ppro stores config option if you just tell me what should
> be there (again, remember that my patch isn't actually changing anything
> already there except for smp_rmb side).

The PPro needs rmb to ensure a store doesn't go for a walk on the wild
side and pass the read, especially when we are dealing with device space.
The rest of the stuff is a little vague, which makes me nervous when
considering relaxing the PPro case for smp_rmb. With smp_rmb as it is at
the moment, the PPro is effectively treated as an out-of-order cpu, and so
shouldn't hit anything occasionally that a PPC wouldn't hit every time.

> > - and for modern processors it's still not remotely clear your patch is
> > correct because of NT stores.
> 
> No. We already _have_ to rely on this model for barriers 

Well Linus has dealt with the question of NT stores for us now...

Given this isn't an issue on 64bit I'm inclined to argue that only 64bit
behaviour should be changed at this point.

Alan


Re: [patch] x86: improved memory barrier implementation

2007-09-28 Thread Nick Piggin
On Fri, Sep 28, 2007 at 05:07:19PM +0100, Alan Cox wrote:
> > The only alternative is to assume a weak memory model, and add the required
> > barriers to spin_unlock -- something that has been explicitly avoided, but
> 
> We have the barriers in spin_unlock already for Pentium Pro and IDT
> Winchip systems. The Winchip explicitly supports out-of-order stores (and
> was about 10-20% faster with it set this way), and the PPro has some
> annoying little store-ordering bugs, which is why we xchgb out of
> spin_unlock in those cases.

And we still have the correct smp_wmb barrier for OOOSTORE systems; we
never did any smp_wmb tricks for broken ppros in the barrier code.

IOW, my patch is a complete noop for both (it changes smp_rmb to be a
noop, of course, but AFAIKS that shouldn't be relevant for either
problem -- and if it were, then you'd need to add special ifdefs for
them rather than make everyone else eat proverbial shit, like we do
for spin_unlock).



> > The complaint that broken old ppros get more broken seems to be bogus as well:
> > wmb on i386 never catered to the old ppro to start with. The errata seem to
>
> There are only a couple of awkward cases with the PPro and we carefully
> work around them. The other half of the work is the flush_write_buffers.

Note that for IO, rmb/wmb/mb remain as they are.


> > Buggy ppros: I have no interest in, given the vague errata and apparently
> > unfixable problems with lockless code; anybody is welcome to try adding
> > whatever they like to help them work, but again please don't confuse that
> > with having anything to do with this patch...
>
> PPro was fixed, made to work and brutally tested on quad PPro. There are
> cases that wouldn't work (PPro + AGP which never happens) and there are
> cases that need specific handling (PPro + Voodoo card and some other 3D
> bits) which are handled currently in the X server. The rest is fine.

Aside from the fact that we currently don't do anything for ppro in
smp_wmb *anyway*, I really don't think adding stuff there can fix it:
you can fix critical sections, because you can be sure that all stores
to the same memory regions are strictly fenced inside the locks, except
for the ops on the locks themselves. For those stores, you obviously can
work around the problems by using lock ops.

However, for lockless code, it is entirely possible to do conflicting
unlocked stores to the same location. Memory barriers are never going to
stop this, because they do not provide any synchronisation. They may make
some things less likely or some actual patterns effectively safe...
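
That is, nothing we put in the barrier macros can rescue something like:

	/* CPU 0 */			/* CPU 1 */
	x = 1;				x = 2;
	smp_wmb();			smp_wmb();

The barriers order each CPU's own stores; they do not arbitrate the
conflict, and on a CPU with the store errata apparently even the final
value of x is suspect.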

Anyway, I'm not arguing against adding "stuff" here for it. I repeat:
that is not a concern of this patch.


> > Winchip: can any of these CPUs with ooostores do SMP? If not, then smp_wmb
> > can also be a simple barrier on i386 too.
> 
> The IDT Winchip can do SMP, apparently. Nobody has ever seen an SMP
> winchip board, so I don't think it matters, and even if one existed we
> wouldn't support it anyway.

OK, so in theory we could make smp_wmb just a simple barrier; however, the
ooostore code there is under a config option, so there is no real harm in
keeping it around.


> However 
> - You've not shown the patch has any performance gain

Can I theorise that it will reduce icache misses and be done with? ;)


> - You've probably broken Pentium Pro

Doubt it. It would be really interesting to have an _exact_ trace of
a breakage that isn't fundamentally broken already. I suspect that
information will never come out of intel. But again, if you think we
need some stronger stuff in smp_wmb, I can do a patch that depends
on the broken-ppro-stores config option if you just tell me what should
be there (again, remember that my patch isn't actually changing anything
already there except on the smp_rmb side).


> - and for modern processors it's still not remotely clear your patch is
> correct because of NT stores.

No. We already _have_ to rely on this model for barriers because we
have a lock-less spin_unlock. We can either put explicit mfences around
all stores that aren't executed in-order WRT other stores, or make
spin_unlock use the lock prefix.
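
The two options, concretely (a sketch only, not the actual kernel
definitions):

	/* today: unlock is a plain store; in-order stores make it a release */
	static inline void unlock_plain(volatile unsigned char *slock)
	{
		asm volatile("movb $1, %0" : "=m" (*slock) : : "memory");
	}

	/* alternative: a locked op (xchg is implicitly locked), which
	 * would also fence the exotic stores as discussed */
	static inline void unlock_locked(volatile unsigned char *slock)
	{
		unsigned char one = 1;
		asm volatile("xchgb %0, %1"
			     : "+q" (one), "+m" (*slock) : : "memory");
	}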

We have pretty explicitly decided to do the former, and the fact that
subsequent patches implementing these exotic stores have broken the
memory ordering model in no way invalidates this patch, IMO.


> 
> So NAK

Thanks :) Very nice feedback (I wasn't having a gentle dig at you,
honest!)



Re: [patch] x86: improved memory barrier implementation

2007-09-28 Thread Linus Torvalds


On Fri, 28 Sep 2007, Alan Cox wrote:
> 
> However 
> - You've not shown the patch has any performance gain

It would be nice to see this.

> - You've probably broken Pentium Pro

Probably not a big deal, but yeah, we should have that broken-ppro option.

> - and for modern processors it's still not remotely clear your patch is
> correct because of NT stores.

This one I disagree with. The *old* code doesn't honor NT stores *either*. 

The fact is, if you use NT stores and then depend on ordering, it has 
nothing whatsoever to do with spinlocks or smp_[rw]mb. You need to use
the DMA barriers (ie the non-smp ones).

The non-temporal stores should be basically considered to be "IO", not any 
normal memory operation.

Linus


Re: [patch] x86: improved memory barrier implementation

2007-09-28 Thread Alan Cox
> The only alternative is to assume a weak memory model, and add the required
> barriers to spin_unlock -- something that has been explicitly avoided, but

We have the barriers in spin_unlock already for Pentium Pro and IDT
Winchip systems. The Winchip explicitly supports out-of-order stores (and
was about 10-20% faster with it set this way), and the PPro has some
annoying little store-ordering bugs, which is why we xchgb out of
spin_unlock in those cases.

> The complaint that broken old ppros get more broken seems to be bogus as well:
> wmb on i386 never catered to the old ppro to start with. The errata seem to

There are only a couple of awkward cases with the PPro and we carefully
work around them. The other half of the work is the flush_write_buffers.

> Buggy ppros: I have no interest in, given the vague errata and apparently
> unfixable problems with lockless code; anybody is welcome to try adding
> whatever they like to help them work, but again please don't confuse that
> with having anything to do with this patch...

PPro was fixed, made to work and brutally tested on quad PPro. There are
cases that wouldn't work (PPro + AGP which never happens) and there are
cases that need specific handling (PPro + Voodoo card and some other 3D
bits) which are handled currently in the X server. The rest is fine.

> Winchip: can any of these CPUs with ooostores do SMP? If not, then smp_wmb
> can also be a simple barrier on i386 too.

The IDT Winchip can do SMP, apparently. Nobody has ever seen an SMP
winchip board, so I don't think it matters, and even if one existed we
wouldn't support it anyway.

However:
- You've not shown the patch has any performance gain
- You've probably broken Pentium Pro
- and for modern processors it's still not remotely clear your patch is
correct because of NT stores.

So NAK

Alan


[patch] x86: improved memory barrier implementation

2007-09-28 Thread Nick Piggin
According to the latest memory ordering specification documents from Intel
and AMD, both manufacturers are committed to in-order loads from cacheable
memory for the x86 architecture. Hence, smp_rmb() may be a simple barrier.

Also according to those documents, and according to existing practice in Linux
(e.g. spin_unlock doesn't enforce ordering), stores to cacheable memory are
visible in program order too. Hence, smp_wmb() may be a simple barrier.
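
For reference, barrier() is the compiler-only fence -- the same thing wmb()
already compiles to on i386 without the out-of-order-store option. It
constrains gcc and emits no instruction:

	#define barrier() __asm__ __volatile__("": : :"memory")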

http://developer.intel.com/products/processor/manuals/318147.pdf
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf

Whether or not this noticeably helps performance anywhere is not known. My
primary motivation for this is to have a canonical barrier implementation
for the x86 architecture.

There is some controversy about whether this is valid in the presence of
string operations or non-temporal stores etc., which may use different memory
ordering rules. However, those would _already_ be broken by non-ordering
spinlocks and, as such, those operations must carry their own ordering code
so that they appear to execute in-order (though many places don't seem to,
e.g. rep stosb in copy*user*).

The only alternative is to assume a weak memory model, and add the required
barriers to spin_unlock -- something that has been explicitly avoided, but
_may_ be a better way to go? However, unless my logic is broken, it is
nonsense to have our barrier primitives assume a weakly ordered model while
our spinlocks assume a strongly ordered one (weird bugs in simple spinlocked
code, but complex lockless code works fine?!).

The complaint that broken old ppros get more broken seems to be bogus as
well: wmb on i386 never catered to the old ppro to start with. The errata
seem to indicate that lockless code is going to be broken by definition
regardless of anything we do, short of adding a lock prefix to all
potentially concurrent memory operations. As a band-aid to reduce the
chance of a bug, it may be valid to do a dummy lock op here -- the public
docs are too vague to really know -- however, that is a separate patch,
completely independent of this one.

So: rather than a vague nack, I'd really like to either have a hole shown
in my logic, or a counter proposal for a plan to "unconstrain" the exotic
memory ops like rep stos / nt operations, and actually have spinlocks and
mbs do the required barriers.

Buggy ppros: I have no interest in, given the vague errata and apparently
unfixable problems with lockless code; anybody is welcome to try adding
whatever they like to help them work, but again please don't confuse that
with having anything to do with this patch...

Winchip: can any of these CPUs with ooostores do SMP? If not, then smp_wmb
can also be a simple barrier on i386 too.

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>

---
Index: linux-2.6/include/asm-i386/system.h
===================================================================
--- linux-2.6.orig/include/asm-i386/system.h
+++ linux-2.6/include/asm-i386/system.h
@@ -281,12 +281,12 @@ static inline unsigned long get_limit(un
but make it already an possibility. */
 #define wmb() alternative("lock; addl $0,0(%%esp)", "sfence", X86_FEATURE_XMM)
 #else
-#define wmb()  __asm__ __volatile__ ("": : :"memory")
+#define wmb() barrier()
 #endif
 
 #ifdef CONFIG_SMP
 #define smp_mb()   mb()
-#define smp_rmb()  rmb()
+#define smp_rmb()  barrier()
 #define smp_wmb()  wmb()
 #define smp_read_barrier_depends() read_barrier_depends()
 #define set_mb(var, value) do { (void) xchg(&var, value); } while (0)
Index: linux-2.6/include/asm-x86_64/system.h
===================================================================
--- linux-2.6.orig/include/asm-x86_64/system.h
+++ linux-2.6/include/asm-x86_64/system.h
@@ -141,8 +141,8 @@ static inline void write_cr8(unsigned lo
 
 #ifdef CONFIG_SMP
 #define smp_mb()   mb()
-#define smp_rmb()  rmb()
-#define smp_wmb()  wmb()
+#define smp_rmb()  barrier()
+#define smp_wmb()  barrier()
 #define smp_read_barrier_depends() do {} while(0)
 #else
 #define smp_mb()   barrier()