Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics

2023-07-04 Thread Paul E. McKenney via Gcc
On Tue, Jul 04, 2023 at 04:25:45PM -0400, Alan Stern wrote:
> On Tue, Jul 04, 2023 at 01:19:23PM -0400, Olivier Dion wrote:

[ . . . ]

> > I am puzzled by this.  Initialization of a shared variable does not need
> > to be atomic until its publication.  Could you expand on this?
> 
> In the kernel, I believe it sometimes happens that an atomic variable 
> may be published before it is initialized.  (If that's wrong, Paul or 
> Peter can correct me.)  But since this doesn't apply to the situations 
> you're concerned with, you can forget I mentioned it.

Both use cases exist.

A global atomic is implicitly published at compile time.  If the desired
initial value is not known until multiple threads are running, then it
is necessary to be careful.  Hence double-check locking and its various
replacements.  (Clearly, if you can determine the initial value before
going multithreaded, life is simpler.)
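As a concrete sketch of the double-check pattern mentioned above (names such as gp, lk, and get_instance are illustrative, not from the kernel; the spinlock stands in for a real mutex):

```c
#include <stdatomic.h>
#include <stdlib.h>

static _Atomic(int *) gp;                  /* implicitly published as NULL at load time */
static atomic_flag lk = ATOMIC_FLAG_INIT;  /* spinlock stand-in for a mutex */

int *get_instance(void)
{
	int *p = atomic_load_explicit(&gp, memory_order_acquire);

	if (p)
		return p;                  /* fast path: already initialized */
	while (atomic_flag_test_and_set_explicit(&lk, memory_order_acquire))
		;                          /* spin until we hold the lock */
	p = atomic_load_explicit(&gp, memory_order_relaxed);
	if (!p) {                          /* double check under the lock */
		p = malloc(sizeof(*p));
		*p = 42;                   /* initialize while still private */
		atomic_store_explicit(&gp, p, memory_order_release);
	}
	atomic_flag_clear_explicit(&lk, memory_order_release);
	return p;
}
```

The release store is what makes the initialization visible before the pointer itself becomes non-NULL to other threads.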

And dynamically allocated or on-stack storage is the other case, where
there is a point in time when the storage is private even after multiple
threads are running.
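A minimal sketch of this second case, where heap storage stays private until an explicit publication step (struct foo and the function names are illustrative):

```c
#include <stdatomic.h>
#include <stdlib.h>

struct foo { int a; int b; };

static _Atomic(struct foo *) fp;   /* NULL until publication */

void publish_foo(void)
{
	struct foo *q = malloc(sizeof(*q));

	q->a = 1;                  /* plain stores: q is still thread-private */
	q->b = 2;
	atomic_store_explicit(&fp, q, memory_order_release);  /* publication */
}

int read_foo_a(void)
{
	/* consume (promoted to acquire by current compilers) orders the
	   dependent field loads after the pointer load */
	struct foo *p = atomic_load_explicit(&fp, memory_order_consume);

	return p ? p->a : -1;
}
```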

Or am I missing the point?

Thanx, Paul


Re: [PATCH] Adding _Dependent_ptr type qualifier in C part 1/3

2019-08-16 Thread Paul E. McKenney
And please see the following URL for an update:

http://www.rdrop.com/~paulmck/submission/D%20_Dependent_ptr%20to%20simplify%20carries%20a%20dependency.pdf

This adds a description of conversions, warnings, and patches.

Thanx, Paul

On Sun, Aug 04, 2019 at 04:11:12PM -0700, Paul E. McKenney wrote:
> Good points!
> 
> On the type-qualifier interactions, here is an initial list.  Thoughts?
> 
>   Thanx, Paul
> 
> _Alignas: Dependency-breaking optimizations would be avoided, and the
> variable would be aligned as specified.
> 
> _Atomic: Dependency-breaking optimizations would be avoided, and the
> variable would be accessed using C11 atomics.
> 
> const: This is not particularly useful for variables with static storage
> duration because compile-time initialization does not require dependency
> ordering, but then again, use of _Dependent_ptr on such variables is
> suspect to begin with.  Otherwise, the const _Dependent_ptr variable
> would normally be initialized from another _Dependent_ptr variable or
> from a memory_order_consume load.  The variable would disallow further
> stores and avoid breaking dependencies.
> 
> extern: Dependency-breaking optimizations would be avoided, and the
> variable would be usable within other translation units.  This is
> also an unusual addition to a _Dependent_ptr unless also accompanied by
> _Thread_local because there are no known non-suspect multi-threaded-access
> use cases for _Dependent_ptr.
> 
> register: Dependency-breaking optimizations would be avoided, and the
> compiler would be given a hint to keep the variable in a register.
> 
> restrict: Dependency-breaking optimizations would be avoided, and the
> compiler may assume that the pointed-to object is only accessed through
> this pointer and through pointers derived from it.
> 
> static: Dependency-breaking optimizations would be avoided, and the
> variable would be static.  This is also an unusual addition to a
> _Dependent_ptr unless also accompanied by _Thread_local because there are
> no known non-suspect multi-threaded-access use cases for _Dependent_ptr.
> 
> _Thread_local: The dependency-carrying variable is thread-local, and
> avoids dependency-breaking optimizations.
> 
> volatile: All accesses would be executed as per the abstract machine,
> and dependency-breaking optimizations would be avoided.
> 
> On Fri, Aug 02, 2019 at 08:30:49PM -0600, Martin Sebor wrote:
> > On 8/1/19 9:26 AM, Paul McKenney wrote:
> > >Excellent point, this discussion needs to be made official.
> > >Please see attached for an initial draft of a working paper.
> > >
> > >Thoughts?
> > 
> > The draft is a start but it (obviously) needs a lot of work to cover
> > the constraints and semantics so I'll just comment on the patch in
> > a bit more detail.
> > 
> > Since the feature is being (or will be) proposed to WG14 and could
> > change I would expect it to be only available under -std=c2x and
> > disabled otherwise, so that code that makes use of it does so with
> > the understanding that it's only experimental.  (I just noticed
> > Akshat's email with a tweak to the patch.  I'm not sure that issuing
> > a pedantic warning and having the name in the implementation namespace
> > is enough but others (the C FE maintainers) will know better.)
> > 
> > Other than that, I would also expect to see more extensive test
> > coverage, at a minimum to exercise error detection (invalid uses,
> > conversions, etc.).  For example, I see the code that rejects it
> > but no tests for declaring, say, a _Dependent_ptr-qualified integer.
> > What I don't think I see is code that rejects _Dependent_ptr-qualified
> > function pointers (AFAIK, POINTER_TYPE_P evaluates to nonzero for both).
> > Is that intended?  (And if so, what does it mean?)  I also see that
> > the new tests look for warnings but it's not clear to me that that's
> > their intent or that the dg-warning directives are being used to
> > "suppress" warnings for issues in the tests.  For instance, this
> > is common:
> > 
> >   rcu_assign_pointer (gp,p);/* { dg-warning
> > "\\\[-Wincompatible-pointer-types]" } */
> > 
> > but neither gp nor p is a _Dependent_ptr.  (I may be missing
> > the purpose of the compile-only tests.  E.g., the only thing
> > p0190r4_fig10.c seems to be exercising besides that the _Dependent_ptr
> > qualifier is accepted is a couple of warnings, one of which is the one
> > above.)  I would suggest to look at tests for other qualifiers for
> > examples and model the new ones after those.

Re: [PATCH] Adding _Dependent_ptr type qualifier in C part 1/3

2019-08-04 Thread Paul E. McKenney
Good points!

On the type-qualifier interactions, here is an initial list.  Thoughts?

Thanx, Paul

_Alignas: Dependency-breaking optimizations would be avoided, and the
variable would be aligned as specified.

_Atomic: Dependency-breaking optimizations would be avoided, and the
variable would be accessed using C11 atomics.

const: This is not particularly useful for variables with static storage
duration because compile-time initialization does not require dependency
ordering, but then again, use of _Dependent_ptr on such variables is
suspect to begin with.  Otherwise, the const _Dependent_ptr variable
would normally be initialized from another _Dependent_ptr variable or
from a memory_order_consume load.  The variable would disallow further
stores and avoid breaking dependencies.

extern: Dependency-breaking optimizations would be avoided, and the
variable would be usable within other translation units.  This is
also an unusual addition to a _Dependent_ptr unless also accompanied by
_Thread_local because there are no known non-suspect multi-threaded-access
use cases for _Dependent_ptr.

register: Dependency-breaking optimizations would be avoided, and the
compiler would be given a hint to keep the variable in a register.

restrict: Dependency-breaking optimizations would be avoided, and the
compiler may assume that the pointed-to object is only accessed through
this pointer and through pointers derived from it.

static: Dependency-breaking optimizations would be avoided, and the
variable would be static.  This is also an unusual addition to a
_Dependent_ptr unless also accompanied by _Thread_local because there are
no known non-suspect multi-threaded-access use cases for _Dependent_ptr.

_Thread_local: The dependency-carrying variable is thread-local, and
avoids dependency-breaking optimizations.

volatile: All accesses would be executed as per the abstract machine,
and dependency-breaking optimizations would be avoided.
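The combinations above can be written down concretely. Since _Dependent_ptr is a proposed qualifier not implemented in any released compiler, the sketch below follows the prototyping idea discussed elsewhere in this thread and substitutes volatile via a macro; everything else is standard C11:

```c
/* Prototyping stand-in only: volatile models _Dependent_ptr's
   "avoid dependency-breaking optimizations" property, at the cost of
   suppressing far more optimization than the real qualifier would. */
#define _Dependent_ptr volatile

struct foo { int a; };

static struct foo f0 = { .a = 5 };

/* const: the pointer itself may not be re-aimed; dependency still carried */
static struct foo * const _Dependent_ptr cp = &f0;

/* _Thread_local: the per-thread case called non-suspect above */
static _Thread_local struct foo * _Dependent_ptr tp;

/* _Alignas: alignment as specified, dependency preserved */
_Alignas(16) static struct foo * _Dependent_ptr ap;

/* restrict: no other access path to the pointed-to object */
int load_a(struct foo * _Dependent_ptr restrict p)
{
	return p->a;
}
```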

On Fri, Aug 02, 2019 at 08:30:49PM -0600, Martin Sebor wrote:
> On 8/1/19 9:26 AM, Paul McKenney wrote:
> >Excellent point, this discussion needs to be made official.
> >Please see attached for an initial draft of a working paper.
> >
> >Thoughts?
> 
> The draft is a start but it (obviously) needs a lot of work to cover
> the constraints and semantics so I'll just comment on the patch in
> a bit more detail.
> 
> Since the feature is being (or will be) proposed to WG14 and could
> change I would expect it to be only available under -std=c2x and
> disabled otherwise, so that code that makes use of it does so with
> the understanding that it's only experimental.  (I just noticed
> Akshat's email with a tweak to the patch.  I'm not sure that issuing
> a pedantic warning and having the name in the implementation namespace
> is enough but others (the C FE maintainers) will know better.)
> 
> Other than that, I would also expect to see more extensive test
> coverage, at a minimum to exercise error detection (invalid uses,
> conversions, etc.).  For example, I see the code that rejects it
> but no tests for declaring, say, a _Dependent_ptr-qualified integer.
> What I don't think I see is code that rejects _Dependent_ptr-qualified
> function pointers (AFAIK, POINTER_TYPE_P evaluates to nonzero for both).
> Is that intended?  (And if so, what does it mean?)  I also see that
> the new tests look for warnings but it's not clear to me that that's
> their intent or that the dg-warning directives are being used to
> "suppress" warnings for issues in the tests.  For instance, this
> is common:
> 
>   rcu_assign_pointer (gp,p);  /* { dg-warning
> "\\\[-Wincompatible-pointer-types]" } */
> 
> but neither gp nor p is a _Dependent_ptr.  (I may be missing
> the purpose of the compile-only tests.  E.g., the only thing
> p0190r4_fig10.c seems to be exercising besides that the _Dependent_ptr
> qualifier is accepted is a couple of warnings, one of which is the one
> above.)  I would suggest to look at tests for other qualifiers for
> examples and model the new ones after those.
> 
> I'm also wondering how the new qualifier interacts with others like
> const.  Should all combinations of qualifiers be accepted and do
> they affect the semantics in any interesting way?  This is probably
> something to cover in the proposal.
> 
> Martin
> 
> >
> >  Thanx, Paul
> >
> >On Tue, Jul 30, 2019 at 12:46 PM Martin Sebor  wrote:
> >>
> >>On 7/30/19 1:13 AM, Akshat Garg wrote:
> >>>Hi,
> >>>This patch includes C front-end code for a type qualifier _Dependent_ptr.
> >>
> >>Just some very high-level comments/questions.  I only followed
> >>the _Dependent_ptr discussion from a distance and I'm likely
> >>missing some context so the first thing I looked for in this
> >>patch is documentation of the new qualifier.  Unless it's
> >>a proposed C2x feature that I missed I would expect to find it
> >>in section 6 - Extensions to 

Re: Doubts regarding the _Dependent_ptr keyword

2019-07-04 Thread Paul E. McKenney
On Thu, Jul 04, 2019 at 01:00:18PM +0200, Richard Biener wrote:
> On Wed, Jul 3, 2019 at 6:33 PM Paul E. McKenney  wrote:
> >
> > On Wed, Jul 03, 2019 at 05:47:56PM +0200, Richard Biener wrote:
> > > On July 3, 2019 5:14:58 PM GMT+02:00, "Paul E. McKenney" 
> > >  wrote:
> > > >On Wed, Jul 03, 2019 at 12:39:41AM +0530, Akshat Garg wrote:
> > > >> On Tue, Jul 2, 2019 at 8:40 PM Paul E. McKenney
> > > >
> > > >> wrote:
> > > >>
> > > >> > On Tue, Jul 02, 2019 at 02:15:55PM +0100, Ramana Radhakrishnan
> > > >wrote:
> > > >> > > On Tue, Jul 2, 2019 at 1:38 PM Paul E. McKenney
> > > >
> > > >> > wrote:
> > > >> > >
> > > >> > > >
> > > >> > > > Once a user-created non-dependent pointer is assigned to, it is
> > > >OK to
> > > >> > > > break the dependency.
> > > >> > >
> > > >> > > Ok, that's good.
> > > >> > > >
> > > >> > > > Or am I missing the point here?
> > > >> > >
> > > >> > > I was just trying to make sure we were on the same page. I wonder
> > > >if
> > > >> > > marking this volatile would be sufficient for prototyping. I
> > > >suspect
> > > >> > > we would need another flag somewhere which someone with gimple
> > > >> > > knowledge might be able to help us with.
> > > >> >
> > > >> > I expect that marking it as volatile would do the trick.  ;-)
> > > >> >
> > > >> > Thanx, Paul
> > > >> >
> > > >> So, marking this pointer as volatile will not allow the compiler to
> > > >> modify/optimize the statements, the pointer is appearing in. And we
> > > >don't
> > > >> need to push any other code inside any of the passes. Due to this, we
> > > >want
> > > >> to automatically say those dependent pointers are volatile and
> > > >introduce a
> > > >> new flag for this. Am I getting you guys correctly? Kindly, let me
> > > >know?
> > > >
> > > >While I suspect that this might work, it would suppress way more
> > > >optimizations than would be good.  For but one example, consider:
> > > >
> > > > _Dependent_ptr int *p;
> > > >
> > > > p = atomic_load_explicit(gp, memory_order_consume);
> > > > a = p->a;
> > > > b = p->b;
> > > >
> > > >If "p" is volatile, then the compiler will be prevented from keeping
> > > >it in a register, which would not make people coding fastpaths at
> > > >all happy.  ;-)
> > > >
> > > >Still, use of volatile might be a good technique for prototyping and
> > > >analysis of _Dependent_ptr.
> > >
> > > With this example can you quickly summarize what kind of guarantees
> > > _Dependent_ptr gives and how a compiler
> > > could possibly break those?
> >
> > First I suppose I should fix the bug in the above code.  Or one of the
> > bugs, at least.  :-/
> >
> > struct foo {
> > int a;
> > int b;
> > };
> >
> > _Dependent_ptr struct foo *p;
> >
> > p = atomic_load_explicit(gp, memory_order_consume);
> > a = p->a;
> > b = p->b;
> >
> > And then let me tweak the example a bit.  For the first tweak:
> >
> > struct foo {
> > int a;
> > int b;
> > };
> >
> > struct foo default_foo = { .a = 42, .b = 43 };
> > int *gp = &default_foo;
> >
> > ...
> >
> > _Dependent_ptr int *p;
> >
> > p = atomic_load_explicit(gp, memory_order_consume);
> > a = p->a;
> > b = p->b;
> >
> > Suppose that the compiler used feedback-driven optimization, and noticed
> > that the value of gp was almost always &default_foo.  The compiler might
> > decide to transform the last three lines as follows:
> >
> > p = atomic_load_explicit(gp, memory_order_consume);
> > if (p == &default_foo) {
> > a = default_foo.a;
> > b = default_foo.b;
> > 

Re: Doubts regarding the _Dependent_ptr keyword

2019-07-03 Thread Paul E. McKenney
On Wed, Jul 03, 2019 at 05:47:56PM +0200, Richard Biener wrote:
> On July 3, 2019 5:14:58 PM GMT+02:00, "Paul E. McKenney" 
>  wrote:
> >On Wed, Jul 03, 2019 at 12:39:41AM +0530, Akshat Garg wrote:
> >> On Tue, Jul 2, 2019 at 8:40 PM Paul E. McKenney
> >
> >> wrote:
> >> 
> >> > On Tue, Jul 02, 2019 at 02:15:55PM +0100, Ramana Radhakrishnan
> >wrote:
> >> > > On Tue, Jul 2, 2019 at 1:38 PM Paul E. McKenney
> >
> >> > wrote:
> >> > >
> >> > > >
> >> > > > Once a user-created non-dependent pointer is assigned to, it is
> >OK to
> >> > > > break the dependency.
> >> > >
> >> > > Ok, that's good.
> >> > > >
> >> > > > Or am I missing the point here?
> >> > >
> >> > > I was just trying to make sure we were on the same page. I wonder
> >if
> >> > > marking this volatile would be sufficient for prototyping. I
> >suspect
> >> > > we would need another flag somewhere which someone with gimple
> >> > > knowledge might be able to help us with.
> >> >
> >> > I expect that marking it as volatile would do the trick.  ;-)
> >> >
> >> > Thanx, Paul
> >> >
> >> So, marking this pointer as volatile will not allow the compiler to
> >> modify/optimize the statements, the pointer is appearing in. And we
> >don't
> >> need to push any other code inside any of the passes. Due to this, we
> >want
> >> to automatically say those dependent pointers are volatile and
> >introduce a
> >> new flag for this. Am I getting you guys correctly? Kindly, let me
> >know?
> >
> >While I suspect that this might work, it would suppress way more
> >optimizations than would be good.  For but one example, consider:
> >
> > _Dependent_ptr int *p;
> >
> > p = atomic_load_explicit(gp, memory_order_consume);
> > a = p->a;
> > b = p->b;
> >
> >If "p" is volatile, then the compiler will be prevented from keeping
> >it in a register, which would not make people coding fastpaths at
> >all happy.  ;-)
> >
> >Still, use of volatile might be a good technique for prototyping and
> >analysis of _Dependent_ptr.
> 
> With this example can you quickly summarize what kind of guarantees
> _Dependent_ptr gives and how a compiler
> could possibly break those?

First I suppose I should fix the bug in the above code.  Or one of the
bugs, at least.  :-/

struct foo {
int a;
int b;
};

_Dependent_ptr struct foo *p;

p = atomic_load_explicit(gp, memory_order_consume);
a = p->a;
b = p->b;

And then let me tweak the example a bit.  For the first tweak:

struct foo {
int a;
int b;
};

struct foo default_foo = { .a = 42, .b = 43 };
int *gp = &default_foo;

...

_Dependent_ptr int *p;

p = atomic_load_explicit(gp, memory_order_consume);
a = p->a;
b = p->b;

Suppose that the compiler used feedback-driven optimization, and noticed
that the value of gp was almost always &default_foo.  The compiler might
decide to transform the last three lines as follows:

p = atomic_load_explicit(gp, memory_order_consume);
if (p == &default_foo) {
a = default_foo.a;
b = default_foo.b;
} else {
a = p->a;
b = p->b;
}

Now, as long as the value of gp had remained &default_foo for the full
duration of execution, no harm done.  But suppose the following code
was executing concurrently with the above transformed code:

struct foo *q;

q = malloc(sizeof(*q));
assert(q);
q->a = 1729;
q->b = 1730;
atomic_store_explicit(gp, q, memory_order_release);
do_something();
default_foo.a = 1;
default_foo.b = 2;
atomic_store_explicit(gp, &default_foo, memory_order_release);

In this case, if the memory_order_consume() came just after the pointer
was reset to &default_foo, it is possible that the transformed code
would set "a" to 42 and "b" to 43, which might not be what the guy
writing the code wanted to happen.

One of the purposes of _Dependent_ptr is to prevent this transformation.

This transformation can also happen if the developer's code contained a
comparison to &default_foo -- an ARM or PowerPC compiler backend, upon
seeing two pointers containing the same bits, would likely consider the
two pointers as being interchangeable.

Re: Doubts regarding the _Dependent_ptr keyword

2019-07-03 Thread Paul E. McKenney
On Wed, Jul 03, 2019 at 12:39:41AM +0530, Akshat Garg wrote:
> On Tue, Jul 2, 2019 at 8:40 PM Paul E. McKenney 
> wrote:
> 
> > On Tue, Jul 02, 2019 at 02:15:55PM +0100, Ramana Radhakrishnan wrote:
> > > On Tue, Jul 2, 2019 at 1:38 PM Paul E. McKenney 
> > wrote:
> > >
> > > >
> > > > Once a user-created non-dependent pointer is assigned to, it is OK to
> > > > break the dependency.
> > >
> > > Ok, that's good.
> > > >
> > > > Or am I missing the point here?
> > >
> > > I was just trying to make sure we were on the same page. I wonder if
> > > marking this volatile would be sufficient for prototyping. I suspect
> > > we would need another flag somewhere which someone with gimple
> > > knowledge might be able to help us with.
> >
> > I expect that marking it as volatile would do the trick.  ;-)
> >
> > Thanx, Paul
> >
> So, marking this pointer as volatile will not allow the compiler to
> modify/optimize the statements the pointer appears in. And we don't
> need to push any other code inside any of the passes. Because of this, we
> want to automatically treat those dependent pointers as volatile and
> introduce a new flag for this. Am I understanding you correctly? Kindly
> let me know.

While I suspect that this might work, it would suppress way more
optimizations than would be good.  For but one example, consider:

_Dependent_ptr int *p;

p = atomic_load_explicit(gp, memory_order_consume);
a = p->a;
b = p->b;

If "p" is volatile, then the compiler will be prevented from keeping
it in a register, which would not make people coding fastpaths at
all happy.  ;-)

Still, use of volatile might be a good technique for prototyping and
analysis of _Dependent_ptr.

Thanx, Paul

> Akshat
> 
> >
> > > regards
> > > Ramana
> > >
> > > >
> > > > Thanx, Paul
> > > >
> > > > > Ramana
> > > > > >>
> > > > > >>
> > > > > >> > Does this sounds like a workable plan for ? Let me know your
> > thoughts. If this sounds good then, we can do this for all the
> > optimizations that may kill the dependencies at somepoint.
> > > > > >> >
> > > > > >> > -Akshat
> > > > >
> > > >
> > >
> >
> >



Re: Doubts regarding the _Dependent_ptr keyword

2019-07-03 Thread Paul E. McKenney
On Tue, Jul 02, 2019 at 07:53:20PM +0200, Richard Biener wrote:
> On July 2, 2019 5:36:08 PM GMT+02:00, Jason Merrill  wrote:
> >On Mon, Jul 1, 2019 at 8:59 PM Paul E. McKenney 
> >wrote:
> >>
> >> On Tue, Jul 02, 2019 at 05:58:48AM +0530, Akshat Garg wrote:
> >> > On Tue, Jun 25, 2019 at 9:49 PM Akshat Garg 
> >wrote:
> >> >
> >> > > On Tue, Jun 25, 2019 at 4:04 PM Ramana Radhakrishnan <
> >> > > ramana@googlemail.com> wrote:
> >> > >
> >> > >> On Tue, Jun 25, 2019 at 11:03 AM Akshat Garg 
> >wrote:
> >> > >> >
> >> > >> > As we have some working front-end code for _Dependent_ptr,
> >What should
> >> > >> we do next? What I understand, we can start adding the library
> >for
> >> > >> dependent_ptr and its functions for C corresponding to the ones
> >we created
> >> > >> as C++ template library. Then, after that, we can move on to
> >generating the
> >> > >> assembly code part.
> >> > >> >
> >> > >>
> >> > >>
> >> > >> I think the next step is figuring out how to model the Dependent
> >> > >> pointer information in the IR and figuring out what
> >optimizations to
> >> > >> allow or not with that information. At this point , I suspect we
> >need
> >> > >> a plan on record and have the conversation upstream on the
> >lists.
> >> > >>
> >> > >> I think we need to put down a plan on record.
> >> > >>
> >> > >> Ramana
> >> > >
> >> > > [CCing gcc mailing list]
> >> > >
> >> > > So, shall I start looking over the pointer optimizations only and
> >see what
> >> > > information we may be needed on the same examples in the IR
> >itself?
> >> > >
> >> > > - Akshat
> >> > >
> >> > I have coded an example where equality comparison kills dependency
> >from the
> >> > document P0190R4 as shown below :
> >> >
> >> > 1. struct rcutest rt = {1, 2, 3};
> >> > 2. void thread0 ()
> >> > 3. {
> >> > 4. rt.a = -42;
> >> > 5. rt.b = -43;
> >> > 6. rt.c = -44;
> >> > 7. rcu_assign_pointer(gp, &rt);
> >> > 8. }
> >> > 9.
> >> > 10. void thread1 ()
> >> > 11. {
> >> > 12.int i = -1;
> >> > 13.int j = -1;
> >> > 14._Dependent_ptr struct rcutest *p;
> >> > 15.
> >> > 16.p = rcu_dereference(gp);
> >> > 17.j = p->a;
> >> > 18.   if (p == &rt)
> >> > 19.i = p->b;  /*Dependency breaking point*/
> >> > 20.   else if(p)
> >> > 21.   i = p->c;
> >> > 22.   assert(i<0);
> >> > 23.   assert(j<0);
> >> > 24. }
> >> > The gimple unoptimized code produced for lines 17-24 is shown below
> >> >
> >> > 1. if (p_16 == &rt)
> >> > 2. goto ; [INV]
> >> > 3.   else
> >> > 4.goto ; [INV]
> >> > 5.
> >> > 6.   :
> >> > 7.  i_19 = p_16->b;
> >> > 8.  goto ; [INV]
> >> > 9.
> >> > 10.   :
> >> > 11.  if (p_16 != 0B)
> >> > 12.goto ; [INV]
> >> > 13.  else
> >> > 14.goto ; [INV]
> >> > 15.
> >> > 16.   :
> >> > 17.  i_18 = p_16->c;
> >> > 18.
> >> > 19.   :
> >> > 20.  # i_7 = PHI 
> >> > 21.  _3 = i_7 < 0;
> >> > 22.  _4 = (int) _3;
> >> > 23.  assert (_4);
> >> > 24.  _5 = j_17 < 0;
> >> > 25.  _6 = (int) _5;
> >> > 26.  assert (_6);
> >> > 27.  return;
> >> >
> >> > The optimized code after -O1 is applied for the same lines is shown
> >> > below:
> >> >
> >> > 1. if (_2 == &rt)
> >> > 2.goto ; [30.00%]
> >> > 3. else
> >> > 4.goto ; [70.00%]
> >> > 5.
> >> > 6.   [local count: 322122547]:
> >> > 7.   i_12 = rt.b;
> >> > 8.   goto ; [100.00%]
> >> > 9.
> >> > 10.   [local count: 751619277]:
> >> > 11.   if (_1 != 0)
> >> > 12.   goto ; [50.00%]
> >> > 13.   else
> >> > 14.goto ;

Re: Doubts regarding the _Dependent_ptr keyword

2019-07-02 Thread Paul E. McKenney
On Tue, Jul 02, 2019 at 02:15:55PM +0100, Ramana Radhakrishnan wrote:
> On Tue, Jul 2, 2019 at 1:38 PM Paul E. McKenney  wrote:
> 
> >
> > Once a user-created non-dependent pointer is assigned to, it is OK to
> > break the dependency.
> 
> Ok, that's good.
> >
> > Or am I missing the point here?
> 
> I was just trying to make sure we were on the same page. I wonder if
> marking this volatile would be sufficient for prototyping. I suspect
> we would need another flag somewhere which someone with gimple
> knowledge might be able to help us with.

I expect that marking it as volatile would do the trick.  ;-)

Thanx, Paul

> regards
> Ramana
> 
> >
> > Thanx, Paul
> >
> > > Ramana
> > > >>
> > > >>
> > > >> > Does this sounds like a workable plan for ? Let me know your 
> > > >> > thoughts. If this sounds good then, we can do this for all the 
> > > >> > optimizations that may kill the dependencies at somepoint.
> > > >> >
> > > >> > -Akshat
> > >
> >
> 



Re: Doubts regarding the _Dependent_ptr keyword

2019-07-02 Thread Paul E. McKenney
On Tue, Jul 02, 2019 at 12:01:00PM +0100, Ramana Radhakrishnan wrote:
> >>
> >> It's worth figuring out what passes are doing this - however the worry
> >> I have is that every pass now needs to be handling this case with
> >> respect to pointer attributes. Is there some place that you are
> >> storing said information and what is the transitive nature of
> >> assignments with these attributes ?
> >>
> >> regards
> >> Ramana
> >
> > I don't understand what information to store, can you explain? I was 
> > thinking of putting some code inside the pass, which breaks the dependency 
> > chains, which will avoid this type of conversions when it finds out  that 
> > the pointer is _Dependent_ptr qualified otherwise don't do anything. Can 
> > you let me know what I am missing on?
> >
> 
> That the pointer has an attribute that it's a dependent pointer . How
> do you expect the later passes to detect it has a dependent_ptr
> attribute attached to it ?
> 
> > In transitive nature, are you talking about the dependent pointer being 
> > assigned to some non-dependent pointer and after further assigned to some 
> > other pointer?
> 
> Yes. What's the expected behaviour as per the document ?

Once a user-created non-dependent pointer is assigned to, it is OK to
break the dependency.

Or am I missing the point here?

Thanx, Paul

> Ramana
> >>
> >>
> >> > Does this sounds like a workable plan for ? Let me know your thoughts. 
> >> > If this sounds good then, we can do this for all the optimizations that 
> >> > may kill the dependencies at somepoint.
> >> >
> >> > -Akshat
> 



Re: Doubts regarding the _Dependent_ptr keyword

2019-07-01 Thread Paul E. McKenney
On Tue, Jul 02, 2019 at 05:58:48AM +0530, Akshat Garg wrote:
> On Tue, Jun 25, 2019 at 9:49 PM Akshat Garg  wrote:
> 
> > On Tue, Jun 25, 2019 at 4:04 PM Ramana Radhakrishnan <
> > ramana@googlemail.com> wrote:
> >
> >> On Tue, Jun 25, 2019 at 11:03 AM Akshat Garg  wrote:
> >> >
> >> > As we have some working front-end code for _Dependent_ptr, What should
> >> we do next? What I understand, we can start adding the library for
> >> dependent_ptr and its functions for C corresponding to the ones we created
> >> as C++ template library. Then, after that, we can move on to generating the
> >> assembly code part.
> >> >
> >>
> >>
> >> I think the next step is figuring out how to model the Dependent
> >> pointer information in the IR and figuring out what optimizations to
> >> allow or not with that information. At this point , I suspect we need
> >> a plan on record and have the conversation upstream on the lists.
> >>
> >> I think we need to put down a plan on record.
> >>
> >> Ramana
> >
> > [CCing gcc mailing list]
> >
> > So, shall I start looking over the pointer optimizations only and see what
> > information we may be needed on the same examples in the IR itself?
> >
> > - Akshat
> >
> I have coded an example where equality comparison kills dependency from the
> document P0190R4 as shown below :
> 
> 1. struct rcutest rt = {1, 2, 3};
> 2. void thread0 ()
> 3. {
> 4. rt.a = -42;
> 5. rt.b = -43;
> 6. rt.c = -44;
> 7. rcu_assign_pointer(gp, &rt);
> 8. }
> 9.
> 10. void thread1 ()
> 11. {
> 12.int i = -1;
> 13.int j = -1;
> 14._Dependent_ptr struct rcutest *p;
> 15.
> 16.p = rcu_dereference(gp);
> 17.j = p->a;
> 18.   if (p == &rt)
> 19.i = p->b;  /*Dependency breaking point*/
> 20.   else if(p)
> 21.   i = p->c;
> 22.   assert(i<0);
> 23.   assert(j<0);
> 24. }
> The gimple unoptimized code produced for lines 17-24 is shown below
> 
> 1. if (p_16 == &rt)
> 2. goto ; [INV]
> 3.   else
> 4.goto ; [INV]
> 5.
> 6.   :
> 7.  i_19 = p_16->b;
> 8.  goto ; [INV]
> 9.
> 10.   :
> 11.  if (p_16 != 0B)
> 12.goto ; [INV]
> 13.  else
> 14.goto ; [INV]
> 15.
> 16.   :
> 17.  i_18 = p_16->c;
> 18.
> 19.   :
> 20.  # i_7 = PHI 
> 21.  _3 = i_7 < 0;
> 22.  _4 = (int) _3;
> 23.  assert (_4);
> 24.  _5 = j_17 < 0;
> 25.  _6 = (int) _5;
> 26.  assert (_6);
> 27.  return;
> 
> The optimized code after -O1 is applied for the same lines is shown below:
> 
> 1. if (_2 == &rt)
> 2.goto ; [30.00%]
> 3. else
> 4.goto ; [70.00%]
> 5.
> 6.   [local count: 322122547]:
> 7.   i_12 = rt.b;
> 8.   goto ; [100.00%]
> 9.
> 10.   [local count: 751619277]:
> 11.   if (_1 != 0)
> 12.   goto ; [50.00%]
> 13.   else
> 14.goto ; [50.00%]
> 15.
> 16.   [local count: 375809638]:
> 17.   i_11 = MEM[(dependent_ptr struct rcutest *)_2].c;
> 18.
> 19.[local count: 1073741824]:
> 20.  # i_7 = PHI 
> 21.   _3 = i_7 < 0;
> 22.   _4 = (int) _3;
> 23.   assert (_4);
> 24.  _5 = j_10 < 0;
> 25.  _6 = (int) _5;
> 26.   assert (_6);
> 27.   return;

Good show on tracing this through!

> Statement 19 in the program gets converted from  i_19 = p_16->b; in line 7
> in unoptimized code to i_12 = rt.b; in line 7 in optimized code which
> breaks the dependency chain. We need to figure out the pass that does that
> and put some handling code in there for the _dependent_ptr qualified
> pointers. Passing simply -fipa-pure-const, -fguess-branch-probability or
> any other option alone does not produce the optimized code that breaks the
> dependency. But applying -O1, i.e., allowing all the optimizations does so.
> As passes are applied in a certain order, we need to figure out up to
> which pass the code remains the same and after which pass the dependency
> no longer holds. So we need to check the translated code after every pass.
> 
> Does this sound like a workable plan? Let me know your thoughts. If this
> sounds good, we can do this for all the optimizations that may kill
> dependencies at some point.

I don't know of a better plan.

My usual question...  Is there some way to script the checking of the
translated code at the end of each pass?

Thanx, Paul
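One real mechanism for this is GCC's -fdump-tree-all option, which writes one GIMPLE dump per pass, named like test.c.123t.passname with the number giving pass order. The sketch below shows the scanning side of such a script; the two dump files are simulated stand-ins (their names and contents are assumptions) so the scan logic can run without invoking GCC on the actual example:

```shell
# Real invocation would be:  gcc -O1 -fdump-tree-all -c test.c
# Here we fabricate two dumps to exercise the scan itself.
mkdir -p dumps
printf 'i_19 = p_16->b;\n' > dumps/test.c.100t.forwprop1   # dependency intact
printf 'i_12 = rt.b;\n'    > dumps/test.c.200t.dom2        # direct load: broken

result=none
for f in $(ls dumps/test.c.* | sort); do
  # A load straight from rt.b means the dependency chain through p is gone.
  if grep -q 'rt\.b' "$f"; then
    result=${f##*.}
    break
  fi
done
echo "dependency first broken after pass: $result"
```

Pointing the same loop at a directory of real dumps would identify the first pass whose output no longer loads through the dependent pointer.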



Re: Importance of transformations that turn data dependencies into control dependencies?

2016-03-01 Thread Paul E. McKenney
On Tue, Mar 01, 2016 at 05:55:07PM +0100, Michael Matz wrote:
> Hi,
> 
> On Tue, 1 Mar 2016, Richard Biener wrote:
> 
> > > What about the example I gave above?  Is it unrealistic for compilers 
> > > do ever do something like this, or is it just unlikely to gain much 
> > > performance, or is it just that GCC does not do this today?
> > 
> > GCC does not do this today with the exception of value-profiling.  GCC 
> > in other cases does not establish equivalences but only relations (a < 
> > b, etc.) that are not a problem as far as I can see because those do not 
> > allow to change expressions using a to use b.
> 
> Made up example using relations:
> 
> int32 a, b;
> a = (b >> 31) & 1;
> 
> -->
> 
> if (b < 0)
>   a = 1;
> else
>   a = 0;
> 
> data-dep to control-dep and only relations :)  (I think this is taken care 
> of by Pauls wording, ignoring the fact that these aren't pointers anyway 
> and hence don't carry a dependency through them, only onto them at max)

Agreed, these are not pointers, and therefore do not carry dependencies.

Thanx, Paul



Re: [isocpp-parallel] Proposal for new memory_order_consume definition

2016-02-29 Thread Paul E. McKenney
On Mon, Feb 29, 2016 at 07:17:55PM +0100, Michael Matz wrote:
> Hi,
> 
> On Sat, 27 Feb 2016, Paul E. McKenney wrote:
> 
> > But we do already have something very similar with signed integer
> > overflow.  If the compiler can see a way to generate faster code that
> > does not handle the overflow case, then the semantics suddenly change
> > from twos-complement arithmetic to something very strange.  The standard
> > does not specify all the ways that the implementation might deduce that
> > faster code can be generated by ignoring the overflow case, it instead
> > simply says that signed integer overflow invokes undefined behavior.
> > 
> > And if that is a problem, you use unsigned integers instead of signed
> > integers.
> > 
> > So it seems that we should be able to do something very similar here.
> 
> For this case the important pice of information to convey one or the other 
> meaning in source code is the _type_ of involved entities, not annotations 
> on the operations.  signed type -> undefined overflow, unsigned type -> 
> modulo arithmetic; easy, and it nicely carries automatically through 
> operation chains (and pointers) without any annotations.
> 
> I feel much of the complexity in the memory order specifications, also 
> with your recent (much better) wording to explain dependency chains, would 
> be much easier if the 'carries-dependency' would be encoded into the types 
> of operands.  For purpose of example, let's call the marker "blaeh" (not 
> atomic to not confuse with existing use :) ):
> 
> int foo;
> blaeh int global;
> int *somep;
> blaeh int *blaehp;
> f () {
>   blaehp = &foo;  // might be okay, adds restrictions on accesses through
>   // blaehp, but not through 'foo' directly
>   blaehp = &global;
>   if (somep == blaehp)
> {
>   /* Even though the value is equal ... */
>   ... *blaehp ... /* ... a compiler can't rewrite this into *somep */
> }
> }
> 
> A "carries-dependency" on some operation (e.g. a call) would be added by 
> using a properly typed pointer at those arguments (or return type) where 
> it matters.  You can't give a blaeh pointer to something only accepting 
> non-blaeh pointers (without cast).
> 
> Pointer addition and similar transformations involving a blaeh pointer and 
> some integer would still give a blaeh pointer, and hence by default also 
> solve the problem of cancellations.
> 
> Such marking via types would not solve all problems in an optimal way if 
> you had two overlapping but independend dependency chains (all of them 
> would collapse to one chain and hence made dependend, which still is 
> conservatively correct).
> 
> OTOH introducing new type qualifiers is a much larger undertaking, so I 
> can understand one wants to avoid this.  I think it'd ultimately be 
> clearer, though.

As has been stated in this thread, we do need the unmarked variant.

For the marked variant, there are quite a few possible solutions with
varying advantages and disadvantages:

o   Attribute already exists, but is not carried by the type system.
Could be enforced by external tools.

o   Storage class could be added with fewer effects on the type
system, but the reaction to this suggestion in October was
not all that positive.

o   Non-type keywords for objects have been suggested, might be worth
revisiting.

o   Adding to the type system allows type enforcement on the one
hand, but makes it harder to write code that can be used for
both RCU-protected and not-RCU-protected data structures.
(This sort of thing is not uncommon in the Linux kernel.)

There are probably others, but those are the ones I recall at the
moment.

Thanx, Paul



Re: [isocpp-parallel] Proposal for new memory_order_consume definition

2016-02-27 Thread Paul E. McKenney
On Sat, Feb 27, 2016 at 11:16:51AM -0800, Linus Torvalds wrote:
> On Feb 27, 2016 09:06, "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
> wrote:
> >
> >
> > But we do already have something very similar with signed integer
> > overflow.  If the compiler can see a way to generate faster code that
> > does not handle the overflow case, then the semantics suddenly change
> > from twos-complement arithmetic to something very strange.  The standard
> > does not specify all the ways that the implementation might deduce that
> > faster code can be generated by ignoring the overflow case, it instead
> > simply says that signed integer overflow invokes undefined behavior.
> >
> > And if that is a problem, you use unsigned integers instead of signed
> > integers.
> 
> Actually, in the case of the Linux kernel we just tell the compiler to
> not be an ass. We use
> 
>   -fno-strict-overflow

That is the one!

> or something. I forget the exact compiler flag needed for "the standard is
> as broken piece of shit and made things undefined for very bad reasons".
> 
> See also the idiotic standard C alias rules. Same deal.

For which we use -fno-strict-aliasing.

> So no, standards aren't that important. When the standards screw up, the
> right answer is not to turn the other cheek.

Agreed, hence my current (perhaps quixotic and insane) attempt to get
the standard to do something useful for dependency ordering.  But if
that doesn't work, yes, a fallback position is to get the relevant
compilers to provide flags to avoid problematic behavior, similar to
-fno-strict-overflow.

Thanx, Paul

> And undefined behavior is pretty much *always* a sign of "the standard is
> wrong".
> 
>  Linus



Re: [isocpp-parallel] Proposal for new memory_order_consume definition

2016-02-27 Thread Paul E. McKenney
On Thu, Feb 25, 2016 at 04:46:50PM -0800, Hans Boehm wrote:
> If carries_dependency affects semantics, then it should not be an attribute.

I am not picky about the form of the marking.

> The original design, or at least my understanding of it, was that it not
> have semantics; it was only a suggestion to the compiler that it should
> preserve dependencies instead of inserting a fence at the call site.
> Dependency-based ordering would be preserved in either case.  But I think
> we're moving away from that view towards something that doesn't quietly add
> fences.

Yes, we do need to allow typical implementations to avoid quiet fence
addition.

> I do not think we can quite get away with defining a dependency in a way
> that is unconditionally preserved by existing compilers, and thus I think
> that we do probably need annotations along the dependency path.  I just
> don't see a way to otherwise deal with the case in which a compiler infers
> an equivalent pointer and dereferences that instead of the original.  This
> can happen under so many (unlikely but) hard-to-define conditions that it
> seems undefinable in an implementation-independent manner. "If the
> implementation is able then " is, in my opinion, not
> acceptable standards text.

Hmmm...

But we do already have something very similar with signed integer
overflow.  If the compiler can see a way to generate faster code that
does not handle the overflow case, then the semantics suddenly change
from twos-complement arithmetic to something very strange.  The standard
does not specify all the ways that the implementation might deduce that
faster code can be generated by ignoring the overflow case, it instead
simply says that signed integer overflow invokes undefined behavior.

And if that is a problem, you use unsigned integers instead of signed
integers.

So it seems that we should be able to do something very similar here.
If you don't use marking, and the compiler deduces that a given pointer
that carries a given dependency is equal to some other pointer not
carrying that same dependency, there is no dependency ordering.  And,
just as with the signed-integer-overflow case, if that is a problem for
you, you can mark the pointers that you intend to carry dependencies.

In both the signed-integer-overflow and pointer-value-deduction cases,
most use cases don't need to care.  In the integer case, this is because
most use cases have small integer values that don't overflow.  In the
pointer case, this is because when the data structure is composed of
lots of heap-allocated data items, the compiler really cannot deduce
anything.

Other safe pointer use cases involve statically allocated data items
whose contents are compile-time constants (thus avoiding the need for
any sort of ordering) and sentinel data items (as in the Linux kernel's
circular linked lists) where there is no dereferencing.

> Thus I see no way to both avoid adding syntax to functions that preserve
> dependencies and continue to allow existing transformations that remove
> dependencies we care about, e.g. due to equality comparisons.  We can
> hopefully ensure that without annotations compilers break things with very
> low probability, so that there is a reasonable path forward for existing
> code relying on dependency ordering (which currently also breaks with very
> low probability unless you understand what the compiler is doing).  But I
> don't see a way for the standard to guarantee correctness without the added
> syntax (or added optimization constraints that effectively assume all
> functions were annotated).

Your second sentence ("We can hopefully ensure...") does give me hope
that we might be able to reach agreement.  The intent of P0190R0 is
to define a subset of operations where dependencies will be carried.
Note that P0190R0 does call out comparisons as potentially unsafe.

        Thanx, Paul

> On Sat, Feb 20, 2016 at 11:53 AM, Paul E. McKenney <
> paul...@linux.vnet.ibm.com> wrote:
> 
> > On Fri, Feb 19, 2016 at 09:15:16PM -0500, Tony V E wrote:
> > > There's at least one easy answer in there:
> > >
> > > > ‎If implementations must support annotation, what form should that
> > >   annotation take?  P0190R0 recommends the [[carries_dependency]]
> > >   attribute, but I am not picky as long as it can be (1) applied
> > >   to all relevant pointer-like objects and (2) used in C as well
> > >   as C++.  ;-)
> > >
> > > If an implementation must support it, then it is not an annotation but a
> > keyword. So no [[]]
> >
> > I would be good with that approach, especially if the WG14 continues
> > to stay away from annotations.
> >
> > For whatever it is worth, the introduction of intrinsics for comparisons
> > that avoid breaking dependencies enables the annotation to remain
> > optional.

Re: Importance of transformations that turn data dependencies into control dependencies?

2016-02-26 Thread Paul E. McKenney
On Fri, Feb 26, 2016 at 11:49:29AM +0100, Richard Biener wrote:
> On Thu, Feb 25, 2016 at 6:33 PM, Torvald Riegel  wrote:
> > On Wed, 2016-02-24 at 13:14 +0100, Richard Biener wrote:
> >> On Tue, Feb 23, 2016 at 8:38 PM, Torvald Riegel  wrote:
> >> > I'd like to know, based on the GCC experience, how important we consider
> >> > optimizations that may turn data dependencies of pointers into control
> >> > dependencies.  I'm thinking about all optimizations or transformations
> >> > that guess that a pointer might have a specific value, and then create
> >> > (specialized) code that assumes this value that is only executed if the
> >> > pointer actually has this value.  For example:
> >> >
> >> > int d[2] = {23, compute_something()};
> >> >
> >> > int compute(int v) {
> >> >   if (likely(v == 23)) return 23;
> >> >   else ;
> >> > }
> >> >
> >> > int bar() {
> >> >   int *p = ptr.load(memory_order_consume);
> >> >   size_t reveal_that_p_is_in_d = p - &d[0];
> >> >   return compute(*p);
> >> > }
> >> >
> >> > Could be transformed to (after inlining compute(), and specializing for
> >> > the likely path):
> >> >
> >> > int bar() {
> >> >   int *p = ptr.load(memory_order_consume);
> >> >   if (p == d) return 23;
> >> >   else ;
> >> > }
> >>
> >> Note that if a user writes
> >>
> >>   if (p == d)
> >>{
> >>  ... do lots of stuff via p ...
> >>}
> >>
> >> GCC might rewrite accesses to p as accesses to d and thus expose
> >> those opportunities.  Is that a transform that isn't valid then or is
> >> the code written by the user (establishing the equivalency) to blame?
> >
> > In the context of this memory_order_consume proposal, this transform
> > would be valid because the program has already "revealed" what value p
> > has after the branch has been taken.
> >
> >> There's a PR where this kind of equivalencies lead to unexpected (wrong?)
> >> points-to results for example.
> >>
> >> > Other potential examples that come to mind are de-virtualization, or
> >> > feedback-directed optimizations that has observed at runtime that a
> >> > certain pointer is likely to be always equal to some other pointer (eg.,
> >> > if p is almost always d[0], and specializing for that).
> >>
> >> That's the cases that are quite important in practice.
> >
> > Could you quantify this somehow, even if it's a very rough estimate?
> > I'm asking because it's significant and widely used, then this would
> > require users or compiler implementors to make a difficult trade-off
> > (ie, do you want mo_consume performance or performance through those
> > other optimizations?).
> 
> Probably resoved by your followup on how the transform is safe anyway.
> 
> >> > Also, it would be interesting to me to know how often we may turn data
> >> > dependencies into control dependencies in cases where this doesn't
> >> > affect performance significantly.
> >>
> >> I suppose we try to avoid that but can we ever know for sure?  Like
> >> speculative devirtualization does this (with the intent that it _does_ 
> >> matter,
> >> of course).
> >>
> >> I suppose establishing non-dependence isn't an issue, like with the
> >> vectorizer adding runtime dependence checks and applying versioning
> >> to get a vectorized and a not vectorized loop (in case there are 
> >> dependences)?
> >
> > I'm not sure I understand you correctly.  Do you have a brief example,
> > perhaps?  For mo_consume and its data dependencies, if there might be a
> > dependence, the compiler would have to preserve it; but I guess that
> > both a vectorized loop an one that accessses each element separately
> > would preserve dependences because it's doing those accesses, and they
> > depend on the input data.
> > OTOH, peraps HW vector instructions don't get the ordering guarantees
> > from data dependences -- Paul, do you know of any such cases?
> 
> A brief example would be for
> 
>  void foo (int *a, int *b, int n)
> {
>   for (int i = 0; i < n; ++i)
>     a[i] = b[i];
> }
> 
> which we can vectorize like
> 
>   if (a + n < b || b + n < a)
>     {
>       vectorized loop
>     }
>   else
>     {
>       not vectorized loop
>     }
> 
> note how we're not establishing equivalences between pointers but
> non-dependence vs. possible dependence.

Thank you for the clarification, I will check.

If I am not too confused, such a loop might want to use x86 non-temporal
stores, which require special handling even with acquire and release,
but that is presumably already handled.

Thanx, Paul

> >> > The background for this question is Paul McKenney's recently updated
> >> > proposal for a different memory_order_consume specification:
> >> > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0190r0.pdf
> >> >
> >> > In a nutshell, this requires a compiler to either prove that a pointer
> >> > value is not carrying a dependency (simplified, its value somehow
> >> > originates from a 

Re: [isocpp-parallel] Proposal for new memory_order_consume definition

2016-02-20 Thread Paul E. McKenney
On Fri, Feb 19, 2016 at 09:15:16PM -0500, Tony V E wrote:
> There's at least one easy answer in there:
> 
> > ‎If implementations must support annotation, what form should that
>   annotation take?  P0190R0 recommends the [[carries_dependency]]
>   attribute, but I am not picky as long as it can be (1) applied
>   to all relevant pointer-like objects and (2) used in C as well
>   as C++.  ;-)
> 
> If an implementation must support it, then it is not an annotation but a 
> keyword. So no [[]] 

I would be good with that approach, especially if the WG14 continues
to stay away from annotations.

For whatever it is worth, the introduction of intrinsics for comparisons
that avoid breaking dependencies enables the annotation to remain
optional.

Thanx, Paul

> Sent from my BlackBerry portable Babbage Device
>   Original Message  
> From: Paul E. McKenney
> Sent: Thursday, February 18, 2016 4:58 AM
> To: paral...@lists.isocpp.org; linux-ker...@vger.kernel.org; 
> linux-a...@vger.kernel.org; gcc@gcc.gnu.org; llvm-...@lists.llvm.org
> Reply To: paral...@lists.isocpp.org
> Cc: pet...@infradead.org; j.algl...@ucl.ac.uk; will.dea...@arm.com; 
> dhowe...@redhat.com; ramana.radhakrish...@arm.com; luc.maran...@inria.fr; 
> a...@linux-foundation.org; peter.sew...@cl.cam.ac.uk; 
> torva...@linux-foundation.org; mi...@kernel.org
> Subject: [isocpp-parallel] Proposal for new memory_order_consume definition
> 
> Hello!
> 
> A proposal (quaintly identified as P0190R0) for a new memory_order_consume
> definition may be found here:
> 
> http://www2.rdrop.com/users/paulmck/submission/consume.2016.02.10b.pdf
> 
> As requested at the October C++ Standards Committee meeting, this
> is a follow-on to P0098R1 that picks one alternative and describes
> it in detail. This approach focuses on existing practice, with the
> goal of supporting existing code with existing compilers. In the last
> clang/LLVM patch I saw for basic support of this change, you could count
> the changed lines and still have lots of fingers and toes left over.
> Those who have been following this story will recognize that this is
> a very happy contrast to work that would be required to implement the
> definition in the current standard.
> 
> I expect that P0190R0 will be discussed at the upcoming C++ Standards
> Committee meeting taking place the week of February 29th. Points of
> discussion are likely to include:
> 
> o May memory_order_consume dependency ordering be used in
> unannotated code? I believe that this must be the case,
> especially given that this is our experience base. P0190R0
> therefore recommends this approach.
> 
> o If memory_order_consume dependency ordering can be used in
> unannotated code, must implementations support annotation?
> I believe that annotation support should be required, at the very
> least for formal verification, which can be quite difficult to
> carry out on unannotated code. In addition, it seems likely
> that annotations can enable much better diagnostics. P0190R0
> therefore recommends this approach.
> 
> o If implementations must support annotation, what form should that
> annotation take? P0190R0 recommends the [[carries_dependency]]
> attribute, but I am not picky as long as it can be (1) applied
> to all relevant pointer-like objects and (2) used in C as well
> as C++. ;-)
> 
> o If memory_order_consume dependency ordering can be used in
> unannotated code, how best to define the situations where
> the compiler can determine the exact value of the pointer in
> question? (In current de facto implementations, this can
> defeat dependency ordering. Interestingly enough, this case
> is not present in the Linux kernel, but still needs to be
> defined.)
> 
> Options include:
> 
> o Provide new intrinsics that carry out the
> comparisons, but guarantee to preserve dependencies,
> as recommended by P0190R0 (std::pointer_cmp_eq_dep(),
> std::pointer_cmp_ne_dep(), std::pointer_cmp_gt_dep(),
> std::pointer_cmp_ge_dep(), std::pointer_cmp_lt_dep(),
> and std::pointer_cmp_le_dep()).
> 
> o State that -any- comparison involving an unannotated
> pointer loses the dependency.
> 
> o How is the common idiom of marking pointers by setting low-order
> bits to be supported when those pointers carry dependencies?
> At the moment, I believe that setting bits in pointers results in
> undefined behavior even without dependency ordering, so P0190R0
> kicks this particular can down the road. One option that
> has been suggested is to provide intrinsics for this purpose.
> (Sorry, but I forget who suggested this.)
>

Re: [c++std-parallel-2008] Re: Compilers and RCU readers: Once more unto the breach!

2015-09-23 Thread Paul E. McKenney
r choices, regardless
of my opinions.)

That marking could be the [[carries_dependency]] attribute (at least
assuming that C adds attributes), a type modifier (which might or might
not go over well in Core), a variable modifier (which I don't understand
well enough to have an opinion), or a storage class as suggested in 7.10.

I am not all that worried about exactly what sort of marking we use,
but I am convinced that marking is required.

So, given that we need to mark things, what sort of marking would work
best from your perspective?

    Thanx, Paul

> On Tue, Sep 22, 2015 at 10:00 AM, Paul E. McKenney <
> paul...@linux.vnet.ibm.com> wrote:
> 
> > On Mon, Jul 13, 2015 at 05:44:59PM -0700, Paul E. McKenney wrote:
> > > On Tue, May 19, 2015 at 05:55:10PM -0700, Paul E. McKenney wrote:
> > > > Hello!
> > > >
> > > > Following up on last year's discussion (
> > https://lwn.net/Articles/586838/,
> > > > https://lwn.net/Articles/588300/), I believe that we have a
> > solution.  If
> > > > I am wrong, I am sure you all will let me know, and in great detail.
> > ;-)
> > > >
> > > > The key simplification is to "just say no" to RCU-protected array
> > indexes:
> > > > https://lkml.org/lkml/2015/5/12/827, as was suggested by several
> > people.
> > > > This simplification means that rcu_dereference (AKA
> > memory_order_consume)
> > > > need only return pointers.  This in turn avoids things like (x-x),
> > > > (x*0), and (x%1) because if "x" is a pointer, these expressions either
> > > > return non-pointers or are compilation errors.  With a very few
> > exceptions,
> > > > dependency chains can lead -to- non-pointers, but cannot pass -through-
> > > > them.
> > > >
> > > > The result is that dependencies are carried only by operations for
> > > > which the compiler cannot easily optimize away those dependencies,
> > > > these operations including simple assignment, integer offset (including
> > > > indexing), dereferencing, casts, passing as a function argument, return
> > > > values from functions and so on.  A complete list with commentary
> > starts
> > > > on page 28 of:
> > > >
> > > > http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf
> > >
> > > And an update is available here:
> > >
> > >   http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf
> > >
> > > Among other things, this update addresses the points about equality
> > > comparisons introduced by the compiler, and outlines some of the
> > > issues head-/tail-marked alternatives face with respect to abstraction.
> > > The updated "Restricted Dependency Chains" section starts on page 28.
> > >
> > > Thoughts?
> >
> > This is updated based on many offline discussions.  Section 7.9 has
> > been updated accordingly, but still starts on page 28.  This approach is
> > again intended for large existing RCU code bases such as the Linux kernel.
> >
> > A new Section 7.10 starting on page 35 describes a possible longer-term
> > solution for new-to-RCU code bases.  This approach adds a storage class
> > that indicates that the marked object carries dependencies.  This allows
> > better diagnostics (and arguably better code self-documentation) as well
> > as providing something that formal-verification tools can handle while
> > avoiding the need for compilers to trace dependency chains.
> >
> > http://www.rdrop.com/users/paulmck/submission/consume.2015.09.22a.pdf
> >
> > Thoughts?
> >
> > Thanx, Paul
> >
> >
> >



Re: Compilers and RCU readers: Once more unto the breach!

2015-09-22 Thread Paul E. McKenney
On Mon, Jul 13, 2015 at 05:44:59PM -0700, Paul E. McKenney wrote:
> On Tue, May 19, 2015 at 05:55:10PM -0700, Paul E. McKenney wrote:
> > Hello!
> > 
> > Following up on last year's discussion (https://lwn.net/Articles/586838/,
> > https://lwn.net/Articles/588300/), I believe that we have a solution.  If
> > I am wrong, I am sure you all will let me know, and in great detail.  ;-)
> > 
> > The key simplification is to "just say no" to RCU-protected array indexes:
> > https://lkml.org/lkml/2015/5/12/827, as was suggested by several people.
> > This simplification means that rcu_dereference (AKA memory_order_consume)
> > need only return pointers.  This in turn avoids things like (x-x),
> > (x*0), and (x%1) because if "x" is a pointer, these expressions either
> > return non-pointers or are compilation errors.  With a very few exceptions,
> > dependency chains can lead -to- non-pointers, but cannot pass -through-
> > them.
> > 
> > The result is that dependencies are carried only by operations for
> > which the compiler cannot easily optimize away those dependencies,
> > these operations including simple assignment, integer offset (including
> > indexing), dereferencing, casts, passing as a function argument, return
> > values from functions and so on.  A complete list with commentary starts
> > on page 28 of:
> > 
> > http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf
> 
> And an update is available here:
> 
>   http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf
> 
> Among other things, this update addresses the points about equality
> comparisons introduced by the compiler, and outlines some of the
> issues head-/tail-marked alternatives face with respect to abstraction.
> The updated "Restricted Dependency Chains" section starts on page 28.
> 
> Thoughts?

This is updated based on many offline discussions.  Section 7.9 has
been updated accordingly, but still starts on page 28.  This approach is
again intended for large existing RCU code bases such as the Linux kernel.

A new Section 7.10 starting on page 35 describes a possible longer-term
solution for new-to-RCU code bases.  This approach adds a storage class
that indicates that the marked object carries dependencies.  This allows
better diagnostics (and arguably better code self-documentation) as well
as providing something that formal-verification tools can handle while
avoiding the need for compilers to trace dependency chains.

http://www.rdrop.com/users/paulmck/submission/consume.2015.09.22a.pdf

Thoughts?

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-07-13 Thread Paul E. McKenney
On Tue, May 19, 2015 at 05:55:10PM -0700, Paul E. McKenney wrote:
 Hello!
 
 Following up on last year's discussion (https://lwn.net/Articles/586838/,
 https://lwn.net/Articles/588300/), I believe that we have a solution.  If
 I am wrong, I am sure you all will let me know, and in great detail.  ;-)
 
 The key simplification is to "just say no" to RCU-protected array indexes:
 https://lkml.org/lkml/2015/5/12/827, as was suggested by several people.
 This simplification means that rcu_dereference (AKA memory_order_consume)
 need only return pointers.  This in turn avoids things like (x-x),
 (x*0), and (x%1) because if "x" is a pointer, these expressions either
 return non-pointers or are compilation errors.  With a very few exceptions,
 dependency chains can lead -to- non-pointers, but cannot pass -through-
 them.
 
 The result is that dependencies are carried only by operations for
 which the compiler cannot easily optimize away those dependencies,
 these operations including simple assignment, integer offset (including
 indexing), dereferencing, casts, passing as a function argument, return
 values from functions and so on.  A complete list with commentary starts
 on page 28 of:
 
   http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf

And an update is available here:

http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf

Among other things, this update addresses the points about equality
comparisons introduced by the compiler, and outlines some of the
issues head-/tail-marked alternatives face with respect to abstraction.
 The updated "Restricted Dependency Chains" section starts on page 28.

Thoughts?

Thanx, Paul



Re: [c++std-parallel-1651] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-26 Thread Paul E. McKenney
On Tue, May 26, 2015 at 07:08:35PM +0200, Torvald Riegel wrote:
 On Tue, 2015-05-19 at 17:55 -0700, Paul E. McKenney wrote: 
  http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf
 
 I have been discussing Section 7.9 with Paul during last week.
 
 While I think that 7.9 helps narrow down the problem somewhat, I'm still
 concerned that it effectively requires compilers to either track
 dependencies or conservatively prevent optimizations like value
 speculation and specialization based on that.  Neither is a good thing
 for a compiler.

I do believe that we can find some middle ground.

 7.9 adds requirements that dependency chains stop if the program itself
 informs the compiler about the value of something in the dependency
 chain (e.g., as shown in Figure 33).  Also, if a correct program that
 does not have undefined behavior must use a particular value, this is
 also considered as informing the compiler about that value.  For
 example:
   int arr[2];
   int* x = foo.load(mo_consume);
   if (x > arr)   // implies same object/array, so x is in arr[]
 int r1 = *x; // compiler knows x == arr + 1
 The program, assuming no undefined behavior, first tells the compiler
 that x should be within arr, and then the comparison tells the compiler
 that x!=arr, so x==arr+1 must hold because there are just two elements
 in the array.

The updated version of Section 7.9 says that if undefined behavior
allows the compiler to deduce the exact pointer value, as in the
case you show above, the dependency chain is broken.

 Having these requirements is generally good, but we don't yet know how
 to specify this properly.  For example, I suppose we'd need to also say
 that the compiler cannot assume to know anything about a value returned
 from an mo_consume load; otherwise, nothing prevents a compiler from
 using knowledge about the stores that the mo_consume load can read from
 (see Section 7.2).

I expect that the Linux kernel's rcu_dereference() primitives would
map to volatile memory_order_consume loads for this reason.

 Also, a compiler is still required to not do value-speculation or
 optimizations based on that.  For example, this program:
 
 void op(type *p)
 {
   foo /= p->a;
   bar = p->b;
 }
 void bar()
 {
   pointer = ppp.load(mo_consume);
   op(pointer);
 }
 
 ... must not be transformed into this program, even if the compiler
 knows that global_var->a == 1:
 
 void op(type *p) { /* unchanged */}
 void bar()
 {
 pointer = ppp.load(mo_consume);
   if (pointer != global_var) {
 op(pointer);
   } else { // specialization for global_var
 // compiler knows global_var->a==1;
 // compiler uses global_var directly, inlines, optimizes:
 bar = global_var->b;
   }
 }
 
 The compiler could optimize out the division if pointer==global_var but
 it must not access field b directly through global_var.  This would be
 pretty awkward; the compiler may work based on an optimized expression
 in the specialization (ie, create code that assumes global_var instead
 of pointer) but it would still have to carry around and use the
 non-optimized expression.

Exactly how valuable is this sort of optimization in real code?  And
how likely is the compiler to actually be able to apply it?

(I nevertheless will take another pass through the Linux kernel looking
for global variables being added to RCU-protected linked structures.
Putting a global into an RCU-protected structure seems more likely than
is an RCU-protected pointer into a two-element array.)

 This wouldn't be as bad if it were easily constrained to code sequences
 that really need the dependencies.  However, 7.9 does not effectively
 contain dependencies to only the code that really needs them, IMO.
 Unless the compiler proves otherwise, it has to assume that a load from
 a pointer carries a dependency.  Proving that is often very hard because
 it requires points-to analysis; 7.9 restricts this to intra-thread
 analysis but that is still nontrivial.
 Michael Matz had a similar concern (in terms of what it results in).

Again, I will be looking through the Linux kernel for vulnerabilities to
this sort of transformation.  However, I am having a very hard time seeing
how the compiler is going to know that much about the vast majority of
the Linux-kernel use cases.  The linked structures are allocated on the
heap, not in arrays or in globals.

 Given that mo_consume is useful but a very specialized feature, I
 wouldn't be optimistic that 7.9 would actually be supported by many
 compilers.  The trade-off between having to track dependencies or having
 to disallow optimizations is a bad one to make.  The simple way out for
 a compiler would be to just emit mo_acquire instead of mo_consume and be
 done with all -- and this might be the most practical decision overall,
 or the default general-purpose implementation.  At least I haven't heard
 any compiler implementer say that they think it's obviously worth
 implementing.

I believe that we need

Re: Compilers and RCU readers: Once more unto the breach!

2015-05-22 Thread Paul E. McKenney
On Fri, May 22, 2015 at 08:43:44AM +0200, Ingo Molnar wrote:
 
 * Linus Torvalds torva...@linux-foundation.org wrote:
 
   (a) the official rules are completely pointless, and make sense 
  only because the standard is written for some random abstract 
  machine that doesn't actually exist.
 
 Presuming the intent of the abstract machine specification is to avoid 
 being seen as biased towards any specific machine (politics), maybe 
 write this as:
 
(a) the official rules are written for a somewhat weird and 
complex union of all known and theoretically possible CPU 
architectures that exist or which might exist in the future, 
which machine does not actually exist in practice, but which 
allows a single abstract set of rules to apply to all machines. 
These rules are complex, but if applied to a specific machine 
they become considerably simpler. Here's a few examples: ...
 
 ?
 
 (Assuming it's a goal of this standard to be human parseable to more 
 than a few dozen people on the planet.)

Should something based on Section 7.9 go in, then I would need to add
a more developer-friendly explanation in Documentation/RCU, no two
ways about it!  ;-)

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-22 Thread Paul E. McKenney
On Fri, May 22, 2015 at 06:43:32AM -0400, Richard Kenner wrote:
  (Assuming it's a goal of this standard to be human parseable to more 
  than a few dozen people on the planet.)
 
 Unfortunately, that's rarely a goal of most standards. ;-)

My experience does match Richard's, sad to say.  There has been some
vigorous discussion in another thread involving undefined behavior and
value narrowing, which has resulted in some useful changes to Section 7.9.
The updated document is here:

http://www2.rdrop.com/users/paulmck/RCU/consume.2015.05.22a.pdf

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-22 Thread Paul E. McKenney
On Fri, May 22, 2015 at 06:30:29PM +0100, Will Deacon wrote:
 Hi Paul,
 
 On Thu, May 21, 2015 at 09:02:12PM +0100, Paul E. McKenney wrote:
  On Thu, May 21, 2015 at 08:24:22PM +0100, Will Deacon wrote:
   On Wed, May 20, 2015 at 07:16:06PM +0100, Paul E. McKenney wrote:
On to #5:

r1 = atomic_load_explicit(x, memory_order_consume);
if (r1 == 42)
  atomic_store_explicit(y, r1, memory_order_relaxed);

r2 = atomic_load_explicit(y, memory_order_consume);
if (r2 == 42)
  atomic_store_explicit(x, 42, memory_order_relaxed);

The first thread's accesses are dependency ordered.  The second thread's
ordering is in a corner case that memory-barriers.txt does not cover.
You are supposed to start control dependencies with READ_ONCE_CTRL(), not
a memory_order_consume load (AKA rcu_dereference and friends).  However,
Alpha would have a full barrier as part of the memory_order_consume load,
and the rest of the processors would (one way or another) respect the
control dependency.  And the compiler would have some fun trying to
break it.
   
   But this is interesting because the first thread is ordered whilst the
   second is not, so doesn't that effectively forbid the compiler from
   constant-folding values if it can't prove that there is no dependency
   chain?
  
  You lost me on this one.  Are you suggesting that the compiler
  speculate the second thread's atomic store?  That would be very
  bad regardless of dependency chains.
  
  So what constant-folding optimization are you thinking of here?
  If the above example is not amenable to such an optimization, could
  you please give me an example where constant folding would apply
  in a way that is sensitive to dependency chains?
 
 Unless I'm missing something, I can't see what would prevent a compiler
 from looking at the code in thread1 and transforming it into the code in
 thread2 (i.e. constant folding r1 with 42 given that the taken branch
 must mean that r1 == 42). However, such an optimisation breaks the
 dependency chain, which means that a compiler needs to walk backwards
 to see if there is a dependency chain extending to r1.

Indeed!  Which is one reason that (1) integers are not allowed in
dependency chains with a very few extremely constrained exceptions and
(2) sequences of comparisons and/or undefined-behavior considerations
that allow the compiler to exactly determine the pointer value break
the dependency chain.

So the current Linux memory model would allow (r1 == 42 && r2 == 42),
but I don't know of any hardware/compiler combination that would
allow it.  And no, I am -not- going to update memory-barriers.txt for
this litmus test, its theoretical interest notwithstanding!  ;-)
 
 Of course, I'm not asking for that at all! I'm just trying to see how
 your proposal holds up with the example.

Whew!  ;-)

Thanx, Paul



Re: [c++std-parallel-1632] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-21 Thread Paul E. McKenney
On Thu, May 21, 2015 at 06:17:43PM +0200, Michael Matz wrote:
 Hi,
 
 On Thu, 21 May 2015, Paul E. McKenney wrote:
 
  The point is -exactly- to codify the current state of affairs.
 
 Ah, I see, so it's not yet about creating a more useful (for compilers, 
 that is) model.

There are several approaches being considered for that as well, but we
do need to codify current usage.

 char * fancy_assign (char *in) { return in; }
 ...
 char *x, *y;
 
 x = atomic_load_explicit(p, memory_order_consume);
 y = fancy_assign (x);
 atomic_store_explicit(q, y, memory_order_relaxed);
   
   So, is there, or is there not a dependency carried from x to y in your 
   proposed model (and which rule in your document states so)?  Clearly, 
    without any other language the compiler would have to assume that there is 
    (because the equivalent 'y = x' assignment would carry the dependency).  
  
  The dependency is not carried, though this is due to the current set
  of rules not covering atomic loads and stores, which I need to fix.
 
 Okay, so with the current regime(s), the dependency carries ...

Yes, that is the intent.

  o   Rule 14 says that if a value is part of a dependency chain and
  is used as the actual parameter of a function call, then the
  dependency chain extends to the corresponding formal parameter,
  namely in of fancy_assign().
  
  o   Rule 15 says that if a value is part of a dependency chain and
  is returned from a function, then the dependency chain extends
  to the returned value in the calling function.
  
  o   And you are right.  I need to make the first and second rules
  cover the relaxed atomic operations, or at least atomic loads and
  stores.  Not that this is an issue for existing Linux-kernel code.
  
  But given such a change, the new version of rule 2 would
  extend the dependency chain to cover the atomic_store_explicit().
 
 ... (if this detail would be fixed).  Okay, that's quite awful ...
 
   If it has to assume this, then the whole model is not going to work 
   very well, as usual with models that assume a certain less-optimal 
    fact (carries-dep is less optimal for code generation purposes than 
   not-carries-dep) unless very specific circumstances say it can be 
   ignored.
  
  Although that is a good general rule of thumb, I do not believe that it 
  applies to this situation, with the exception that I do indeed assume 
  that no one is insane enough to do value-speculation optimizations for 
  non-NULL values on loads from pointers.
  
  So what am I missing here?
 
 ... because you are then missing that if carries-dep can flow through 
 function calls from arguments to return values by default, the compiler 
 has to assume this in fact always happens when it can't see the function 
 body, or can't analyze it.  In effect that's making the whole carries-dep 
 stops at these and those uses a useless exercise because a malicious 
 user (malicious in the sense of abusing the model to show that it's 
 hindering optimizations), i.e. me, can hide all such carries-dep stopping 
 effects inside a function, et voila, the dependency carries through.  So 
 for a slightly more simple example:
 
   extern void *foo (void *);  // body not available
   x = load
   y = foo (x);
   store (y)
 
 the compiler has to assume that there's a dep-chain from x to y; always.  

Yes, the compiler does have to make this assumption.  And the intent
behind the rules is to ensure that this assumption does not get in the
way of reasonable optimizations.  So although I am sure that you are as
busy as the rest of us, I really do need you to go through the rules in
detail before you get too much more excited about this.

 What's worse, it also has to assume a carries-dep for this:
 
   extern void foo (void *in, void **out1, void **out2);
   x = load
   foo (x, o1, o2);
   store (o1);
   store (o2);
 
 Now the compiler has to assume that the body of 'foo' is just mean enough 
 to make the dep-chain carry from in to *out1 or *out2 (i.e. it has to 
 assume that for both).  This extends to _all_ memory accessible from foo's 
 body, i.e. generally all global and all local address-taken variables, so 
 as soon as you have a function call into which a dep-chain value flows 
 you're creating a dep-chain extension from that value to each and every 
 global piece of memory, because the compiler cannot assume that the 
 black box called foo is not mean.  This could conceivably be stopped by 
 making normal stores not to carry the dependency; then only the return 
 value might be infected; but I don't see that in your rules, as a normal 
 store is just an assigment in your model and hence rules 1 and 2 apply 
 (that is, carries-dep flows through all assignments, incl. loads and 
 stores).
 
 Basically whenever you can construct black boxes for the compiler, you 
 have to limit their effects on such transitive relations like carries-dep 
 by default

Re: Compilers and RCU readers: Once more unto the breach!

2015-05-21 Thread Paul E. McKenney
On Thu, May 21, 2015 at 08:24:22PM +0100, Will Deacon wrote:
 On Wed, May 20, 2015 at 07:16:06PM +0100, Paul E. McKenney wrote:
  On Wed, May 20, 2015 at 04:46:17PM +0100, Will Deacon wrote:
   On Wed, May 20, 2015 at 01:15:22PM +0100, Paul E. McKenney wrote:
Indeed, something like this does -not- carry a dependency from the
memory_order_consume load to q:

char *p, *q;

p = atomic_load_explicit(gp, memory_order_consume);
q = gq + (intptr_t)p - (intptr_t)p;

If this was compiled with -O0, ARM and Power might well carry a
dependency, but given any optimization, the assembly language would have
no hint of any such dependency.  So I am not seeing any particular 
danger.
   
   The above is a welcome relaxation over C11, since ARM doesn't even give
   you ordering based off false data dependencies. My concern is more to do
 with how this can be specified precisely without prohibiting honest compiler
   and hardware optimisations.
  
  That last is the challenge.  I believe that I am pretty close, but I am
  sure that additional adjustment will be required.  Especially given that
  we also need the memory model to be amenable to formal analysis.
 
 Well, there's still the whole thin-air problem which unfortunately doesn't
 go away with your proposal... (I was hoping that differentiating between
 true and false dependencies would solve that, but your set of rules isn't
 broad enough and I don't blame you at all for that!).
 
   Out of interest, how do you tackle examples (4) and (5) of (assuming the
   reads are promoted to consume loads)?:
   
 http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html
   
   my understanding is that you permit both outcomes (I appreciate you're
   not directly tackling out-of-thin-air, but treatment of dependencies
   is heavily related).
 
 Thanks for taking the time to walk these two examples through.

;-) ;-) ;-)

  Let's see...  #4 is as follows, given promotion to memory_order_consume
  and (I am guessing) memory_order_relaxed:
  
  r1 = atomic_load_explicit(x, memory_order_consume);
  if (r1 == 42)
atomic_store_explicit(y, r1, memory_order_relaxed);
  --
  r2 = atomic_load_explicit(y, memory_order_consume);
  if (r2 == 42)
atomic_store_explicit(x, 42, memory_order_relaxed);
  else
atomic_store_explicit(x, 42, memory_order_relaxed);
  
  The second thread does not have a proper control dependency, even with
  the memory_order_consume load because both branches assign the same
  value to x.  This means that the compiler is within its rights to
  optimize this into the following:
  
  r1 = atomic_load_explicit(x, memory_order_consume);
  if (r1 == 42)
atomic_store_explicit(y, r1, memory_order_relaxed);
  --
  r2 = atomic_load_explicit(y, memory_order_consume);
  atomic_store_explicit(x, 42, memory_order_relaxed);
  
  There is no dependency between the second thread's pair of statements,
  so both the compiler and the CPU are within their rights to optimize
  further as follows:
  
  r1 = atomic_load_explicit(x, memory_order_consume);
  if (r1 == 42)
atomic_store_explicit(y, r1, memory_order_relaxed);
  --
  atomic_store_explicit(x, 42, memory_order_relaxed);
  r2 = atomic_load_explicit(y, memory_order_consume);
  
  If the compiler makes this final optimization, even mythical SC hardware
  is within its rights to end up with (r1 == 42 && r2 == 42).  Which is
  fine, as far as I am concerned.  Or at least something that can be
  lived with.
 
 Agreed.
 
  On to #5:
  
  r1 = atomic_load_explicit(x, memory_order_consume);
  if (r1 == 42)
atomic_store_explicit(y, r1, memory_order_relaxed);
  
  r2 = atomic_load_explicit(y, memory_order_consume);
  if (r2 == 42)
atomic_store_explicit(x, 42, memory_order_relaxed);
  
  The first thread's accesses are dependency ordered.  The second thread's
  ordering is in a corner case that memory-barriers.txt does not cover.
  You are supposed to start control dependencies with READ_ONCE_CTRL(), not
  a memory_order_consume load (AKA rcu_dereference and friends).  However,
  Alpha would have a full barrier as part of the memory_order_consume load,
  and the rest of the processors would (one way or another) respect the
  control dependency.  And the compiler would have some fun trying to
  break it.
 
 But this is interesting because the first thread is ordered whilst the
 second is not, so doesn't that effectively forbid the compiler from
 constant-folding values if it can't prove that there is no dependency
 chain?

You lost me on this one.  Are you suggesting that the compiler
speculate the second thread's atomic store?  That would

Re: [c++std-parallel-1632] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-21 Thread Paul E. McKenney
On Thu, May 21, 2015 at 04:22:38PM +0200, Michael Matz wrote:
 Hi,
 
 On Wed, 20 May 2015, Paul E. McKenney wrote:
 
 I'm not sure... you'd require the compiler to perform static analysis of
 loops to determine the state of the machine when they exit (if they exit!)
 in order to show whether or not a dependency is carried to subsequent
 operations. If it can't prove otherwise, it would have to assume that a
 dependency *is* carried, and it's not clear to me how it would use this
 information to restrict any subsequent dependency removing optimisations.
   
   It'd just convert consume to acquire.
  
  It should not need to, actually.
 
 [with GCC hat, and having only lightly read your document]

Understood.

 Then you need to provide language or at least informal reasons why the 
 compiler is allowed to not do that.  Without that a compiler would have to 
 be conservative, if it can't _prove_ that a dependency chain is stopped, 
 then it has to assume it hasn't.
 
 For instance I can't really make out easily what your document says about 
 the following simple situation (well, actually I have difficulties to 
 differ between what you're proposing as the good-new model of this all, 
 and what you're merely describing as different current states of affair):

The point is -exactly- to codify the current state of affairs.  I expect
a follow-on effort to specify some sort of marking regimen, as noted in
the last paragraph of 7.9 and as discussed with Torvald Riegel.  However,
given that there are not yet any implementations or practical experience
with such markings, I suspect that some time will be required to hammer
out a good marking scheme.

   char * fancy_assign (char *in) { return in; }
   ...
   char *x, *y;
   
   x = atomic_load_explicit(p, memory_order_consume);
   y = fancy_assign (x);
   atomic_store_explicit(q, y, memory_order_relaxed);
 
 So, is there, or is there not a dependency carried from x to y in your 
 proposed model (and which rule in your document states so)?  Clearly, 
 without any other language the compiler would have to assume that there is 
 (because the equivalent 'y = x' assignment would carry the dependency).  

The dependency is not carried, though this is due to the current set
of rules not covering atomic loads and stores, which I need to fix.
Here is the sequence of events:

o   A memory_order_consume load heads a dependency chain.

o   Rule 2 says that if a value is part of a dependency chain and
is used as the right-hand side of an assignment operator,
the expression extends the chain to cover the assignment.
And I switched to numbered bullet items here:
http://www2.rdrop.com/users/paulmck/RCU/consume.2015.05.21a.pdf

o   Rule 14 says that if a value is part of a dependency chain and
is used as the actual parameter of a function call, then the
dependency chain extends to the corresponding formal parameter,
namely in of fancy_assign().

o   Rule 15 says that if a value is part of a dependency chain and
is returned from a function, then the dependency chain extends
to the returned value in the calling function.

o   And you are right.  I need to make the first and second rules
cover the relaxed atomic operations, or at least atomic loads and
stores.  Not that this is an issue for existing Linux-kernel code.

But given such a change, the new version of rule 2 would
extend the dependency chain to cover the atomic_store_explicit().

 If it has to assume this, then the whole model is not going to work very 
 well, as usual with models that assume a certain less-optimal fact 
 (carries-dep is less optimal for code generation purposes than 
 not-carries-dep) unless very specific circumstances say it can be 
 ignored.

Although that is a good general rule of thumb, I do not believe that it
applies to this situation, with the exception that I do indeed assume
that no one is insane enough to do value-speculation optimizations for
non-NULL values on loads from pointers.

So what am I missing here?  Do you have a specific example where the
compiler would need to suppress a production-quality optimization?

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-21 Thread Paul E. McKenney
On Thu, May 21, 2015 at 01:42:11PM -0700, Linus Torvalds wrote:
 On Thu, May 21, 2015 at 1:02 PM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
 
  The compiler can (and does) speculate non-atomic non-volatile writes
  in some cases, but I do not believe that it is permitted to speculate
  either volatile or atomic writes.
 
 I do *not* believe that a compiler is ever allowed to speculate *any*
 writes - volatile or not - unless the compiler can prove that the end
 result is either single-threaded, or the write in question is
 guaranteed to only be visible in that thread (ie local stack variable
 etc).
 
 Quite frankly, I'd be much happier if the C standard just said so outright.

So would I.  ;-)

The usual example is the following, where x is non-volatile and
non-atomic:

if (a)
x = 42;
else
x = 7;

The current rules allow this to be transformed as follows:

x = 7;
if (a)
x = 42;

So the variable x takes on the value 7 momentarily when it otherwise
would not have.

At least C11 now prohibits drive-by speculative writes, where the
compiler writes a larger size and then fixes things up afterwards.

 Also, I do think that the whole consume read should be explained
 better to compiler writers. Right now the language (including very
 much in the restricted dependency model) is described in very
 abstract terms. Yet those abstract terms are actually very subtle and
 complex, and very opaque to a compiler writer.
 
 If I was a compiler writer, I'd absolutely detest that definition.
 It's very far removed from my problem space as a compiler writer, and
 nothing in the language *explains* the odd and subtle abstract rules.
 It smells ad-hoc to me.
 
 Now, I actually understand the point of those odd and abstract rules,
 but to a compiler writer that doesn't understand the background, the
 whole section reads as "this is really painful for me to track all
 those dependencies and what kills them".
 
 So I would very much suggest that there would be language that
 *explains* this. Basically, tell the compiler writer:
 
  (a) the official rules are completely pointless, and make sense
 only because the standard is written for some random abstract
 machine that doesn't actually exist.
 
  (b) in *real life*, the whole and only point of the rules is to make
 sure that the compiler doesn't turn a data depenency into a control
 dependency, which on ARM and POWERPC does not honor causal memory
 ordering
 
  (c) on x86, since *all* memory accesses are causal, all the magical
 dependency rules are just pointless anyway, and what it really means
 is that you cannot re-order accesses with value speculation.
 
  (d) the *actual* relevant rule for a compiler writer is very simple:
 the compiler must not do value speculation on a consume load, and
 the abstract machine rules are written so that any other sane
 optimization is legal.
 
  (e) if the compiler writer really thinks they want to do value
 speculation, they have to turn the consume load into an acquire
 load. And you have to do that anyway on architectures like alpha that
 aren't causal even for data dependencies.

I am certainly more than willing to add this sort of wording to
the document.

 I personally think the whole abstract machine model of the C
 language is a mistake. It would be much better to talk about things in
 terms of actual code generation and actual issues. Make all the
 problems much more concrete, with actual examples of how memory
 ordering matters on different architectures.
 
 99% of all the problems with the whole consume memory ordering comes
 not from anything relevant to a compiler writer. All of it comes from
 trying to define the issue in the wrong terms.

I certainly cannot deny that there seems to be significant confusion
in the discussion thus far.

Thanx, Paul



Re: [c++std-parallel-1614] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 11:03:00AM +0200, Richard Biener wrote:
 On Wed, May 20, 2015 at 9:34 AM, Jens Maurer jens.mau...@gmx.net wrote:
  On 05/20/2015 04:34 AM, Paul E. McKenney wrote:
  On Tue, May 19, 2015 at 06:57:02PM -0700, Linus Torvalds wrote:
 
  - the "you can add/subtract integral values" still opens you up to
  language lawyers claiming (char *)ptr - (intptr_t)ptr preserving the
  dependency, which it clearly doesn't. But language-lawyering it does,
  since all those operations (cast to pointer, cast to integer,
  subtracting an integer) claim to be dependency-preserving operations.
 
  [...]
 
  There are some stranger examples, such as (char *)ptr - ((intptr_t)ptr)/7,
  but in that case, if the resulting pointer happens by chance to reference
  valid memory, I believe a dependency would still be carried.
  [...]
 
  From a language lawyer standpoint, pointer arithmetic is only valid
  within an array.  These examples seem to go beyond the bounds of the
  array and therefore have undefined behavior.
 
  C++ standard section 5.7 paragraph 4
  If both the pointer operand and the result point to elements of the
  same array object, or one past the last element of the array object,
  the evaluation shall not produce an overflow; otherwise, the behavior
  is undefined.
 
  C99 and C11
  identical phrasing in 6.5.6 paragraph 8
 
 Of course you can try to circumvent that by doing
 (char*)((intptr_t)ptr - (intptr_t)ptr + (intptr_t)ptr)
 (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65752 for extra fun).
 
 Which (IMHO) gets you into the standard language that only makes conversion of
 the exact same integer back to a pointer well-defined(?)

I am feeling good about leaving the restriction and calling out
the two paragraphs in a footnote, then.  ;-)

Thanx, Paul



Re: [c++std-parallel-1616] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 09:34:10AM +0200, Jens Maurer wrote:
 On 05/20/2015 04:34 AM, Paul E. McKenney wrote:
  On Tue, May 19, 2015 at 06:57:02PM -0700, Linus Torvalds wrote:
 
  - the "you can add/subtract integral values" still opens you up to
  language lawyers claiming (char *)ptr - (intptr_t)ptr preserving the
  dependency, which it clearly doesn't. But language-lawyering it does,
  since all those operations (cast to pointer, cast to integer,
  subtracting an integer) claim to be dependency-preserving operations.
 
 [...]
 
  There are some stranger examples, such as (char *)ptr - ((intptr_t)ptr)/7,
  but in that case, if the resulting pointer happens by chance to reference 
  valid memory, I believe a dependency would still be carried.
 [...]
 
 From a language lawyer standpoint, pointer arithmetic is only valid
 within an array.  These examples seem to go beyond the bounds of the
 array and therefore have undefined behavior.
 
 C++ standard section 5.7 paragraph 4
 If both the pointer operand and the result point to elements of the
 same array object, or one past the last element of the array object,
 the evaluation shall not produce an overflow; otherwise, the behavior
 is undefined.
 
 C99 and C11
 identical phrasing in 6.5.6 paragraph 8

Even better!  I added a footnote calling out these two paragraphs.

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 12:47:45PM +0100, Will Deacon wrote:
 Hi Paul,
 
 On Wed, May 20, 2015 at 03:41:48AM +0100, Paul E. McKenney wrote:
  On Tue, May 19, 2015 at 07:10:12PM -0700, Linus Torvalds wrote:
   On Tue, May 19, 2015 at 6:57 PM, Linus Torvalds
   torva...@linux-foundation.org wrote:
   So I think you're better off just saying that operations designed to
   drop significant bits break the dependency chain, and give things like
 >>1 and (char *)ptr-(uintptr_t)ptr as examples of such.
   
   Making that just an extension of your existing "& 0" language would
   seem to be natural.
  
  Works for me!  I added the following bullet to the list of things
  that break dependencies:
  
  If a pointer is part of a dependency chain, and if the values
  added to or subtracted from that pointer cancel the pointer
  value so as to allow the compiler to precisely determine the
  resulting value, then the resulting value will not be part of
  any dependency chain.  For example, if p is part of a dependency
  chain, then ((char *)p-(uintptr_t)p)+65536 will not be.
  
  Seem reasonable?
 
 Whilst I understand what you're saying (the ARM architecture makes these
 sorts of distinctions when calling out dependency-based ordering), it
 feels like we're dangerously close to defining the difference between a
 true and a false dependency. If we want to do this in the context of the
 C language specification, you run into issues because you need to evaluate
 the program in order to determine data values in order to determine the
 nature of the dependency.

Indeed, something like this does -not- carry a dependency from the
memory_order_consume load to q:

char *p, *q;

p = atomic_load_explicit(gp, memory_order_consume);
q = gq + (intptr_t)p - (intptr_t)p;

If this was compiled with -O0, ARM and Power might well carry a
dependency, but given any optimization, the assembly language would have
no hint of any such dependency.  So I am not seeing any particular danger.

 You tackle this above by saying to allow the compiler to precisely
 determine the resulting value, but I can't see how that can be cleanly
 fitted into something like the C language specification.

I am sure that there will be significant rework from where this document
is to language appropriate from the standard.  Which is why I am glad
that Jens is taking an interest in this, as he is particularly good at
producing standards language.

  Even if it can,
 then we'd need to reword the ?: treatment that you currently have:
 
   If a pointer is part of a dependency chain, and that pointer appears
in the entry of a ?: expression selected by the condition, then the
chain extends to the result.
 
 which I think requires the state of the condition to be known statically
 if we only want to extend the chain from the selected expression. In the
 general case, wouldn't a compiler have to assume that the chain is
 extended from both?

In practice, yes, if the compiler cannot determine which expression is
selected, it must arrange for the dependency to be carried from either,
depending on the run-time value of the condition.  But you would have
to work pretty hard to create code that did not carry the dependencies
as required, not?

 Additionally, what about the following code?
 
   char *x = y ? z : z;
 
 Does that extend a dependency chain from z to x? If so, I can imagine a
 CPU breaking that in practice.

I am not seeing this.  I would expect the compiler to optimize to
something like this:

char *x = z;

How does this avoid carrying the dependency?  Or are you saying that
ARM loses the dependency via a store to memory and a later reload?
That would be a bit surprising...

   Humans will understand, and compiler writers won't care. They will
   either depend on hardware semantics anyway (and argue that your
   language is tight enough that they don't need to do anything special)
   or they will turn the consume into an acquire (on platforms that have
   too weak hardware).
  
  Agreed.  Plus Core Working Group will hammer out the exact wording,
  should this approach meet their approval.
 
 For the avoidance of doubt, I'm completely behind any attempts to tackle
 this problem, but I anticipate an uphill struggle getting this text into
 the C standard. Is your intention to change the carries-a-dependency
 relation to encompass this change?

I completely agree that this won't be easy, but this is the task at hand.
And yes, the intent is to change carries-a-dependency, given that the
current wording isn't helping anything.  ;-)

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 02:18:37PM +0100, David Howells wrote:
 Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 
   Additionally, what about the following code?
   
 char *x = y ? z : z;
   
   Does that extend a dependency chain from z to x? If so, I can imagine a
   CPU breaking that in practice.
  
  I am not seeing this.  I would expect the compiler to optimize to
  something like this:
  
  char *x = z;
 
 Why?  What if y has a potential side-effect (say it makes a function call)?

I was thinking of y as a simple variable, but if it is something more
complex, then the compiler could do this, right?

char *x;

y;
x = z;

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 02:44:30PM +0100, Ramana Radhakrishnan wrote:
 
 
 On 20/05/15 14:37, David Howells wrote:
 Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 
 I was thinking of y as a simple variable, but if it is something more
 complex, then the compiler could do this, right?
 
 char *x;
 
 y;
 x = z;
 
 Yeah.  I presume it has to maintain the ordering, though.
 
 The scheduler for e.g. is free to reorder if it can prove there is
 no dependence (or indeed side-effects for y) between insns produced
 for y and `x = z'.

So for example, if y is independent of z, the compiler can do the
following:

char *x;

x = z;
y;

But the dependency ordering is still maintained from z to x, so this
is not a problem.

Or am I missing something subtle here?

Thanx, Paul



Re: [c++std-parallel-1624] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 02:37:05PM +0100, David Howells wrote:
 Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 
  I was thinking of y as a simple variable, but if it is something more
  complex, then the compiler could do this, right?
  
  char *x;
  
  y;
  x = z;
 
 Yeah.  I presume it has to maintain the ordering, though.

Agreed.  Unless of course y writes to x or some such.

Given that there is already code in the Linux kernel relying on
dependencies being carried through stores to local variables,
this should not be a problem.

Or am I missing something?

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 04:46:17PM +0100, Will Deacon wrote:
 On Wed, May 20, 2015 at 01:15:22PM +0100, Paul E. McKenney wrote:
  On Wed, May 20, 2015 at 12:47:45PM +0100, Will Deacon wrote:
   On Wed, May 20, 2015 at 03:41:48AM +0100, Paul E. McKenney wrote:
If a pointer is part of a dependency chain, and if the values
added to or subtracted from that pointer cancel the pointer
value so as to allow the compiler to precisely determine the
resulting value, then the resulting value will not be part of
any dependency chain.  For example, if p is part of a dependency
chain, then ((char *)p-(uintptr_t)p)+65536 will not be.

Seem reasonable?
   
   Whilst I understand what you're saying (the ARM architecture makes these
   sorts of distinctions when calling out dependency-based ordering), it
   feels like we're dangerously close to defining the difference between a
   true and a false dependency. If we want to do this in the context of the
   C language specification, you run into issues because you need to evaluate
   the program in order to determine data values in order to determine the
   nature of the dependency.
  
  Indeed, something like this does -not- carry a dependency from the
  memory_order_consume load to q:
  
  char *p, q;
  
  p = atomic_load_explicit(&gp, memory_order_consume);
  q = gq + (intptr_t)p - (intptr_t)p;
  
  If this was compiled with -O0, ARM and Power might well carry a
  dependency, but given any optimization, the assembly language would have
  no hint of any such dependency.  So I am not seeing any particular danger.
 
 The above is a welcome relaxation over C11, since ARM doesn't even give
 you ordering based off false data dependencies. My concern is more to do
 with how this can be specified precisely without prohibing honest compiler
 and hardware optimisations.

That last is the challenge.  I believe that I am pretty close, but I am
sure that additional adjustment will be required.  Especially given that
we also need the memory model to be amenable to formal analysis.

 Out of interest, how do you tackle examples (4) and (5) of (assuming the
 reads are promoted to consume loads)?:
 
   http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html
 
 my understanding is that you permit both outcomes (I appreciate you're
 not directly tackling out-of-thin-air, but treatment of dependencies
 is heavily related).

Let's see...  #4 is as follows, given promotion to memory_order_consume
and (I am guessing) memory_order_relaxed:

r1 = atomic_load_explicit(&x, memory_order_consume);
if (r1 == 42)
  atomic_store_explicit(&y, r1, memory_order_relaxed);
--
r2 = atomic_load_explicit(&y, memory_order_consume);
if (r2 == 42)
  atomic_store_explicit(&x, 42, memory_order_relaxed);
else
  atomic_store_explicit(&x, 42, memory_order_relaxed);

The second thread does not have a proper control dependency, even with
the memory_order_consume load because both branches assign the same
value to x.  This means that the compiler is within its rights to
optimize this into the following:

r1 = atomic_load_explicit(&x, memory_order_consume);
if (r1 == 42)
  atomic_store_explicit(&y, r1, memory_order_relaxed);
--
r2 = atomic_load_explicit(&y, memory_order_consume);
atomic_store_explicit(&x, 42, memory_order_relaxed);

There is no dependency between the second thread's pair of statements,
so both the compiler and the CPU are within their rights to optimize
further as follows:

r1 = atomic_load_explicit(&x, memory_order_consume);
if (r1 == 42)
  atomic_store_explicit(&y, r1, memory_order_relaxed);
--
atomic_store_explicit(&x, 42, memory_order_relaxed);
r2 = atomic_load_explicit(&y, memory_order_consume);

If the compiler makes this final optimization, even mythical SC hardware
is within its rights to end up with (r1 == 42 && r2 == 42).  Which is
fine, as far as I am concerned.  Or at least something that can be
lived with.

On to #5:

r1 = atomic_load_explicit(&x, memory_order_consume);
if (r1 == 42)
  atomic_store_explicit(&y, r1, memory_order_relaxed);

r2 = atomic_load_explicit(&y, memory_order_consume);
if (r2 == 42)
  atomic_store_explicit(&x, 42, memory_order_relaxed);

The first thread's accesses are dependency ordered.  The second thread's
ordering is in a corner case that memory-barriers.txt does not cover.
You are supposed to start control dependencies with READ_ONCE_CTRL(), not
a memory_order_consume load (AKA rcu_dereference and friends).  However,
Alpha would have a full barrier as part

Re: [c++std-parallel-1632] Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 04:54:51PM +0100, Andrew Haley wrote:
 On 05/20/2015 04:46 PM, Will Deacon wrote:
  I'm not sure... you'd require the compiler to perform static analysis of
  loops to determine the state of the machine when they exit (if they exit!)
  in order to show whether or not a dependency is carried to subsequent
  operations. If it can't prove otherwise, it would have to assume that a
  dependency *is* carried, and it's not clear to me how it would use this
  information to restrict any subsequent dependency removing optimisations.
 
 It'd just convert consume to acquire.

It should not need to, actually.

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-20 Thread Paul E. McKenney
On Wed, May 20, 2015 at 03:15:48PM +0100, Ramana Radhakrishnan wrote:
 
 
 On 20/05/15 15:03, Paul E. McKenney wrote:
 On Wed, May 20, 2015 at 02:44:30PM +0100, Ramana Radhakrishnan wrote:
 
 
 On 20/05/15 14:37, David Howells wrote:
 Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
 
 I was thinking of y as a simple variable, but if it is something more
 complex, then the compiler could do this, right?
 
   char *x;
 
   y;
   x = z;
 
 Yeah.  I presume it has to maintain the ordering, though.
 
 The scheduler for e.g. is free to reorder if it can prove there is
 no dependence (or indeed side-effects for y) between insns produced
 for y and `x = z'.
 
 So for example, if y is independent of z, the compiler can do the
 following:
 
  char *x;
 
  x = z;
  y;
 
 But the dependency ordering is still maintained from z to x, so this
 is not a problem.
 
 
 Well, any reads of x (assuming x was initialized elsewhere) would
 need to happen before x got assigned z.

Agreed, there needs to be a memory_order_consume load up there somewhere.
(AKA rcu_dereference().)

 I understood the original maintain the ordering as between the
 statements `x = z' and `y'.

Ah, I was assuming between x and z.  David, what was your intent?  ;-)

 Or am I missing something subtle here?
 
 No, it sounds like we are on the same page here.

Whew!  ;-)

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-19 Thread Paul E. McKenney
On Tue, May 19, 2015 at 06:57:02PM -0700, Linus Torvalds wrote:
 On Tue, May 19, 2015 at 5:55 PM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
 
  http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf
 
 From a very quick read-through, the restricted dependency chain in 7.9
 seems to be reasonable, and essentially covers thats' what hardware
 gives us anyway, making compiler writers happy.
 
 I would clarify the language somewhat:
 
  - it says that the result of a cast of a pointer is a dependency. You
 need to make an exception for casting to bool, methinks (because
 that's effectively just a test-against-NULL, which you later describe
 as terminating the dependency).
 
 Maybe get rid of the "any type" wording, and just limit it to casts to
 types of size intptr_t, ie ones that don't drop significant bits. All
 the other rules talk about [u]intptr_t anyway.

Excellent point!  I now say:

If a pointer is part of a dependency chain, then casting it
(either explicitly or implicitly) to any pointer-sized type
extends the chain to the result.

If this approach works out, the people in the Core Working Group will
come up with alternative language-lawyer-proof wording, but this informal
version will hopefully do for the moment.

  - you clarify that the trivial & 0 and | ~0 kill the dependency
 chain, but if you really want to be a stickler, you might want to
 extend it to a few more cases. Things like & 1 (to extract a tag
 from the low bit of a tagged pointer) should likely also drop the
 dependency, since a compiler would commonly end up using the end
 result as a conditional even if the code was written to then use
 casting etc to look like a dereference.

Ah, how about the following?

If a value of type intptr_t or uintptr_t is part of a dependency
chain, and if that value is one of the operands to an & or |
infix operator whose result has too few or too many bits set,
then the resulting value will not be part of any dependency
chain.  For example, on a 64-bit system, if p is part of a
dependency chain, then (p & 0x7) provides just the tag bits,
and normally cannot even be legally dereferenced.  Similarly,
(p | ~0) normally cannot be legally dereferenced.

  - the you can add/subtract integral values still opens you up to
 language lawyers claiming (char *)ptr - (intptr_t)ptr preserving the
 dependency, which it clearly doesn't. But language-lawyering it does,
 since all those operations (cast to pointer, cast to integer,
 subtracting an integer) claim to be dependency-preserving operations.

My thought was that the result of (char *)ptr - (intptr_t)ptr is a
NULL pointer in most environments, and dereferencing a NULL pointer
is undefined behavior.  So it becomes irrelevant whether or not the
NULL pointer carries a dependency.

There are some stranger examples, such as (char *)ptr - ((intptr_t)ptr)/7,
but in that case, if the resulting pointer happens by chance to reference 
valid memory, I believe a dependency would still be carried.  Of course,
if you are producing code like that, I am guessing that dependencies are
the very least of your concerns.

However, I will give this some more thought.

 So I think you want to limit the logical operators to things that
 don't mask off too many bits, and you should probably limit the
 add/subtract operations some way (maybe specify that the integer value
 you add/subtract cannot be related to the pointer). But I think
 limiting it to mostly pointer ops (and a _few_ integer operations to
 do offsets and remove tag bits) is otherwise a good approach.

Glad you mostly like it!  ;-)

Thanx, Paul



Re: Compilers and RCU readers: Once more unto the breach!

2015-05-19 Thread Paul E. McKenney
On Tue, May 19, 2015 at 07:10:12PM -0700, Linus Torvalds wrote:
 On Tue, May 19, 2015 at 6:57 PM, Linus Torvalds
 torva...@linux-foundation.org wrote:
 
   - the you can add/subtract integral values still opens you up to
  language lawyers claiming (char *)ptr - (intptr_t)ptr preserving the
  dependency, which it clearly doesn't. But language-lawyering it does,
  since all those operations (cast to pointer, cast to integer,
  subtracting an integer) claim to be dependency-preserving operations.
 
  So I think you want to limit the logical operators to things that
  don't mask off too many bits, and you should probably limit the
  add/subtract operations some way (maybe specify that the integer value
  you add/subtract cannot be related to the pointer).
 
 Actually, not related doesn't work. For some buddy allocator thing,
 you very much might want some of the bits to be related.

Good point, you could do the buddy-allocator computations with add
and subtract instead of exclusive OR.

 So I think you're better off just saying that operations designed to
 drop significant bits break the dependency chain, and give things like
 & 1 and (char *)ptr-(uintptr_t)ptr as examples of such.
 
 Making that just an extension of your existing & 0 language would
 seem to be natural.

Works for me!  I added the following bullet to the list of things
that break dependencies:

If a pointer is part of a dependency chain, and if the values
added to or subtracted from that pointer cancel the pointer
value so as to allow the compiler to precisely determine the
resulting value, then the resulting value will not be part of
any dependency chain.  For example, if p is part of a dependency
chain, then ((char *)p-(uintptr_t)p)+65536 will not be.

Seem reasonable?

 Humans will understand, and compiler writers won't care. They will
 either depend on hardware semantics anyway (and argue that your
 language is tight enough that they don't need to do anything special)
 or they will turn the consume into an acquire (on platforms that have
 too weak hardware).

Agreed.  Plus Core Working Group will hammer out the exact wording,
should this approach meet their approval.

Thanx, Paul



Compilers and RCU readers: Once more unto the breach!

2015-05-19 Thread Paul E. McKenney
Hello!

Following up on last year's discussion (https://lwn.net/Articles/586838/,
https://lwn.net/Articles/588300/), I believe that we have a solution.  If
I am wrong, I am sure you all will let me know, and in great detail.  ;-)

The key simplification is to just say no to RCU-protected array indexes:
https://lkml.org/lkml/2015/5/12/827, as was suggested by several people.
This simplification means that rcu_dereference (AKA memory_order_consume)
need only return pointers.  This in turn avoids things like (x-x),
(x*0), and (x%1) because if x is a pointer, these expressions either
return non-pointers or are compilation errors.  With a very few exceptions,
dependency chains can lead -to- non-pointers, but cannot pass -through-
them.

The result is that dependencies are carried only by operations for
which the compiler cannot easily optimize away those dependencies;
these operations include simple assignment, integer offset (including
indexing), dereferencing, casts, passing as a function argument, return
values from functions, and so on.  A complete list with commentary starts
on page 28 of:

http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf

Dependency chains are broken if a pointer compares equal to some other
pointer not part of the same dependency chain, if too many bits are ORed
onto or ANDed off of an intptr_t or uintptr_t, or if the dependency is
explicitly killed (which should now strictly speaking never be necessary,
but which might allow better diagnostics).  These are set out in more
detail on page 30 of the above PDF.

This covers all the uses in the Linux kernel that I am aware of without
any source-code changes (other than to the rcu_dereference() primitives
themselves) and should also work for compilers and standards.

Thoughts?

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-07 Thread Paul E. McKenney
On Fri, Mar 07, 2014 at 06:45:57PM +0100, Torvald Riegel wrote:
 xagsmtp5.20140307174618.3...@vmsdvm6.vnet.ibm.com
 X-Xagent-Gateway: vmsdvm6.vnet.ibm.com (XAGSMTP5 at VMSDVM6)
 
 On Wed, 2014-03-05 at 10:01 -0800, Paul E. McKenney wrote:
  On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote:
   
   On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote:
On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
 
 On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
  On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
   
   On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
+o  Do not use the results from the boolean && and || when
+   dereferencing.  For example, the following (rather improbable)
+   code is buggy:
+
+   int a[2];
+   int index;
+   int force_zero_index = 1;
+
+   ...
+
+   r1 = rcu_dereference(i1);
+   r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
+
+   The reason this is buggy is that && and || are often compiled
+   using branches.  While weak-memory machines such as ARM or PowerPC
+   do order stores after such branches, they can speculate loads,
+   which can result in misordering bugs.
+
+o  Do not use the results from relational operators (==, !=,
+   <, <=, >, or >=) when dereferencing.  For example,
+   the following (quite strange) code is buggy:
+
+   int a[2];
+   int index;
+   int flip_index = 0;
+
+   ...
+
+   r1 = rcu_dereference(i1);
+   r2 = a[r1 != flip_index];  /* BUGGY!!! */
+
+   As before, the reason this is buggy is that relational operators
+   are often compiled using branches.  And as before, although
+   weak-memory machines such as ARM or PowerPC do order stores
+   after such branches, they can speculate loads, which can again
+   result in misordering bugs.
   
   Those two would be allowed by the wording I have recently 
   proposed,
   AFAICS.  r1 != flip_index would result in two possible values 
   (unless
   there are further constraints due to the type of r1 and the 
   values that
   flip_index can have).
  
  And I am OK with the value_dep_preserving type providing more/better
  guarantees than we get by default from current compilers.
  
  One question, though.  Suppose that the code did not want a value
  dependency to be tracked through a comparison operator.  What does
  the developer do in that case?  (The reason I ask is that I have
  not yet found a use case in the Linux kernel that expects a value
  dependency to be tracked through a comparison.)
 
 Hmm.  I suppose use an explicit cast to non-vdp before or after the
 comparison?

That should work well assuming that things like if, while, and ?:
conditions are happy to take a vdp.
   
   I currently don't see a reason why that should be disallowed.  If we
   have allowed an implicit conversion to non-vdp, I believe that should
   follow.
  
  I am a bit nervous about a silent implicit conversion from vdp to
  non-vdp in the general case.
 
 Why are you nervous about it?

If someone expects the vdp to propagate into some function that might
be compiled with aggressive optimizations that break this expectation,
it would be good for that someone to know about it.

Ah!  I am assuming that the compiler is -not- emitting memory barriers
at vdp-to-non-vdp transitions.  In that case, warnings are even more
important -- without the warnings, it is a real pain chasing these
unnecessary memory barriers out of the code.

So we are -not- in the business of emitting memory barriers on
vdp-to-non-vdp transitions, right?

  However, when the result is being used by
  a conditional, the silent implicit conversion makes a lot of sense.
  Is that distinction something that the compiler can handle easily?
 
 I think so. I'm not a language lawyer, but we have other such
 conversions in the standard (e.g., int to boolean, between int and
 float) and I currently don't see a fundamental difference to those.  But
 we'll have to ask

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-07 Thread Paul E. McKenney
On Fri, Mar 07, 2014 at 07:33:25PM +0100, Torvald Riegel wrote:
 On Wed, 2014-03-05 at 10:15 -0800, Paul E. McKenney wrote:
  On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote:
   On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote:
On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
 On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
  
  On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
   On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:

On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
 +o  Do not use the results from the boolean && and || when
 +   dereferencing.  For example, the following (rather improbable)
 +   code is buggy:
 +
 +   int a[2];
 +   int index;
 +   int force_zero_index = 1;
 +
 +   ...
 +
 +   r1 = rcu_dereference(i1);
 +   r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
 +
 +   The reason this is buggy is that && and || are often compiled
 +   using branches.  While weak-memory machines such as ARM or PowerPC
 +   do order stores after such branches, they can speculate loads,
 +   which can result in misordering bugs.
 +
 +o  Do not use the results from relational operators (==, !=,
 +   <, <=, >, or >=) when dereferencing.  For example,
 +   the following (quite strange) code is buggy:
 +
 +   int a[2];
 +   int index;
 +   int flip_index = 0;
 +
 +   ...
 +
 +   r1 = rcu_dereference(i1);
 +   r2 = a[r1 != flip_index];  /* BUGGY!!! */
 +
 +   As before, the reason this is buggy is that relational operators
 +   are often compiled using branches.  And as before, although
 +   weak-memory machines such as ARM or PowerPC do order stores
 +   after such branches, they can speculate loads, which can again
 +   result in misordering bugs.

Those two would be allowed by the wording I have recently 
proposed,
AFAICS.  r1 != flip_index would result in two possible values 
(unless
there are further constraints due to the type of r1 and the 
values that
flip_index can have).
   
   And I am OK with the value_dep_preserving type providing 
   more/better
   guarantees than we get by default from current compilers.
   
   One question, though.  Suppose that the code did not want a value
   dependency to be tracked through a comparison operator.  What does
   the developer do in that case?  (The reason I ask is that I have
   not yet found a use case in the Linux kernel that expects a value
   dependency to be tracked through a comparison.)
  
  Hmm.  I suppose use an explicit cast to non-vdp before or after the
  comparison?
 
 That should work well assuming that things like if, while, and ?:
 conditions are happy to take a vdp.  This assumes that p->a only returns
 vdp if field a is declared vdp, otherwise we have vdps running wild
 through the program.  ;-)
 
 The other thing that can happen is that a vdp can get handed off to
 another synchronization mechanism, for example, to reference counting:
 
   p = atomic_load_explicit(&gp, memory_order_consume);
   if (do_something_with(p->a)) {
   /* fast path protected by RCU. */
   return 0;
   }
   if (atomic_inc_not_zero(&p->refcnt)) {
   /* slow path protected by reference counting. */
   return do_something_else_with((struct foo *)p);  /* CHANGE */
   }
   /* Needed slow path, but raced with deletion. */
   return -EAGAIN;
 
 I am guessing that the cast ends the vdp.  Is that the case?

And here is a more elaborate example from the Linux kernel:

struct md_rdev value_dep_preserving *rdev;  /* CHANGE */

rdev = rcu_dereference(conf->mirrors[disk].rdev);
if (r1_bio->bios[disk] == IO_BLOCKED
    || rdev == NULL
    || test_bit(Unmerged, &rdev->flags)
    || test_bit(Faulty, &rdev->flags))
        continue;

The fact that the rdev == NULL returns vdp does not force

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-05 Thread Paul E. McKenney
On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote:
 
 On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote:
  On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
   
   On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
 
 On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
   +o  Do not use the results from the boolean && and || when
   +   dereferencing.  For example, the following (rather improbable)
   +   code is buggy:
   +
   +   int a[2];
   +   int index;
   +   int force_zero_index = 1;
   +
   +   ...
   +
   +   r1 = rcu_dereference(i1);
   +   r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
   +
   +   The reason this is buggy is that && and || are often compiled
   +   using branches.  While weak-memory machines such as ARM or PowerPC
   +   do order stores after such branches, they can speculate loads,
   +   which can result in misordering bugs.
   +
   +o  Do not use the results from relational operators (==, !=,
   +   <, <=, >, or >=) when dereferencing.  For example,
   +   the following (quite strange) code is buggy:
   +
   +   int a[2];
   +   int index;
   +   int flip_index = 0;
   +
   +   ...
   +
   +   r1 = rcu_dereference(i1);
   +   r2 = a[r1 != flip_index];  /* BUGGY!!! */
   +
   +   As before, the reason this is buggy is that relational operators
   +   are often compiled using branches.  And as before, although
   +   weak-memory machines such as ARM or PowerPC do order stores
   +   after such branches, they can speculate loads, which can again
   +   result in misordering bugs.
 
 Those two would be allowed by the wording I have recently proposed,
 AFAICS.  r1 != flip_index would result in two possible values (unless
 there are further constraints due to the type of r1 and the values 
 that
 flip_index can have).

And I am OK with the value_dep_preserving type providing more/better
guarantees than we get by default from current compilers.

One question, though.  Suppose that the code did not want a value
dependency to be tracked through a comparison operator.  What does
the developer do in that case?  (The reason I ask is that I have
not yet found a use case in the Linux kernel that expects a value
dependency to be tracked through a comparison.)
   
   Hmm.  I suppose use an explicit cast to non-vdp before or after the
   comparison?
  
  That should work well assuming that things like if, while, and ?:
  conditions are happy to take a vdp.
 
 I currently don't see a reason why that should be disallowed.  If we
 have allowed an implicit conversion to non-vdp, I believe that should
 follow.

I am a bit nervous about a silent implicit conversion from vdp to
non-vdp in the general case.  However, when the result is being used by
a conditional, the silent implicit conversion makes a lot of sense.
Is that distinction something that the compiler can handle easily?

On the other hand, silent implicit conversion from non-vdp to vdp
is very useful for common code that can be invoked both by RCU
readers and by updaters.

  ?: could be somewhat special, in that the type depends on the
 2nd and 3rd operand.  Thus, vdp x = non-vdp ? vdp : vdp; should be
 allowed, whereas vdp x = non-vdp ? non-vdp : vdp; probably should be
 disallowed if we don't provide for implicit casts from non-vdp to vdp.

Actually, from the Linux-kernel code that I am seeing, we want to be able
to silently convert from non-vdp to vdp in order to permit common code
that is invoked from both RCU readers (vdp) and updaters (often non-vdp).
This common code must be compiled conservatively to allow vdp, but should
be just fine with non-vdp.

Going through the combinations...

 0. vdp x = vdp ? vdp : vdp; /* OK, matches. */
 1. vdp x = vdp ? vdp : non-vdp; /* Silent conversion. */
 2. vdp x = vdp ? non-vdp : vdp; /* Silent conversion. */
 3. vdp x = vdp ? non-vdp : non-vdp; /* Silent conversion. */
 4. vdp x = non-vdp ? vdp : vdp; /* OK, matches. */
 5. vdp x = non-vdp ? vdp : non-vdp; /* Silent conversion. */
 6. vdp x = non-vdp ? non-vdp : vdp; /* Silent conversion. */
 7. vdp x = non-vdp ? non-vdp : non-vdp; /* Silent conversion. */
 8. non-vdp x = vdp ? vdp : vdp

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-05 Thread Paul E. McKenney
On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote:
 On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote:
  On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
   On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:

On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
 On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
  
  On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
    +o  Do not use the results from the boolean && and || when
    +   dereferencing.  For example, the following (rather improbable)
    +   code is buggy:
    +
    +   int a[2];
    +   int index;
    +   int force_zero_index = 1;
    +
    +   ...
    +
    +   r1 = rcu_dereference(i1);
    +   r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
    +
    +   The reason this is buggy is that && and || are often compiled
    +   using branches.  While weak-memory machines such as ARM or PowerPC
    +   do order stores after such branches, they can speculate loads,
    +   which can result in misordering bugs.
    +
    +o  Do not use the results from relational operators (==, !=,
    +   <, <=, >, or >=) when dereferencing.  For example,
    +   the following (quite strange) code is buggy:
    +
    +   int a[2];
    +   int index;
    +   int flip_index = 0;
    +
    +   ...
    +
    +   r1 = rcu_dereference(i1);
    +   r2 = a[r1 != flip_index];  /* BUGGY!!! */
    +
    +   As before, the reason this is buggy is that relational operators
    +   are often compiled using branches.  And as before, although
    +   weak-memory machines such as ARM or PowerPC do order stores
    +   after such branches, they can speculate loads, which can again
    +   result in misordering bugs.
  
  Those two would be allowed by the wording I have recently proposed,
  AFAICS.  r1 != flip_index would result in two possible values 
  (unless
  there are further constraints due to the type of r1 and the values 
  that
  flip_index can have).
 
 And I am OK with the value_dep_preserving type providing more/better
 guarantees than we get by default from current compilers.
 
 One question, though.  Suppose that the code did not want a value
 dependency to be tracked through a comparison operator.  What does
 the developer do in that case?  (The reason I ask is that I have
 not yet found a use case in the Linux kernel that expects a value
 dependency to be tracked through a comparison.)

Hmm.  I suppose use an explicit cast to non-vdp before or after the
comparison?
   
    That should work well assuming that things like if, while, and ?:
    conditions are happy to take a vdp.  This assumes that p->a only returns
    vdp if field a is declared vdp, otherwise we have vdps running wild
    through the program.  ;-)
   
   The other thing that can happen is that a vdp can get handed off to
   another synchronization mechanism, for example, to reference counting:
   
  	p = atomic_load_explicit(&gp, memory_order_consume);
  	if (do_something_with(p->a)) {
  		/* fast path protected by RCU. */
  		return 0;
  	}
  	if (atomic_inc_not_zero(&p->refcnt)) {
  		/* slow path protected by reference counting. */
  		return do_something_else_with((struct foo *)p);  /* CHANGE */
  	}
  	/* Needed slow path, but raced with deletion. */
  	return -EAGAIN;
 /* Needed slow path, but raced with deletion. */
 return -EAGAIN;
   
   I am guessing that the cast ends the vdp.  Is that the case?
  
  And here is a more elaborate example from the Linux kernel:
  
   struct md_rdev value_dep_preserving *rdev;  /* CHANGE */
   
   rdev = rcu_dereference(conf->mirrors[disk].rdev);
   if (r1_bio->bios[disk] == IO_BLOCKED
       || rdev == NULL
       || test_bit(Unmerged, &rdev->flags)
       || test_bit(Faulty, &rdev->flags))
   	continue;
  
  The fact that the rdev == NULL returns vdp does not force the ||
  operators to be evaluated arithmetically because the entire function
  is an if condition, correct?
 
 That's a good question, and one that as far as I understand currently,
 essentially boils down to whether we want to have tight restrictions on
 which operations are still vdp.
 
 If we look at the different combinations, then it seems we can't decide
 on whether we have a value-dependency just due to a vdp type:
 * non-vdp || vdp:  vdp iff non-vdp == false
 * vdp || non-vdp:  vdp iff non-vdp == false?
 * vdp || vdp: always vdp? (and dependency on both?)
 
 I'm not sure it makes sense to try to not make

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-04 Thread Paul E. McKenney
On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
 
 On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
  On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
   
   On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
+o  Do not use the results from the boolean && and || when
+   dereferencing.  For example, the following (rather improbable)
+   code is buggy:
+
+   int a[2];
+   int index;
+   int force_zero_index = 1;
+
+   ...
+
+   r1 = rcu_dereference(i1);
+   r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
+
+   The reason this is buggy is that && and || are often compiled
+   using branches.  While weak-memory machines such as ARM or PowerPC
+   do order stores after such branches, they can speculate loads,
+   which can result in misordering bugs.
+
+o  Do not use the results from relational operators (==, !=,
+   <, <=, >, or >=) when dereferencing.  For example,
+   the following (quite strange) code is buggy:
+
+   int a[2];
+   int index;
+   int flip_index = 0;
+
+   ...
+
+   r1 = rcu_dereference(i1);
+   r2 = a[r1 != flip_index];  /* BUGGY!!! */
+
+   As before, the reason this is buggy is that relational operators
+   are often compiled using branches.  And as before, although
+   weak-memory machines such as ARM or PowerPC do order stores
+   after such branches, they can speculate loads, which can again
+   result in misordering bugs.
   
   Those two would be allowed by the wording I have recently proposed,
   AFAICS.  r1 != flip_index would result in two possible values (unless
   there are further constraints due to the type of r1 and the values that
   flip_index can have).
  
  And I am OK with the value_dep_preserving type providing more/better
  guarantees than we get by default from current compilers.
  
  One question, though.  Suppose that the code did not want a value
  dependency to be tracked through a comparison operator.  What does
  the developer do in that case?  (The reason I ask is that I have
  not yet found a use case in the Linux kernel that expects a value
  dependency to be tracked through a comparison.)
 
 Hmm.  I suppose use an explicit cast to non-vdp before or after the
 comparison?

That should work well assuming that things like if, while, and ?:
conditions are happy to take a vdp.  This assumes that p->a only returns
vdp if field a is declared vdp, otherwise we have vdps running wild
through the program.  ;-)

The other thing that can happen is that a vdp can get handed off to
another synchronization mechanism, for example, to reference counting:

	p = atomic_load_explicit(&gp, memory_order_consume);
	if (do_something_with(p->a)) {
		/* fast path protected by RCU. */
		return 0;
	}
	if (atomic_inc_not_zero(&p->refcnt)) {
		/* slow path protected by reference counting. */
		return do_something_else_with((struct foo *)p);  /* CHANGE */
	}
	/* Needed slow path, but raced with deletion. */
	return -EAGAIN;

I am guessing that the cast ends the vdp.  Is that the case?

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-04 Thread Paul E. McKenney
On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote:
 On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote:
  
  On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote:
   On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:

On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
 +o	Do not use the results from the boolean && and || when
 +	dereferencing.  For example, the following (rather improbable)
 +	code is buggy:
 +
 +		int a[2];
 +		int index;
 +		int force_zero_index = 1;
 +
 +		...
 +
 +		r1 = rcu_dereference(i1);
 +		r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
 +
 +	The reason this is buggy is that && and || are often compiled
 +	using branches.  While weak-memory machines such as ARM or PowerPC
 +	do order stores after such branches, they can speculate loads,
 +	which can result in misordering bugs.
 +
 +o	Do not use the results from relational operators (==, !=,
 +	<, <=, >, or >=) when dereferencing.  For example,
 +	the following (quite strange) code is buggy:
 +
 +		int a[2];
 +		int index;
 +		int flip_index = 0;
 +
 +		...
 +
 +		r1 = rcu_dereference(i1);
 +		r2 = a[r1 != flip_index];  /* BUGGY!!! */
 +
 +	As before, the reason this is buggy is that relational operators
 +	are often compiled using branches.  And as before, although
 +	weak-memory machines such as ARM or PowerPC do order stores
 +	after such branches, they can speculate loads, which can again
 +	result in misordering bugs.

Those two would be allowed by the wording I have recently proposed,
AFAICS.  r1 != flip_index would result in two possible values (unless
there are further constraints due to the type of r1 and the values that
flip_index can have).
   
   And I am OK with the value_dep_preserving type providing more/better
   guarantees than we get by default from current compilers.
   
   One question, though.  Suppose that the code did not want a value
   dependency to be tracked through a comparison operator.  What does
   the developer do in that case?  (The reason I ask is that I have
   not yet found a use case in the Linux kernel that expects a value
   dependency to be tracked through a comparison.)
  
  Hmm.  I suppose use an explicit cast to non-vdp before or after the
  comparison?
 
  That should work well assuming that things like if, while, and ?:
  conditions are happy to take a vdp.  This assumes that p->a only returns
  vdp if field a is declared vdp, otherwise we have vdps running wild
  through the program.  ;-)
  
  The other thing that can happen is that a vdp can get handed off to
  another synchronization mechanism, for example, to reference counting:
  
  	p = atomic_load_explicit(&gp, memory_order_consume);
  	if (do_something_with(p->a)) {
  		/* fast path protected by RCU. */
  		return 0;
  	}
  	if (atomic_inc_not_zero(&p->refcnt)) {
  		/* slow path protected by reference counting. */
  		return do_something_else_with((struct foo *)p);  /* CHANGE */
  	}
  	/* Needed slow path, but raced with deletion. */
  	return -EAGAIN;
 
 I am guessing that the cast ends the vdp.  Is that the case?

And here is a more elaborate example from the Linux kernel:

struct md_rdev value_dep_preserving *rdev;  /* CHANGE */

rdev = rcu_dereference(conf->mirrors[disk].rdev);
if (r1_bio->bios[disk] == IO_BLOCKED
    || rdev == NULL
    || test_bit(Unmerged, &rdev->flags)
    || test_bit(Faulty, &rdev->flags))
	continue;

The fact that the rdev == NULL returns vdp does not force the ||
operators to be evaluated arithmetically because the entire function
is an if condition, correct?

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-03 Thread Paul E. McKenney
On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote:
 
 On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote:
 +o  Do not use the results from the boolean && and || when
 +   dereferencing.  For example, the following (rather improbable)
 +   code is buggy:
 +
 +   int a[2];
 +   int index;
 +   int force_zero_index = 1;
 +
 +   ...
 +
 +   r1 = rcu_dereference(i1);
 +   r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
 +
 +   The reason this is buggy is that && and || are often compiled
 +   using branches.  While weak-memory machines such as ARM or PowerPC
 +   do order stores after such branches, they can speculate loads,
 +   which can result in misordering bugs.
 +
 +o  Do not use the results from relational operators (==, !=,
 +   <, <=, >, or >=) when dereferencing.  For example,
 +   the following (quite strange) code is buggy:
 +
 +   int a[2];
 +   int index;
 +   int flip_index = 0;
 +
 +   ...
 +
 +   r1 = rcu_dereference(i1);
 +   r2 = a[r1 != flip_index];  /* BUGGY!!! */
 +
 +   As before, the reason this is buggy is that relational operators
 +   are often compiled using branches.  And as before, although
 +   weak-memory machines such as ARM or PowerPC do order stores
 +   after such branches, they can speculate loads, which can again
 +   result in misordering bugs.
 
 Those two would be allowed by the wording I have recently proposed,
 AFAICS.  r1 != flip_index would result in two possible values (unless
 there are further constraints due to the type of r1 and the values that
 flip_index can have).

And I am OK with the value_dep_preserving type providing more/better
guarantees than we get by default from current compilers.

One question, though.  Suppose that the code did not want a value
dependency to be tracked through a comparison operator.  What does
the developer do in that case?  (The reason I ask is that I have
not yet found a use case in the Linux kernel that expects a value
dependency to be tracked through a comparison.)

 I don't think the wording is flawed.  We could raise the requirement of
 having more than one value left for r1 to having more than N with N > 1
 values left, but the fundamental problem remains in that a compiler
 could try to generate a (big) switch statement.
 
 Instead, I think that this indicates that the value_dep_preserving type
 modifier would be useful: It would tell the compiler that it shouldn't
 transform this into a branch in this case, yet allow that optimization
 for all other code.

Understood!

BTW, my current task is generating examples using the value_dep_preserving
type for RCU-protected array indexes.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-02 Thread Paul E. McKenney
On Sun, Mar 02, 2014 at 04:05:52AM -0600, Peter Sewell wrote:
 On 1 March 2014 08:03, Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
  On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
  Hi Paul,
 
  On 28 February 2014 18:50, Paul E. McKenney paul...@linux.vnet.ibm.com 
  wrote:
   On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
   On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
paul...@linux.vnet.ibm.com wrote:

 3.  The comparison was against another RCU-protected pointer,
 where that other pointer was properly fetched using one
 of the RCU primitives.  Here it doesn't matter which pointer
 you use.  At least as long as the rcu_assign_pointer() for
 that other pointer happened after the last update to the
 pointed-to structure.

 I am a bit nervous about #3.  Any thoughts on it?
   
I think that it might be worth pointing out as an example, and saying
that code like
   
   p = atomic_read(consume);
   X;
   q = atomic_read(consume);
   Y;
   if (p == q)
 data = p->val;
   
then the access of p-val is constrained to be data-dependent on
*either* p or q, but you can't really tell which, since the compiler
can decide that the values are interchangeable.
   
I cannot for the life of me come up with a situation where this would
matter, though. If X contains a fence, then that fence will be a
stronger ordering than anything the consume through p would
guarantee anyway. And if X does *not* contain a fence, then the
atomic reads of p and q are unordered *anyway*, so then whether the
ordering to the access through p is through p or q is kind of
irrelevant. No?
  
   I can make a contrived litmus test for it, but you are right, the only
   time you can see it happen is when X has no barriers, in which case
   you don't have any ordering anyway -- both the compiler and the CPU can
   reorder the loads into p and q, and the read from p-val can, as you 
   say,
   come from either pointer.
  
    For whatever it is worth, here is the litmus test:
   
    T1:	p = kmalloc(...);
    	if (p == NULL)
    		deal_with_it();
    	p->a = 42;  /* Each field in its own cache line. */
    	p->b = 43;
    	p->c = 44;
    	atomic_store_explicit(&gp1, p, memory_order_release);
    	p->b = 143;
    	p->c = 144;
    	atomic_store_explicit(&gp2, p, memory_order_release);
   
    T2:	p = atomic_load_explicit(&gp2, memory_order_consume);
    	r1 = p->b;  /* Guaranteed to get 143. */
    	q = atomic_load_explicit(&gp1, memory_order_consume);
    	if (p == q) {
    		/* The compiler decides that q->c is same as p->c. */
    		r2 = p->c; /* Could get 44 on a weakly ordered system. */
    	}
  
   The loads from gp1 and gp2 are, as you say, unordered, so you get what
   you get.
  
   And publishing a structure via one RCU-protected pointer, updating it,
   then publishing it via another pointer seems to me to be asking for
   trouble anyway.  If you really want to do something like that and still
   see consistency across all the fields in the structure, please put a 
   lock
   in the structure and use it to guard updates and accesses to those 
   fields.
  
   And here is a patch documenting the restrictions for the current Linux
   kernel.  The rules change a bit due to rcu_dereference() acting a bit
   differently than atomic_load_explicit(p, memory_order_consume).
  
   Thoughts?
 
  That might serve as informal documentation for linux kernel
  programmers about the bounds on the optimisations that you expect
  compilers to do for common-case RCU code - and I guess that's what you
  intend it to be for.   But I don't see how one can make it precise
  enough to serve as a language definition, so that compiler people
  could confidently say yes, we respect that, which I guess is what
  you really need.  As a useful criterion, we should aim for something
  precise enough that in a verified-compiler context you can
  mathematically prove that the compiler will satisfy it  (even though
  that won't happen anytime soon for GCC), and that analysis tool
  authors can actually know what they're working with.   All this stuff   
  about you should avoid cancellation, and avoid masking with just a
  small number of bits is just too vague.
 
  Understood, and yes, this is intended to document current compiler
  behavior for the Linux kernel community.  It would not make sense to show
  it to the C11 or C++11 communities, except perhaps as an informational
  piece on current practice.
 
  The basic problem is that the compiler may be doing sophisticated
  reasoning with a bunch of non-local knowledge that it's deduced from
  the code, neither of which are well-understood, and here we have to
  identify some envelope

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-02 Thread Paul E. McKenney
On Sun, Mar 02, 2014 at 11:44:52PM +, Peter Sewell wrote:
 On 2 March 2014 23:20, Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
  On Sun, Mar 02, 2014 at 04:05:52AM -0600, Peter Sewell wrote:
  On 1 March 2014 08:03, Paul E. McKenney paul...@linux.vnet.ibm.com wrote:
   On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
   Hi Paul,
  
   On 28 February 2014 18:50, Paul E. McKenney 
   paul...@linux.vnet.ibm.com wrote:
On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
 On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
 
  3.  The comparison was against another RCU-protected pointer,
  where that other pointer was properly fetched using one
  of the RCU primitives.  Here it doesn't matter which 
  pointer
  you use.  At least as long as the rcu_assign_pointer() 
  for
  that other pointer happened after the last update to the
  pointed-to structure.
 
  I am a bit nervous about #3.  Any thoughts on it?

 I think that it might be worth pointing out as an example, and 
 saying
 that code like

p = atomic_read(consume);
X;
q = atomic_read(consume);
Y;
if (p == q)
 data = p->val;

 then the access of p-val is constrained to be data-dependent on
 *either* p or q, but you can't really tell which, since the 
 compiler
 can decide that the values are interchangeable.

 I cannot for the life of me come up with a situation where this 
 would
 matter, though. If X contains a fence, then that fence will be a
 stronger ordering than anything the consume through p would
 guarantee anyway. And if X does *not* contain a fence, then the
 atomic reads of p and q are unordered *anyway*, so then whether the
 ordering to the access through p is through p or q is kind of
 irrelevant. No?
   
I can make a contrived litmus test for it, but you are right, the 
only
time you can see it happen is when X has no barriers, in which case
you don't have any ordering anyway -- both the compiler and the CPU 
can
reorder the loads into p and q, and the read from p-val can, as you 
say,
come from either pointer.
   
 For whatever it is worth, here is the litmus test:

 T1:	p = kmalloc(...);
 	if (p == NULL)
 		deal_with_it();
 	p->a = 42;  /* Each field in its own cache line. */
 	p->b = 43;
 	p->c = 44;
 	atomic_store_explicit(&gp1, p, memory_order_release);
 	p->b = 143;
 	p->c = 144;
 	atomic_store_explicit(&gp2, p, memory_order_release);

 T2:	p = atomic_load_explicit(&gp2, memory_order_consume);
 	r1 = p->b;  /* Guaranteed to get 143. */
 	q = atomic_load_explicit(&gp1, memory_order_consume);
 	if (p == q) {
 		/* The compiler decides that q->c is same as p->c. */
 		r2 = p->c; /* Could get 44 on a weakly ordered system. */
 	}
   
The loads from gp1 and gp2 are, as you say, unordered, so you get 
what
you get.
   
And publishing a structure via one RCU-protected pointer, updating 
it,
then publishing it via another pointer seems to me to be asking for
trouble anyway.  If you really want to do something like that and 
still
see consistency across all the fields in the structure, please put a 
lock
in the structure and use it to guard updates and accesses to those 
fields.
   
And here is a patch documenting the restrictions for the current Linux
kernel.  The rules change a bit due to rcu_dereference() acting a bit
differently than atomic_load_explicit(p, memory_order_consume).
   
Thoughts?
  
   That might serve as informal documentation for linux kernel
   programmers about the bounds on the optimisations that you expect
   compilers to do for common-case RCU code - and I guess that's what you
   intend it to be for.   But I don't see how one can make it precise
   enough to serve as a language definition, so that compiler people
   could confidently say yes, we respect that, which I guess is what
   you really need.  As a useful criterion, we should aim for something
   precise enough that in a verified-compiler context you can
   mathematically prove that the compiler will satisfy it  (even though
   that won't happen anytime soon for GCC), and that analysis tool
   authors can actually know what they're working with.   All this stuff  
about you should avoid cancellation, and avoid masking with just a
   small number of bits is just too vague.
  
   Understood, and yes, this is intended to document current compiler
   behavior for the Linux kernel community.  It would not make sense to show
   it to the C11 or C++11 communities

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-03-01 Thread Paul E. McKenney
On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote:
 Hi Paul,
 
 On 28 February 2014 18:50, Paul E. McKenney paul...@linux.vnet.ibm.com 
 wrote:
  On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
  On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
   On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
   paul...@linux.vnet.ibm.com wrote:
   
3.  The comparison was against another RCU-protected pointer,
where that other pointer was properly fetched using one
of the RCU primitives.  Here it doesn't matter which pointer
you use.  At least as long as the rcu_assign_pointer() for
that other pointer happened after the last update to the
pointed-to structure.
   
I am a bit nervous about #3.  Any thoughts on it?
  
   I think that it might be worth pointing out as an example, and saying
   that code like
  
  p = atomic_read(consume);
  X;
  q = atomic_read(consume);
  Y;
  if (p == q)
 data = p->val;
  
   then the access of p-val is constrained to be data-dependent on
   *either* p or q, but you can't really tell which, since the compiler
   can decide that the values are interchangeable.
  
   I cannot for the life of me come up with a situation where this would
   matter, though. If X contains a fence, then that fence will be a
   stronger ordering than anything the consume through p would
   guarantee anyway. And if X does *not* contain a fence, then the
   atomic reads of p and q are unordered *anyway*, so then whether the
   ordering to the access through p is through p or q is kind of
   irrelevant. No?
 
  I can make a contrived litmus test for it, but you are right, the only
  time you can see it happen is when X has no barriers, in which case
  you don't have any ordering anyway -- both the compiler and the CPU can
  reorder the loads into p and q, and the read from p-val can, as you say,
  come from either pointer.
 
  For whatever it is worth, here is the litmus test:
 
  T1:	p = kmalloc(...);
  	if (p == NULL)
  		deal_with_it();
  	p->a = 42;  /* Each field in its own cache line. */
  	p->b = 43;
  	p->c = 44;
  	atomic_store_explicit(&gp1, p, memory_order_release);
  	p->b = 143;
  	p->c = 144;
  	atomic_store_explicit(&gp2, p, memory_order_release);
 
  T2:	p = atomic_load_explicit(&gp2, memory_order_consume);
  	r1 = p->b;  /* Guaranteed to get 143. */
  	q = atomic_load_explicit(&gp1, memory_order_consume);
  	if (p == q) {
  		/* The compiler decides that q->c is same as p->c. */
  		r2 = p->c; /* Could get 44 on a weakly ordered system. */
  	}
 
  The loads from gp1 and gp2 are, as you say, unordered, so you get what
  you get.
 
  And publishing a structure via one RCU-protected pointer, updating it,
  then publishing it via another pointer seems to me to be asking for
  trouble anyway.  If you really want to do something like that and still
  see consistency across all the fields in the structure, please put a lock
  in the structure and use it to guard updates and accesses to those fields.
 
  And here is a patch documenting the restrictions for the current Linux
  kernel.  The rules change a bit due to rcu_dereference() acting a bit
  differently than atomic_load_explicit(p, memory_order_consume).
 
  Thoughts?
 
 That might serve as informal documentation for linux kernel
 programmers about the bounds on the optimisations that you expect
 compilers to do for common-case RCU code - and I guess that's what you
 intend it to be for.   But I don't see how one can make it precise
 enough to serve as a language definition, so that compiler people
 could confidently say yes, we respect that, which I guess is what
 you really need.  As a useful criterion, we should aim for something
 precise enough that in a verified-compiler context you can
 mathematically prove that the compiler will satisfy it  (even though
 that won't happen anytime soon for GCC), and that analysis tool
 authors can actually know what they're working with.   All this stuff
 about you should avoid cancellation, and avoid masking with just a
 small number of bits is just too vague.

Understood, and yes, this is intended to document current compiler
behavior for the Linux kernel community.  It would not make sense to show
it to the C11 or C++11 communities, except perhaps as an informational
piece on current practice.

 The basic problem is that the compiler may be doing sophisticated
 reasoning with a bunch of non-local knowledge that it's deduced from
 the code, neither of which are well-understood, and here we have to
 identify some envelope, expressive enough for RCU idioms, in which
 that reasoning doesn't allow data/address dependencies to be removed
 (and hence the hardware guarantee about them will be maintained at the
 source level).
 
 The C11 syntactic notion of dependency, whatever its faults

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-28 Thread Paul E. McKenney
On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
 On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
  On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
  paul...@linux.vnet.ibm.com wrote:
  
   3.  The comparison was against another RCU-protected pointer,
   where that other pointer was properly fetched using one
   of the RCU primitives.  Here it doesn't matter which pointer
   you use.  At least as long as the rcu_assign_pointer() for
   that other pointer happened after the last update to the
   pointed-to structure.
  
   I am a bit nervous about #3.  Any thoughts on it?
  
  I think that it might be worth pointing out as an example, and saying
  that code like
  
 p = atomic_read(consume);
 X;
 q = atomic_read(consume);
 Y;
 if (p == q)
 data = p->val;
  
  then the access of p-val is constrained to be data-dependent on
  *either* p or q, but you can't really tell which, since the compiler
  can decide that the values are interchangeable.
  
  I cannot for the life of me come up with a situation where this would
  matter, though. If X contains a fence, then that fence will be a
  stronger ordering than anything the consume through p would
  guarantee anyway. And if X does *not* contain a fence, then the
  atomic reads of p and q are unordered *anyway*, so then whether the
  ordering to the access through p is through p or q is kind of
  irrelevant. No?
 
 I can make a contrived litmus test for it, but you are right, the only
 time you can see it happen is when X has no barriers, in which case
 you don't have any ordering anyway -- both the compiler and the CPU can
 reorder the loads into p and q, and the read from p-val can, as you say,
 come from either pointer.
 
 For whatever it is worth, here is the litmus test:
 
 T1:	p = kmalloc(...);
 	if (p == NULL)
 		deal_with_it();
 	p->a = 42;  /* Each field in its own cache line. */
 	p->b = 43;
 	p->c = 44;
 	atomic_store_explicit(&gp1, p, memory_order_release);
 	p->b = 143;
 	p->c = 144;
 	atomic_store_explicit(&gp2, p, memory_order_release);
 
 T2:	p = atomic_load_explicit(&gp2, memory_order_consume);
 	r1 = p->b;  /* Guaranteed to get 143. */
 	q = atomic_load_explicit(&gp1, memory_order_consume);
 	if (p == q) {
 		/* The compiler decides that q->c is same as p->c. */
 		r2 = p->c; /* Could get 44 on a weakly ordered system. */
 	}
 
 The loads from gp1 and gp2 are, as you say, unordered, so you get what
 you get.
 
 And publishing a structure via one RCU-protected pointer, updating it,
 then publishing it via another pointer seems to me to be asking for
 trouble anyway.  If you really want to do something like that and still
 see consistency across all the fields in the structure, please put a lock
 in the structure and use it to guard updates and accesses to those fields.

And here is a patch documenting the restrictions for the current Linux
kernel.  The rules change a bit due to rcu_dereference() acting a bit
differently than atomic_load_explicit(p, memory_order_consume).

Thoughts?

Thanx, Paul



documentation: Record rcu_dereference() value mishandling

Recent LKML discussings (see http://lwn.net/Articles/586838/ and
http://lwn.net/Articles/588300/ for the LWN writeups) brought out
some ways of misusing the return value from rcu_dereference() that
are not necessarily completely intuitive.  This commit therefore
documents what can and cannot safely be done with these values.

Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com

diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX
index fa57139f50bf..f773a264ae02 100644
--- a/Documentation/RCU/00-INDEX
+++ b/Documentation/RCU/00-INDEX
@@ -12,6 +12,8 @@ lockdep-splat.txt
- RCU Lockdep splats explained.
 NMI-RCU.txt
- Using RCU to Protect Dynamic NMI Handlers
+rcu_dereference.txt
+   - Proper care and feeding of return values from rcu_dereference()
 rcubarrier.txt
- RCU and Unloadable Modules
 rculist_nulls.txt
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
index 9d10d1db16a5..877947130ebe 100644
--- a/Documentation/RCU/checklist.txt
+++ b/Documentation/RCU/checklist.txt
@@ -114,12 +114,16 @@ over a rather long period of time, but improvements are always welcome!
http://www.openvms.compaq.com/wizard/wiz_2637.html
 
The rcu_dereference() primitive is also an excellent
-   documentation aid, letting the person reading the code
-   know exactly which pointers are protected by RCU.
+   documentation aid, letting the person reading the
+   code know exactly which pointers are protected by RCU.
Please note

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Paul E. McKenney
On Thu, Feb 27, 2014 at 04:37:33PM +0100, Torvald Riegel wrote:
 
 On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote:
  On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney
  paul...@linux.vnet.ibm.com wrote:
  
   Good points.  How about the following replacements?
  
   3.  Adding or subtracting an integer to/from a chained pointer
   results in another chained pointer in that same pointer chain.
   The results of addition and subtraction operations that cancel
   the chained pointer's value (for example, p-(long)p where p
   is a pointer to char) are implementation defined.
  
    4.  Bitwise operators (&, |, ^, and I suppose also ~)
   applied to a chained pointer and an integer for the purposes
   of alignment and pointer translation results in another
   chained pointer in that same pointer chain.  Other uses
   of bitwise operators on chained pointers (for example,
   p|~0) are implementation defined.
  
  Quite frankly, I think all of this language that is about the actual
  operations is irrelevant and wrong.
  
  It's not going to help compiler writers, and it sure isn't going to
  help users that read this.
  
  Why not just talk about value chains and that any operations that
  restrict the value range severely end up breaking the chain. There is
  no point in listing the operations individually, because every single
  operation *can* restrict things. Listing individual operations and
   dependencies is just fundamentally wrong.
 
 [...]
 
  The *only* thing that matters for all of them is whether they are
  value-preserving, or whether they drop so much information that the
  compiler might decide to use a control dependency instead. That's true
  for every single one of them.
  
  Similarly, actual true control dependencies that limit the problem
  space sufficiently that the actual pointer value no longer has
  significant information in it (see the above example) are also things
  that remove information to the point that only a control dependency
  remains. Even when the value itself is not modified in any way at all.
 
 I agree that just considering syntactic properties of the program seems
 to be insufficient.  Making it instead depend on whether there is a
 semantic dependency due to a value being necessary to compute a
 result seems better.  However, whether a value is necessary might not
 be obvious, and I understand Paul's argument that he does not want to
 have to reason about all potential compiler optimizations.  Thus, I
 believe we need to specify when a value is necessary.
 
 I have a suggestion for a somewhat different formulation of the feature
 that you seem to have in mind, which I'll discuss below.  Excuse the
 verbosity of the following, but I'd rather like to avoid
 misunderstandings than save a few words.

Thank you very much for putting this forward!  I must confess that I was
stuck, and my earlier attempt now enshrined in the C11 and C++11 standards
is quite clearly way bogus.

One possible saving grace:  From discussions at the standards committee
meeting a few weeks ago, there is some chance that the committee will
be willing to do a rip-and-replace on the current memory_order_consume
wording, without provisions for backwards compatibility with the current
bogosity.

 What we'd like to capture is that a value originating from a mo_consume
 load is necessary for a computation (e.g., it cannot be replaced
 with value predictions and/or control dependencies); if that's the case
 in the program, we can reasonably assume that a compiler implementation
 will transform this into a data dependency, which will then lead to
 ordering guarantees by the HW.
 
 However, we need to specify when a value is necessary.  We could say
 that this is implementation-defined, and use a set of litmus tests
 (e.g., like those discussed in the thread) to roughly carve out what a
 programmer could expect.  This may even be practical for a project like
 the Linux kernel that follows strict project-internal rules and pays a
 lot of attention to what the particular implementations of compilers
 expected to compile the kernel are doing.  However, I think this
 approach would be too vague for the standard and for many other
 programs/projects.

I agree that a number of other projects would have more need for this than
might the kernel.  Please understand that this is in no way denigrating
the intelligence of other projects' members.  It is just that many of
them have only recently started seriously thinking about concurrency.
In contrast, the Linux kernel community has been doing concurrency since
the mid-1990s.  Projects with less experience with concurrency will
probably need more help, from the compiler and from elsewhere as well.

Your proposal looks quite promising at first glance.  But rather than
try and comment on it immediately, I am going to take a number

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Paul E. McKenney
On Thu, Feb 27, 2014 at 09:01:40AM -0800, Linus Torvalds wrote:
 On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel trie...@redhat.com wrote:
 
  I agree that just considering syntactic properties of the program seems
  to be insufficient.  Making it instead depend on whether there is a
  semantic dependency due to a value being necessary to compute a
  result seems better.  However, whether a value is necessary might not
  be obvious, and I understand Paul's argument that he does not want to
  have to reason about all potential compiler optimizations.  Thus, I
  believe we need to specify when a value is necessary.
 
 I suspect it's hard to really strictly define, but at the same time I
 actually think that compiler writers (and users, for that matter) have
 little problem understanding the concept and intent.
 
 I do think that listing operations might be useful to give good
 examples of what is a necessary value, and - perhaps more
 importantly - what can break the value from being necessary.
 Especially the gotchas.
 
  I have a suggestion for a somewhat different formulation of the feature
  that you seem to have in mind, which I'll discuss below.  Excuse the
  verbosity of the following, but I'd rather like to avoid
  misunderstandings than save a few words.
 
 Ok, I'm going to cut most of the verbiage since it's long and I'm not
 commenting on most of it.
 
 But
 
  Based on these thoughts, we could specify the new mo_consume guarantees
  roughly as follows:
 
  An evaluation E (in an execution) has a value dependency to an
  atomic and mo_consume load L (in an execution) iff:
  * L's type holds more than one value (ruling out constants
  etc.),
  * L is sequenced-before E,
  * L's result is used by the abstract machine to compute E,
  * E is value-dependency-preserving code (defined below), and
  * at the time of execution of E, L can possibly have returned at
  least two different values under the assumption that L itself
  could have returned any value allowed by L's type.
 
  If a memory access A's targeted memory location has a value
  dependency on a mo_consume load L, and an action X
  inter-thread-happens-before L, then X happens-before A.
 
 I think this mostly works.
 
  Regarding the latter, we make a fresh start at each mo_consume load (ie,
  we assume we know nothing -- L could have returned any possible value);
  I believe this is easier to reason about than other scopes like function
  granularities (what happens on inlining?), or translation units.  It
  should also be simple to implement for compilers, and would hopefully
  not constrain optimization too much.
 
  [...]
 
  Paul's litmus test would work, because we guarantee to the programmer
  that it can assume that the mo_consume load would return any value
  allowed by the type; effectively, this forbids the compiler analysis
  Paul thought about:
 
 So realistically, since with the new wording we can ignore the silly
 cases (i.e., p - p) and we can ignore the trivial-to-optimize compiler
 cases (if (p == &variable) .. use p), and you would forbid the
 global value range optimization case that Paul brought up, what
 remains would seem to be just really subtle compiler transformations
 of data dependencies to control dependencies.

FWIW, I am looking through the kernel for instances of your first
if (p == variable) .. use p limus test.  All the ones I have found
thus far are OK for one of the following reasons:

1.  The comparison was against NULL, so you don't get to dereference
the pointer anyway.  About 80% are in this category.

2.  The comparison was against another pointer, but there were no
dereferences afterwards.  Here is an example of what these
can look like:

list_for_each_entry_rcu(p, head, next)
if (p == variable)
return; /* p goes out of scope. */

3.  The comparison was against another RCU-protected pointer,
where that other pointer was properly fetched using one
of the RCU primitives.  Here it doesn't matter which pointer
you use.  At least as long as the rcu_assign_pointer() for
that other pointer happened after the last update to the
pointed-to structure.

I am a bit nervous about #3.  Any thoughts on it?

Some other reasons why it would be OK to dereference after a comparison:

4.  The pointed-to data is constant: (a) It was initialized at
boot time, (b) the update-side lock is held, (c) we are
running in a kthread and the data was initialized before the
kthread was created, (d) we are running in a module, and
the data was initialized during or before module-init time
for that module.  And many more besides, involving pretty
much every kernel primitive that makes something run later.

5.  All subsequent dereferences 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Paul E. McKenney
On Thu, Feb 27, 2014 at 09:50:21AM -0800, Paul E. McKenney wrote:
 On Thu, Feb 27, 2014 at 04:37:33PM +0100, Torvald Riegel wrote:
  
  On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote:
   On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney
   paul...@linux.vnet.ibm.com wrote:
   
Good points.  How about the following replacements?
   
3.  Adding or subtracting an integer to/from a chained pointer
results in another chained pointer in that same pointer chain.
The results of addition and subtraction operations that cancel
the chained pointer's value (for example, p-(long)p where p
is a pointer to char) are implementation defined.
   
 4.  Bitwise operators (&, |, ^, and I suppose also ~)
applied to a chained pointer and an integer for the purposes
of alignment and pointer translation results in another
chained pointer in that same pointer chain.  Other uses
of bitwise operators on chained pointers (for example,
p|~0) are implementation defined.
   
   Quite frankly, I think all of this language that is about the actual
   operations is irrelevant and wrong.
   
   It's not going to help compiler writers, and it sure isn't going to
   help users that read this.
   
   Why not just talk about value chains and that any operations that
   restrict the value range severely end up breaking the chain. There is
   no point in listing the operations individually, because every single
   operation *can* restrict things. Listing individual operations and
    dependencies is just fundamentally wrong.
  
  [...]
  
   The *only* thing that matters for all of them is whether they are
   value-preserving, or whether they drop so much information that the
   compiler might decide to use a control dependency instead. That's true
   for every single one of them.
   
   Similarly, actual true control dependencies that limit the problem
   space sufficiently that the actual pointer value no longer has
   significant information in it (see the above example) are also things
   that remove information to the point that only a control dependency
   remains. Even when the value itself is not modified in any way at all.
  
  I agree that just considering syntactic properties of the program seems
  to be insufficient.  Making it instead depend on whether there is a
  semantic dependency due to a value being necessary to compute a
  result seems better.  However, whether a value is necessary might not
  be obvious, and I understand Paul's argument that he does not want to
  have to reason about all potential compiler optimizations.  Thus, I
  believe we need to specify when a value is necessary.
  
  I have a suggestion for a somewhat different formulation of the feature
  that you seem to have in mind, which I'll discuss below.  Excuse the
  verbosity of the following, but I'd rather like to avoid
  misunderstandings than save a few words.
 
 Thank you very much for putting this forward!  I must confess that I was
 stuck, and my earlier attempt now enshrined in the C11 and C++11 standards
 is quite clearly way bogus.
 
 One possible saving grace:  From discussions at the standards committee
 meeting a few weeks ago, there is some chance that the committee will
 be willing to do a rip-and-replace on the current memory_order_consume
 wording, without provisions for backwards compatibility with the current
 bogosity.
 
  What we'd like to capture is that a value originating from a mo_consume
  load is necessary for a computation (e.g., it cannot be replaced
  with value predictions and/or control dependencies); if that's the case
  in the program, we can reasonably assume that a compiler implementation
  will transform this into a data dependency, which will then lead to
  ordering guarantees by the HW.
  
  However, we need to specify when a value is necessary.  We could say
  that this is implementation-defined, and use a set of litmus tests
  (e.g., like those discussed in the thread) to roughly carve out what a
  programmer could expect.  This may even be practical for a project like
  the Linux kernel that follows strict project-internal rules and pays a
  lot of attention to what the particular implementations of compilers
  expected to compile the kernel are doing.  However, I think this
  approach would be too vague for the standard and for many other
  programs/projects.
 
 I agree that a number of other projects would have more need for this than
 might the kernel.  Please understand that this is in no way denigrating
 the intelligence of other projects' members.  It is just that many of
 them have only recently started seriously thinking about concurrency.
 In contrast, the Linux kernel community has been doing concurrency since
 the mid-1990s.  Projects with less experience with concurrency will
 probably need more help, from the compiler

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Paul E. McKenney
On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
 On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
 
  3.  The comparison was against another RCU-protected pointer,
  where that other pointer was properly fetched using one
  of the RCU primitives.  Here it doesn't matter which pointer
  you use.  At least as long as the rcu_assign_pointer() for
  that other pointer happened after the last update to the
  pointed-to structure.
 
  I am a bit nervous about #3.  Any thoughts on it?
 
 I think that it might be worth pointing out as an example, and saying
 that code like
 
 p = atomic_read(pp, consume);
 X;
 q = atomic_read(pp, consume);
 Y;
 if (p == q)
  data = p->val;
 
 then the access of p->val is constrained to be data-dependent on
 *either* p or q, but you can't really tell which, since the compiler
 can decide that the values are interchangeable.
 
 I cannot for the life of me come up with a situation where this would
 matter, though. If X contains a fence, then that fence will be a
 stronger ordering than anything the consume through p would
 guarantee anyway. And if X does *not* contain a fence, then the
 atomic reads of p and q are unordered *anyway*, so then whether the
 ordering to the access through p is through p or q is kind of
 irrelevant. No?

I can make a contrived litmus test for it, but you are right, the only
time you can see it happen is when X has no barriers, in which case
you don't have any ordering anyway -- both the compiler and the CPU can
reorder the loads into p and q, and the read from p->val can, as you say,
come from either pointer.

For whatever it is worth, hear is the litmus test:

T1: p = kmalloc(...);
if (p == NULL)
deal_with_it();
    p->a = 42;  /* Each field in its own cache line. */
    p->b = 43;
    p->c = 44;
    atomic_store_explicit(&gp1, p, memory_order_release);
    p->b = 143;
    p->c = 144;
    atomic_store_explicit(&gp2, p, memory_order_release);

T2: p = atomic_load_explicit(&gp2, memory_order_consume);
    r1 = p->b;  /* Guaranteed to get 143. */
    q = atomic_load_explicit(&gp1, memory_order_consume);
    if (p == q) {
        /* The compiler decides that q->c is same as p->c. */
        r2 = p->c; /* Could get 44 on a weakly ordered system. */
}

The loads from gp1 and gp2 are, as you say, unordered, so you get what
you get.

And publishing a structure via one RCU-protected pointer, updating it,
then publishing it via another pointer seems to me to be asking for
trouble anyway.  If you really want to do something like that and still
see consistency across all the fields in the structure, please put a lock
in the structure and use it to guard updates and accesses to those fields.

Thanx, Paul




Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-26 Thread Paul E. McKenney
On Wed, Feb 26, 2014 at 02:04:30PM +0100, Torvald Riegel wrote:
 
 On Fri, 2014-02-21 at 11:13 -0800, Paul E. McKenney wrote:
  On Fri, Feb 21, 2014 at 07:35:37PM +0100, Michael Matz wrote:
   Hi,
   
   On Thu, 20 Feb 2014, Linus Torvalds wrote:
   
But I'm pretty sure that any compiler guy must *hate* that current odd
dependency-generation part, and if I was a gcc person, seeing that
bugzilla entry Torvald pointed at, I would personally want to
dismember somebody with a rusty spoon..
   
   Yes.  Defining dependency chains in the way the standard currently seems 
   to do must come from people not writing compilers.  There's simply no 
   sensible way to implement it without being really conservative, because 
   the depchains can contain arbitrary constructs including stores, 
   loads and function calls but must still be observed.  
   
   And with conservative I mean everything is a source of a dependency, and 
   hence can't be removed, reordered or otherwise fiddled with, and that 
   includes code sequences where no atomic objects are anywhere in sight [1].
   In the light of that the only realistic way (meaning to not have to 
   disable optimization everywhere) to implement consume as currently 
   specified is to map it to acquire.  At which point it becomes pointless.
  
  No, only memory_order_consume loads and [[carries_dependency]]
  function arguments are sources of dependency chains.
 
 However, that is, given how the standard specifies things, just one of
 the possible ways for how an implementation can handle this.  Treating
 [[carries_dependency]] as a necessary annotation to make exploiting
 mo_consume work in practice is possible, but it's not required by the
 standard.
 
 Also, dependencies are specified to flow through loads and stores
 (restricted to scalar objects and bitfields), so any load that might
 load from a dependency-carrying store can also be a source (and that
 doesn't seem to be restricted by [[carries_dependency]]).

OK, this last is clearly totally unacceptable.  :-/

Leaving aside the option of dropping the whole thing for the moment,
the only thing that suggests itself is having all dependencies die at
a specific point in the code, corresponding to the rcu_read_unlock().
But as far as I can see, that absolutely requires necessary parameter
and return marking in order to correctly handle nested RCU read-side
critical sections in different functions.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-25 Thread Paul E. McKenney
On Mon, Feb 24, 2014 at 10:05:52PM -0800, Linus Torvalds wrote:
 On Mon, Feb 24, 2014 at 3:35 PM, Linus Torvalds
 torva...@linux-foundation.org wrote:
 
  Litmus test 1:
 
  p = atomic_read(pp, consume);
   if (p == &variable)
   return p->val;
 
 is *NOT* ordered
 
 Btw, don't get me wrong. I don't _like_ it not being ordered, and I
 actually did spend some time thinking about my earlier proposal on
 strengthening the 'consume' ordering.

Understood.

 I have for the last several years been 100% convinced that the Intel
 memory ordering is the right thing, and that people who like weak
 memory ordering are wrong and should try to avoid reproducing if at
 all possible. But given that we have memory orderings like power and
 ARM, I don't actually see a sane way to get a good strong ordering.
 You can teach compilers about cases like the above when they actually
 see all the code and they could poison the value chain etc. But it
 would be fairly painful, and once you cross object files (or even just
 functions in the same compilation unit, for that matter), it goes from
 painful to just ridiculously not worth it.

And I have indeed seen a post or two from you favoring stronger memory
ordering over the past few years.  ;-)

 So I think the C semantics should mirror what the hardware gives us -
 and do so even in the face of reasonable optimizations - not try to do
 something else that requires compilers to treat consume very
 differently.

I am sure that a great many people would jump for joy at the chance to
drop any and all RCU-related verbiage from the C11 and C++11 standards.
(I know, you aren't necessarily advocating this, but given what you
say above, I cannot think what verbiage that would remain.)

The thing that makes me very nervous is how much the definition of
reasonable optimization has changed.  For example, before the
2.6.10 Linux kernel, we didn't even apply volatile semantics to
fetches of RCU-protected pointers -- and as far as I know, never
needed to.  But since then, there have been several cases where the
compiler happily hoisted a normal load out of a surprisingly large loop.
Hardware advances can come into play as well.  For example, my very first
RCU work back in the early 90s was on a parallel system whose CPUs had
no branch-prediction hardware (80386 or 80486, I don't remember which).
Now people talk about compilers using branch prediction hardware to
implement value-speculation optimizations.  Five or ten years from now,
who knows what crazy optimizations might be considered to be completely
reasonable?

Are ARM and Power really the bad boys here?  Or are they instead playing
the role of the canary in the coal mine?

 If people made me king of the world, I'd outlaw weak memory ordering.
 You can re-order as much as you want in hardware with speculation etc,
 but you should always *check* your speculation and make it *look* like
 you did everything in order. Which is pretty much the intel memory
 ordering (ignoring the write buffering).

Speaking as someone who got whacked over the head with DEC Alpha when
first presenting RCU to the Digital UNIX folks long ago, I do have some
sympathy with this line of thought.  But as you say, it is not the world
we currently live in.

Of course, in the final analysis, your kernel, your call.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-25 Thread Paul E. McKenney
On Tue, Feb 25, 2014 at 05:47:03PM -0800, Linus Torvalds wrote:
 On Mon, Feb 24, 2014 at 10:00 PM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
 
  So let me see if I understand your reasoning.  My best guess is that it
  goes something like this:
 
  1.  The Linux kernel contains code that passes pointers from
  rcu_dereference() through external functions.
 
 No, actually, it's not so much Linux-specific at all.
 
 I'm actually thinking about what I'd do as a compiler writer, and as a
 defender of the "C is a high-level assembler" concept.
 
 I love C. I'm a huge fan. I think it's a great language, and I think
 it's a great language not because of some theoretical issues, but
 because it is the only language around that actually maps fairly well
 to what machines really do.
 
 And it's a *simple* language. Sure, it's not quite as simple as it
 used to be, but look at how thin the K&R book is. Which pretty much
 describes it - still.
 
 That's the real strength of C, and why it's the only language serious
 people use for system programming.  Ignore C++ for a while (Jesus
 Xavier Christ, I've had to do C++ programming for subsurface), and
 just think about what makes _C_ a good language.

The last time I used C++ for a project was in 1990.  It was a lot smaller
then.

 I can look at C code, and I can understand what the code generation
 is, and what it will really *do*. And I think that's important.
 Abstractions that hide what the compiler will actually generate are
 bad abstractions.
 
 And ok, so this is obviously Linux-specific in that it's generally
 only Linux where I really care about the code generation, but I do
 think it's a bigger issue too.
 
 So I want C features to *map* to the hardware features they implement.
 The abstractions should match each other, not fight each other.

OK...

  Actually, the fact that there are more potential optimizations than I can
  think of is a big reason for my insistence on the carries-a-dependency
  crap.  My lack of optimization omniscience makes me very nervous about
  relying on there never ever being a reasonable way of computing a given
  result without preserving the ordering.
 
 But if I can give two clear examples that are basically identical from
 a syntactic standpoint, and one clearly can be trivially optimized to
 the point where the ordering guarantee goes away, and the other
 cannot, and you cannot describe the difference, then I think your
 description is seriously lacking.

In my defense, my plan was to constrain the compiler to retain the
ordering guarantee in either case.  Yes, I did notice that you find
that unacceptable.

 And I do *not* think the C language should be defined by how it can be
 described. Leave that to things like Haskell or LISP, where the goal
 is some kind of completeness of the language that is about the
 language, not about the machines it will run on.

I am with you up to the point that the fancy optimizers start kicking
in.  I don't know how to describe what the optimizers are and are not
permitted to do strictly in terms of the underlying hardware.

  So the code sequence I already mentioned is *not* ordered:
 
  Litmus test 1:
 
  p = atomic_read(pp, consume);
   if (p == &variable)
   return p->val;
 
 is *NOT* ordered, because the compiler can trivially turn this into
  return variable.val, and break the data dependency.
 
  Right, given your model, the compiler is free to produce code that
  doesn't order the load from pp against the load from p-val.
 
 Yes. Note also that that is what existing compilers would actually do.
 
 And they'd do it by mistake: they'd load the address of the variable
 into a register, and then compare the two registers, and then end up
 using _one_ of the registers as the base pointer for the p->val
 access, but I can almost *guarantee* that there are going to be
 sequences where some compiler will choose one register over the other
 based on some random detail.
 
 So my model isn't just a model, it also happens to descibe reality.

Sounds to me like your model -is- reality.  I believe that it is useful
to constrain reality from time to time, but understand that you vehemently
disagree.

  Indeed, it won't work across different compilation units unless
  the compiler is told about it, which is of course the whole point of
  [[carries_dependency]].  Understood, though, the Linux kernel currently
  does not have anything that could reasonably automatically generate those
  [[carries_dependency]] attributes.  (Or are there other reasons why you
  believe [[carries_dependency]] is problematic?)
 
 So I think carries_dependency is problematic because:
 
  - it's not actually in C11 afaik

Indeed it is not, but I bet that gcc will implement it like it does the
other attributes that are not part of C11.

  - it requires the programmer to solve the problem of the standard not
 matching the hardware.

The programmer in this instance being the compiler writer?

  - I

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-25 Thread Paul E. McKenney
On Tue, Feb 25, 2014 at 10:06:53PM -0500, George Spelvin wrote:
 paul...@linux.vnet.ibm.com wrote:
  torva...@linux-foundation.org wrote:
  I have for the last several years been 100% convinced that the Intel
  memory ordering is the right thing, and that people who like weak
  memory ordering are wrong and should try to avoid reproducing if at
  all possible.
 
  Are ARM and Power really the bad boys here?  Or are they instead playing
  the role of the canary in the coal mine?
 
 To paraphrase some older threads, I think Linus's argument is that
 weak memory ordering is like branch delay slots: a way to make a simple
 implementation simpler, but ends up being no help to a more aggressive
 implementation.
 
 Branch delay slots give a one-cycle bonus to in-order cores, but
 once you go superscalar and add branch prediction, they stop helping,
 and once you go full out of order, they're just an annoyance.
 
 Likewise, I can see the point that weak ordering can help make a simple
 cache interface simpler, but once you start doing speculative loads,
 you've already bought and paid for all the hardware you need to do
 stronger coherency.
 
 Another thing that requires all the strong-coherency machinery is
 a high-performance implementation of the various memory barrier and
 synchronization operations.  Yes, a low-performance (drain the pipeline)
 implementation is tolerable if the instructions aren't used frequently,
 but once you're really trying, it doesn't save complexity.
 
 Once you're there, strong coherency always doesn't actually cost you any
 time outside of critical synchronization code, and it both simplifies
 and speeds up the tricky synchronization software.
 
 
 So PPC and ARM's weak ordering are not the direction the future is going.
 Rather, weak ordering is something that's only useful in a limited
 technology window, which is rapidly passing.

That does indeed appear to be Intel's story.  Might well be correct.
Time will tell.

 If you can find someone in IBM who's worked on the Z series cache
 coherency (extremely strong ordering), they probably have some useful
 insights.  The big question is if strong ordering, once you've accepted
 the implementation complexity and area, actually costs anything in
 execution time.  If there's an unavoidable cost which weak ordering saves,
 that's significant.

There has been a lot of ink spilled on this argument.  ;-)

PPC has much larger CPU counts than does the mainframe.  On the other
hand, there are large x86 systems.  Some claim that there are differences
in latency due to the different approaches, and there could be a long
argument about whether all this is inherent in the memory ordering or
whether it is due to implementation issues.

I don't claim to know the answer.  I do know that ARM and PPC are
here now, and that I need to deal with them.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-25 Thread Paul E. McKenney
On Tue, Feb 25, 2014 at 08:32:38PM -0700, Jeff Law wrote:
 On 02/25/14 17:15, Paul E. McKenney wrote:
 I have for the last several years been 100% convinced that the Intel
 memory ordering is the right thing, and that people who like weak
 memory ordering are wrong and should try to avoid reproducing if at
 all possible. But given that we have memory orderings like power and
 ARM, I don't actually see a sane way to get a good strong ordering.
 You can teach compilers about cases like the above when they actually
 see all the code and they could poison the value chain etc. But it
 would be fairly painful, and once you cross object files (or even just
 functions in the same compilation unit, for that matter), it goes from
 painful to just ridiculously not worth it.
 
 And I have indeed seen a post or two from you favoring stronger memory
 ordering over the past few years.  ;-)
 I couldn't agree more.
 
 
 Are ARM and Power really the bad boys here?  Or are they instead playing
 the role of the canary in the coal mine?
 That's a question I've been struggling with recently as well.  I
 suspect they (arm, power) are going to be the outliers rather than
 the canary. While the weaker model may give them some advantages WRT
 scalability, I don't think it'll ultimately be enough to overcome
 the difficulty in writing correct low level code for them.
 
 Regardless, they're here and we have to deal with them.

Agreed...

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-24 Thread Paul E. McKenney
On Mon, Feb 24, 2014 at 05:55:50PM +0100, Michael Matz wrote:
 Hi,
 
 On Mon, 24 Feb 2014, Linus Torvalds wrote:
 
   To me that reads like
  
 int i;
 int *q = &i;
 int **p = &q;
  
 atomic_XXX (&p, CONSUME);
  
   orders against accesses '*p', '**p', '*q' and 'i'.  Thus it seems they
   want to say that it orders against aliased storage - but then go further
   and include indirectly through a chain of pointers?!  Thus an
   atomic read of a int * orders against any 'int' memory operation but
   not against 'float' memory operations?
  
  No, it's not about type at all, and the chain of pointers can be
  much more complex than that, since the int * can point to within an
  object that contains other things than just that int (the int can
  be part of a structure that then has pointers to other structures
  etc).
 
 So, let me try to poke holes into your definition or increase my 
 understanding :) .  You said chain of pointers(dereferences I assume), 
 e.g. if p is result of consume load, then access to 
  p->here->there->next->prev->stuff is supposed to be ordered with that load 
 (or only when that last load/store itself is also an atomic load or 
 store?).
 
 So, what happens if the pointer deref chain is partly hidden in some 
 functions:
 
  A * adjustptr (B *ptr) { return ptr->here->there->next; }
 B * p = atomic_XXX (somewhere, consume);
  adjustptr(p)->prev->stuff = bla;
 
 As far as I understood you, this whole ptrderef chain business would be 
 only an optimization opportunity, right?  So if the compiler can't be sure 
 how p is actually used (as in my function-using case, assume adjustptr is 
 defined in another unit), then the consume load would simply be 
 transformed into an acquire (or whatever, with some barrier I mean)?  Only 
 _if_ the compiler sees all obvious uses of p (indirectly through pointer 
 derefs) can it, yeah, do what with the consume load?

Good point, I left that out of my list.  Adding it:

13. By default, pointer chains do not propagate into or out of functions.
In implementations having attributes, a [[carries_dependency]]
may be used to mark a function argument or return as passing
a pointer chain into or out of that function.

If a function does not contain memory_order_consume loads and
also does not contain [[carries_dependency]] attributes, then
that function may be compiled using any desired dependency-breaking
optimizations.

The ordering effects are implementation defined when a given
pointer chain passes into or out of a function through a parameter
or return not marked with a [[carries_dependency]] attribute.

Note that this last paragraph differs from the current standard, which
would require ordering regardless.
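One conservative way an implementation could honor a dependency chain that crosses an unannotated function boundary is to promote the consume ordering to acquire at the call. A hedged C11 sketch of that fallback (the types and names are invented for illustration):

```c
#include <stdatomic.h>

struct node { int val; };

struct node n = { .val = 42 };
_Atomic(struct node *) pp = &n;

/* Imagine this function lives in another translation unit: the compiler
 * building it knows nothing about any consume load in its callers, so it
 * is free to apply dependency-breaking optimizations internally. */
int read_val(struct node *p)
{
    return p->val;
}

int reader(void)
{
    struct node *p = atomic_load_explicit(&pp, memory_order_consume);
    /* The chain is about to leave through an unannotated parameter, so a
     * conservative implementation inserts an acquire fence here (or just
     * compiles the consume load as an acquire load). */
    atomic_thread_fence(memory_order_acquire);
    return read_val(p);
}
```

This is the "emit a fence when a dependency chain enters or leaves a function" strategy mentioned below, not something the standard mandates.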

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-24 Thread Paul E. McKenney
On Mon, Feb 24, 2014 at 02:55:07PM +0100, Michael Matz wrote:
 Hi,
 
 On Fri, 21 Feb 2014, Paul E. McKenney wrote:
 
   And with conservative I mean everything is a source of a dependency, and 
   hence can't be removed, reordered or otherwise fiddled with, and that 
   includes code sequences where no atomic objects are anywhere in sight [1].
   In the light of that the only realistic way (meaning to not have to 
   disable optimization everywhere) to implement consume as currently 
   specified is to map it to acquire.  At which point it becomes pointless.
  
  No, only memory_order_consume loads and [[carries_dependency]]
  function arguments are sources of dependency chains.
 
 I don't see [[carries_dependency]] in the C11 final draft (yeah, should 
 get a real copy, I know, but let's assume it's the same language as the 
 standard).  Therefore, yes, only consume loads are sources of 
 dependencies.  The problem with the definition of the carries a 
 dependency relation is not the sources, but rather where it stops.  
 It's transitively closed over value of evaluation A is used as operand in 
 evaluation B, with very few exceptions as per 5.1.2.4#14.  Evaluations 
 can contain function calls, so if there's _any_ chance that an operand of 
 an evaluation might even indirectly use something resulting from a consume 
 load then that evaluation must be compiled in a way to not break 
 dependency chains.
 
 I don't see a way to generally assume that e.g. the value of a function 
 argument can impossibly result from a consume load, therefore the compiler 
 must assume that all function arguments _can_ result from such loads, and 
 so must disable all depchain breaking optimization (which are many).
 
   [1] Simple example of what type of transformations would be disallowed:
   
   int getzero (int i) { return i - i; }
  
  This needs to be as follows:
  
  [[carries_dependency]] int getzero(int i [[carries_dependency]])
  {
  return i - i;
  }
  
  Otherwise dependencies won't get carried through it.
 
 So, with the above do you agree that in absence of any other magic (see 
 below) the compiler is not allowed to transform my initial getzero() 
 (without the carries_dependency markers) implementation into return 0; 
 because of the C11 rules for carries-a-dependency?
 
 If so, do you then also agree that the specification of carries a 
 dependency is somewhat, err, shall we say, overbroad?

From what I can see, overbroad.  The problem is that the C++11 standard
defines how carries-dependency interacts with function calls and returns
in 7.6.4, which describes the [[carries_dependency]] attribute.  For example,
7.6.4p6 says:

Function g’s second parameter has a carries_dependency
attribute, but its first parameter does not. Therefore, function
h’s first call to g carries a dependency into g, but its second
call does not. The implementation might need to insert a fence
prior to the second call to g.

When C11 declined to take attributes, they also left out the part saying
how carries-dependency interacts with functions.  :-/

Might be fixed by now, checking up on it.

One could argue that the bit about emitting fence instructions at
function calls and returns is implied by the as-if rule even without
this wording, but...

   depchains don't matter, could _then_ optmize it to zero.  But that's 
   insane, especially considering that it's hard to detect if a given 
   context 
   doesn't care for depchains, after all the depchain relation is 
   constructed 
   exactly so that it bleeds into nearly everywhere.  So we would most of 
   the time have to assume that the ultimate context will be depchain-aware 
   and therefore disable many transformations.
  
  Any function that does not contain a memory_order_consume load and that 
  doesn't have any arguments marked [[carries_dependency]] can be 
  optimized just as before.
 
 And as such marker doesn't exist we must conservatively assume that it's 
 on _all_ parameters, so I'll stand by my claim.

Or that you have to emit a fence instruction when a dependency chain
enters or leaves a function in cases where all callers/calles are not
visible to the compiler.

My preference is that the ordering properties of a carries-dependency
chain is implementation defined at the point that it enters or leaves
a function without the marker, but others strongly disagreed.  ;-)

   Then inlining getzero would merely add another # j.dep = i.dep 
   relation, so depchains are still there but the value optimization can 
   happen before inlining.  Having to do something like that I'd find 
   disgusting, and rather rewrite consume into acquire :)  Or make the 
   depchain relation somehow realistically implementable.
  
  I was actually OK with arithmetic cancellation breaking the dependency 
  chains.  Others on the committee felt otherwise, and I figured that (1) 
  I wouldn't be writing that kind of function anyway and (2) they knew

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-24 Thread Paul E. McKenney
On Mon, Feb 24, 2014 at 09:28:56AM -0800, Paul E. McKenney wrote:
 On Mon, Feb 24, 2014 at 05:55:50PM +0100, Michael Matz wrote:
  Hi,
  
  On Mon, 24 Feb 2014, Linus Torvalds wrote:
  
To me that reads like
   
  int i;
  int *q = &i;
  int **p = &q;
   
  atomic_XXX (&p, CONSUME);
   
orders against accesses '*p', '**p', '*q' and 'i'.  Thus it seems they
want to say that it orders against aliased storage - but then go further
and include indirectly through a chain of pointers?!  Thus an
atomic read of a int * orders against any 'int' memory operation but
not against 'float' memory operations?
   
   No, it's not about type at all, and the chain of pointers can be
   much more complex than that, since the int * can point to within an
   object that contains other things than just that int (the int can
   be part of a structure that then has pointers to other structures
   etc).
  
  So, let me try to poke holes into your definition or increase my 
  understanding :) .  You said chain of pointers(dereferences I assume), 
  e.g. if p is result of consume load, then access to 
   p->here->there->next->prev->stuff is supposed to be ordered with that load 
  (or only when that last load/store itself is also an atomic load or 
  store?).
  
  So, what happens if the pointer deref chain is partly hidden in some 
  functions:
  
   A * adjustptr (B *ptr) { return ptr->here->there->next; }
  B * p = atomic_XXX (somewhere, consume);
   adjustptr(p)->prev->stuff = bla;
  
  As far as I understood you, this whole ptrderef chain business would be 
  only an optimization opportunity, right?  So if the compiler can't be sure 
  how p is actually used (as in my function-using case, assume adjustptr is 
  defined in another unit), then the consume load would simply be 
  transformed into an acquire (or whatever, with some barrier I mean)?  Only 
  _if_ the compiler sees all obvious uses of p (indirectly through pointer 
  derefs) can it, yeah, do what with the consume load?
 
 Good point, I left that out of my list.  Adding it:
 
 13.   By default, pointer chains do not propagate into or out of functions.
   In implementations having attributes, a [[carries_dependency]]
   may be used to mark a function argument or return as passing
   a pointer chain into or out of that function.
 
   If a function does not contain memory_order_consume loads and
   also does not contain [[carries_dependency]] attributes, then
   that function may be compiled using any desired dependency-breaking
   optimizations.
 
   The ordering effects are implementation defined when a given
   pointer chain passes into or out of a function through a parameter
   or return not marked with a [[carries_dependency]] attribute.
 
 Note that this last paragraph differs from the current standard, which
 would require ordering regardless.

And there is also kill_dependency(), which needs to be added to the list
in #8 of operators that take a chained pointer and return something that
is not a chained pointer.
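C11's <stdatomic.h> already provides the kill_dependency() macro for exactly this purpose. A minimal sketch of its effect (names illustrative):

```c
#include <stdatomic.h>

struct node { int a; int b; };

struct node n = { .a = 1, .b = 2 };
_Atomic(struct node *) gp = &n;

int reader(void)
{
    struct node *p = atomic_load_explicit(&gp, memory_order_consume);
    int a = p->a;                         /* carried: ordered after the load */
    struct node *q = kill_dependency(p);  /* q is NOT a chained pointer */
    int b = q->b;                         /* chain ended: the implementation
                                             need not order this access      */
    return a + b;
}
```

The value of q is identical to p; only the ordering obligation is dropped, which is what lets the compiler resume its usual optimizations on accesses through q.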

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-24 Thread Paul E. McKenney
On Mon, Feb 24, 2014 at 09:38:46AM -0800, Linus Torvalds wrote:
 On Mon, Feb 24, 2014 at 8:55 AM, Michael Matz m...@suse.de wrote:
 
  So, let me try to poke holes into your definition or increase my
  understanding :) .  You said chain of pointers(dereferences I assume),
  e.g. if p is result of consume load, then access to
   p->here->there->next->prev->stuff is supposed to be ordered with that load
  (or only when that last load/store itself is also an atomic load or
  store?).
 
 It's supposed to be ordered wrt the first load (the consuming one), yes.
 
  So, what happens if the pointer deref chain is partly hidden in some
  functions:
 
 No problem.
 
 The thing is, the ordering is actually handled by the CPU in all
 relevant cases.  So the compiler doesn't actually need to *do*
 anything. All this legalistic stuff is just to describe the semantics
 and the guarantees.
 
 The problem is two cases:
 
  (a) alpha (which doesn't really order any accesses at all, not even
 dependent loads), but for a compiler alpha is actually trivial: just
 add a rmb instruction after the load, and you can't really do
 anything else (there's a few optimizations you can do wrt the rmb, but
 they are very specific and simple).
 
 So (a) is a problem, but the solution is actually really simple, and
 gives very *strong* guarantees: on alpha, a consume ends up being
 basically the same as a read barrier after the load, with only very
 minimal room for optimization.
 
  (b) ARM and powerpc and similar architectures, that guarantee the
 data dependency as long as it is an *actual* data dependency, and
 never becomes a control dependency.
 
 On ARM and powerpc, control dependencies do *not* order accesses (the
 reasons boil down to essentially: branch prediction breaks the
 dependency, and instructions that come after the branch can be happily
 executed before the branch). But it's almost impossible to describe
 that in the standard, since compilers can (and very much do) turn a
 control dependency into a data dependency and vice versa.
 
 So the current standard tries to describe that control vs data
 dependency, and tries to limit it to a data dependency. It fails. It
 fails for multiple reasons - it doesn't allow for trivial
 optimizations that just remove the data dependency, and it also
 doesn't allow for various trivial cases where the compiler *does* turn
 the data dependency into a control dependency.
 
 So I really really think that the current C standard language is
 broken. Unfixably broken.
 
 I'm trying to change the syntactic data dependency that the current
 standard uses into something that is clearer and correct.
 
 The chain of pointers thing is still obviously a data dependency,
 but by limiting it to pointers, it simplifies the language, clarifies
 the meaning, avoids all syntactic tricks (ie p-p is clearly a
 syntactic dependency on p, but does *not* involve in any way
 following the pointer) and makes it basically impossible for the
 compiler to break the dependency without doing value prediction, and
 since value prediction has to be disallowed anyway, that's a feature,
 not a bug.

OK, good point, please ignore my added thirteenth item in the list.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-24 Thread Paul E. McKenney
On Mon, Feb 24, 2014 at 10:14:01AM -0800, Linus Torvalds wrote:
 On Mon, Feb 24, 2014 at 9:21 AM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
 
  4.  Bitwise operators (&, |, ^, and I suppose also ~)
  applied to a chained pointer and an integer results in another
  chained pointer in that same pointer chain.
 
 No. You cannot define it this way. Taking the value of a pointer and
 doing a bitwise operation that throws away all the bits (or even
 *most* of the bits) results in the compiler easily being able to turn
 the chain into a non-chain.
 
  The obvious example being val & 0, but things like val & 1 are in
 practice also something that compilers easily turn into control
 dependencies instead of data dependencies.

Indeed, most of the bits need to remain for this to work.

 So you can talk about things like aligning the pointer value to
 object boundaries etc, but it really cannot and *must* not be about
 the syntactic operations.
 
 The same goes for adding and subtracting an integer. The *syntax*
 doesn't matter. It's about remaining information. Doing p-(int)p or
 p+(-(int)p) doesn't leave any information despite being subtracting
 and adding an integer at a syntactic level.
 
 Syntax is meaningless. Really.

Good points.  How about the following replacements?

3.  Adding or subtracting an integer to/from a chained pointer
results in another chained pointer in that same pointer chain.
The results of addition and subtraction operations that cancel
the chained pointer's value (for example, p-(long)p where p
is a pointer to char) are implementation defined.

4.  Bitwise operators (&, |, ^, and I suppose also ~)
applied to a chained pointer and an integer for the purposes
of alignment and pointer translation results in another
chained pointer in that same pointer chain.  Other uses
of bitwise operators on chained pointers (for example,
p|~0) are implementation defined.
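As a concrete illustration of rules 3 and 4, masking a chained pointer down to an alignment boundary keeps it in the chain, while arithmetic that cancels its value does not. A C11 sketch (the buffer and offsets are made up):

```c
#include <stdatomic.h>
#include <stdint.h>

_Alignas(8) char buf[16];
_Atomic(char *) gp = &buf[8];

char *reader(void)
{
    char *p = atomic_load_explicit(&gp, memory_order_consume);
    /* Rule 4: masking off low-order alignment bits preserves most of the
     * value, so the result stays in the same pointer chain. */
    char *aligned = (char *)((uintptr_t)p & ~(uintptr_t)0x7);
    /* Rule 3's caveat: an expression like "p - (long)p" cancels the value
     * entirely, so its ordering effects would be implementation defined. */
    return aligned;
}
```

Because buf is 8-byte aligned, masking &buf[8] is an identity operation here; the interesting part is only which expressions keep the "chained" designation.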

  8.  Applying any of the following operators to a chained pointer
  results in something that is not a chained pointer:
  (), sizeof, !, *, /, %, <<, >>, <, >, <=,
  >=, ==, !=, &&, and ||.
 
 Parenthesis? I'm assuming that you mean calling through the chained pointer.

Yes, good point.  Of course, parentheses for grouping just pass the
value through without affecting the chained-ness.

 Also, I think all of /, * and % are perfectly fine, and might be used
 for that aligning the pointer operation that is fine.

Something like this?

char *p;

p = p - (unsigned long)p % 8;

I was thinking of this as subtraction -- the p gets moduloed by 8,
which loses the chained-pointer designation.  But that is OK because
that designation gets folded back in by the subtraction.  Am I missing
a use case?

That leaves things like this one:

p = (p / 8) * 8;

I cannot think of any other legitimate use for / and *.

Here is an updated #8 and a new 8a:

8.  Applying any of the following operators to a chained pointer
results in something that is not a chained pointer: function call
(), sizeof, !, %, <<, >>, <, >, <=, >=, ==,
!=, &&, ||, and kill_dependency().

8a. Dividing a chained pointer by an integer and multiplying it
by that same integer (for example, to align that pointer) results
in a chained pointer in that same pointer chain.  The ordering
effects of other uses of infix * and / on chained pointers
are implementation defined.

Does that capture it?

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-24 Thread Paul E. McKenney
On Mon, Feb 24, 2014 at 03:35:04PM -0800, Linus Torvalds wrote:
 On Mon, Feb 24, 2014 at 2:37 PM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
 
  What if the nothing modifies 'p' part looks like this:
 
  if (p != &myvariable)
  return;
 
  and now any sane compiler will happily optimize q = *p into q =
  myvariable, and we're all done - nothing invalid was ever
 
  Yes, the compiler could do that.  But it would still be required to
  carry a dependency from the memory_order_consume read to the *p,
 
 But that's *BS*. You didn't actually listen to the main issue.
 
 Paul, why do you insist on this carries-a-dependency crap?

Sigh.  Read on...

 It's broken. If you don't believe me, then believe the compiler person
 who already piped up and told you so.
 
 The carries a dependency model is broken. Get over it.
 
 No sane compiler will ever distinguish two different registers that
 have the same value from each other. No sane compiler will ever say
 ok, register r1 has the exact same value as register r2, but r2
 carries the dependency, so I need to make sure to pass r2 to that
 function or use it as a base pointer.
 
 And nobody sane should *expect* a compiler to distinguish two
 registers with the same value that way.
 
 So the whole model is broken.
 
 I gave an alternate model (the restrict), and you didn't seem to
 understand the really fundamental difference. It's not a language
 difference, it's a conceptual difference.
 
  In the broken carries a dependency model, you have to fight all those
 aliases that can have the same value, and it is not a fight you can
  win. We've had the p-p examples, we've had the p&0 examples, but
  the fact is, that p==&myvariable example IS EXACTLY THE SAME THING.
 
  All three of those things: p-p, p&0, and p==&myvariable mean
 that any compiler worth its salt now know that p carries no
 information, and will optimize it away.
 
 So please stop arguing against that. Whenever you argue against that
 simple fact, you are arguing against sane compilers.

So let me see if I understand your reasoning.  My best guess is that it
goes something like this:

1.  The Linux kernel contains code that passes pointers from
rcu_dereference() through external functions.

2.  Code in the Linux kernel expects the normal RCU ordering
guarantees to be in effect even when external functions are
involved.

3.  When compiling one of these external functions, the C compiler
has no way of knowing about these RCU ordering guarantees.

4.  The C compiler might therefore apply any and all optimizations
to these external functions.

5.  This in turn implies that the only way to prohibit any given
optimization from being applied to the results obtained from
rcu_dereference() is to prohibit that optimization globally.

6.  We have to be very careful what optimizations are globally
prohibited, because a poor choice could result in unacceptable
performance degradation.

7.  Therefore, the only operations that can be counted on to
maintain the needed RCU orderings are those where the compiler
really doesn't have any choice, in other words, where any
reasonable way of computing the result will necessarily maintain
the needed ordering.

Did I get this right, or am I confused?

 So *accept* the fact that some operations (and I guarantee that there
 are more of those than you can think of, and you can create them with
 various tricks using pretty much *any* feature in the C language)
 essentially take the data information away. And just accept the fact
 that then the ordering goes away too.

Actually, the fact that there are more potential optimizations than I can
think of is a big reason for my insistence on the carries-a-dependency
crap.  My lack of optimization omniscience makes me very nervous about
relying on there never ever being a reasonable way of computing a given
result without preserving the ordering.

 So give up on carries a dependency. Because there will be cases
 where that dependency *isn't* carried.
 
 The language of the standard needs to get *away* from the broken
 model, because otherwise the standard is broken.
 
 I suggest we instead talk about litmus tests and why certain code
 sequences are ordered, and others are not.

OK...

 So the code sequence I already mentioned is *not* ordered:
 
 Litmus test 1:
 
 p = atomic_read(pp, consume);
  if (p == &variable)
  return p->val;
 
is *NOT* ordered, because the compiler can trivially turn this into
 return variable.val, and break the data dependency.

Right, given your model, the compiler is free to produce code that
doesn't order the load from pp against the load from p->val.

This is true *regardless* of any carries a dependency language,
 because that language is insane, and doesn't work when the different
 pieces here may be in different compilation units.

Indeed

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-23 Thread Paul E. McKenney
On Sat, Feb 22, 2014 at 07:50:35PM -0800, Linus Torvalds wrote:
 On Sat, Feb 22, 2014 at 4:39 PM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
 
  Agreed, by far the most frequent use is -> to dereference and assignment
  to store into a local variable.  The other operations where the kernel
  expects ordering to be maintained are:
 
  o   Bitwise & to strip off low-order bits.  The FIB tree does
  this, for example in fib_table_lookup() in net/ipv4/fib_trie.c.
  The low-order bit is used to distinguish internal nodes from
  leaves -- nodes and leaves are different types of structures.
  (There are a few others.)
 
 Note that this is very much outside the scope of the C standard,
 regardless of 'consume' or not.
 
 We'll always do things outside the standard, so I wouldn't worry.

Fair enough.  I am going to make examples to try to make sure that we
aren't using the same words for two different meanings:

struct foo restrict *p;

p = atomic_load_explicit(&gp, memory_order_consume);
p &= ~0x3;  /* OK because p has restrict property. */
return p->a;  /* Ordered WRT above load. */

Of course, the compiler would have to work pretty hard to break this
ordering, so I am with you in not worrying too much about this one.

  o   Uses ?: to substitute defaults in case of NULL pointers,
  but ordering must be maintained in the non-default case.
  Most, perhaps all, of these could be converted to if should
  ?: prove problematic.
 
 Note that this doesn't actually affect the restrict/ordering rule in
 theory: ?: isn't special according to those rules. The rules are
 fairly simple: we guarantee ordering only to the object that the
 pointer points to, and even that guarantee goes out the window if
 there is some *other* way to reach the object.
 
 ?: is not really relevant, except in the sense that *any* expression
 that ends up pointing to outside the object will lose the ordering
 guarantee. ?: can be one such expression, but so can p-p or anything
 like that.
 
 And in *practice*, the only thing that needs to be sure to generate
 special code is alpha, and there you'd just add the rmb after the
 load. That is sufficient to fulfill the guarantees.

OK, seems reasonable.  Reworking the earlier example:

struct foo restrict *p;

p = atomic_load_explicit(&gp, memory_order_consume);
p = p ? p : default_foo;
return p->a;  /* Ordered WRT above load if p non-NULL. */

And the ordering makes sense only in the non-NULL case anyway,
so this should be fine.

 On ARM and powerpc, the compiler obviously has to guarantee that it
 doesn't do value-speculation on the result, but again, that never
 really had anything to do with the whole carries a dependency, it is
 really all about the fact that in order to guarantee the ordering, the
 compiler mustn't generate that magical aliased pointer value. But if
 the aliased pointer value comes from the *source* code, all bets are
 off.

Agreed, compiler-based value speculation is dangerous in any case.
(Though the branch-predictor-based trick seems like it should be safe
on TSO systems like x86, s390, etc.)

 Now, even on alpha, the compiler can obviously move that rmb around.
 For example, if there is a conditional after the
 atomic_read(mo_consume), and the compiler can tell that the pointer
 that got read by mo_consume is dead along one branch, then the
 compiler can move the rmb to only exist in the other branch. Why?
 Because we inherently guarantee only the order to any accesses to the
 object the pointer pointed to, and that the pointer that got loaded is
 the *only* way to get to that object (in this context), so if the
 value is dead, then so is the ordering.

Yep!

 In fact, even if the value is *not* dead, but it is NULL, the compiler
 can validly say the NULL pointer cannot point to any object, so I
 don't have to guarantee any serialization. So code like this (writing
 alpha assembly, since in practice only alpha will ever care):
 
 ptr = atomic_read(pp, mo_consume);
 if (ptr) {
... do something with ptr ..
 }
 return ptr;
 
 can validly be translated to:
 
 ldq $1,0($2)
 beq $1,branch-over
 rmb
 .. the do-something code using register $1 ..
 
 because the compiler knows that a NULL pointer cannot be dereferenced,
 so it can decide to put the rmb in the non-NULL path - even though the
 pointer value is still *live* in the other branch (well, the liveness
 of a constant value is somewhat debatable, but you get the idea), and
 may be used by the caller (but since it is NULL, the use can not
 include accessing any object, only really testing)

Agreed.
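For concreteness, the NULL-check pattern being discussed can be sketched in portable C11. The structure and function names here are invented for illustration; the point is that on the NULL path there is nothing to dereference, so an Alpha compiler would be free to emit its rmb only on the non-NULL branch.

```c
#include <stdatomic.h>
#include <stddef.h>

struct obj { int data; };

static struct obj the_obj = { .data = 42 };
static _Atomic(struct obj *) gp;   /* shared pointer, initially NULL */

/* Reader: the consume load orders the later ptr->data access via the
 * data dependency.  On the NULL path there is no object to access,
 * so no ordering is needed -- or implied. */
static int reader(void)
{
    struct obj *ptr = atomic_load_explicit(&gp, memory_order_consume);

    if (ptr)
        return ptr->data;   /* ordered after the load of gp */
    return -1;              /* NULL: no object, no ordering */
}

/* Writer: initialize the object, then publish with release semantics. */
static void publish(void)
{
    atomic_store_explicit(&gp, &the_obj, memory_order_release);
}
```

Single-threaded, this just demonstrates the API shape; the ordering claim only matters when reader() and publish() run concurrently.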

 So note how this is actually very different from the carries
 dependency rule. It's simpler, and it allows much more natural
 optimizations.

I do have a couple more questions about it, but please see below.

  o   Addition and subtraction to adjust both

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-23 Thread Paul E. McKenney
On Sun, Feb 23, 2014 at 11:31:25AM -0800, Linus Torvalds wrote:
 On Sat, Feb 22, 2014 at 10:34 PM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
 
  Adding and subtracting integers to/from a RCU-protected pointer makes
  sense to me.
 
 Ack. And that's normal access to an object behavior anyway.

And covers things like container_of() as well as array indexing and
field selection, so good.
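A minimal sketch of ordering surviving a container_of()-style offset adjustment. The macro mirrors the Linux kernel's, but the structures and names here are invented, not kernel code: subtracting a constant offset stays within the same object, so the dependency from the consume load carries through.

```c
#include <stdatomic.h>
#include <stddef.h>

struct inner { int x; };
struct outer { int tag; struct inner in; };

/* Offset arithmetic in the style of the kernel's container_of(). */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

static struct outer o = { .tag = 7, .in = { .x = 5 } };
static _Atomic(struct inner *) gip;  /* published pointer to the member */

static void publish_inner(void)
{
    atomic_store_explicit(&gip, &o.in, memory_order_release);
}

/* The container_of() result is computed from the consume-loaded
 * value, so the ->tag access remains ordered after the load. */
static int read_tag(void)
{
    struct inner *ip = atomic_load_explicit(&gip, memory_order_consume);

    if (!ip)
        return -1;
    return container_of(ip, struct outer, in)->tag;
}
```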

  Adding and subtracting integers to/from an RCU-protected integer makes
  sense in many practical cases, but I worry about the compiler figuring
  out that the RCU-protected integer cancelled with some other integer.
 
 I suspect that in practice, all the same normal rules apply - assuming
 that there aren't any aliasing integers, and that the compiler
 doesn't do any value prediction, the end result *should* be exactly
 the same as for the pointer arithmetic. So even if the standard were
 to talk about just pointers, I suspect that in practice, there really
 isn't anything that can go wrong.

Yep, in practice in absence of the i-i code, it should work.  As long
as the integer returned from the memory_order_consume load is always
kept in an integer tagged as restrict, I believe that it should be
possible to make the standard guarantee it as well.

  I don't feel all that good about subtractions involving an RCU-protected
  pointer and another pointer, again due to the possibility of arithmetic
  optimizations cancelling everything.
 
 Actually, pointer subtraction is a pretty useful operation, even
 without the gotcha case of p-p just to force a fake dependency.
 Getting an index, or indeed just getting an offset within a structure,
 is valid and common, and people will and should do it. It doesn't
 really matter for my suggested language: it should be perfectly fine
 to do something like
 
ptr = atomic_read(pp, mo_consume);
index = ptr - array_base;
.. pass off 'index' to some function, which then re-generates the
 ptr using it ..

I believe that the devil is in the details of the regeneration of the
pointer.  Yet another example below.

 and the compiler will have no trouble generating code, and the
 suggested ordered wrt that object guarantee is that the eventual
 ordering is between the pointer load and the use in the function,
 regardless of how it got there (ie turning it into an index and back
 is perfectly fine).
 
 So both from a legalistic language wording standpoint, and from a
 compiler code generation standpoint, this is a non-issue.
 
 Now, if you actually lose information (ie some chain there drops
 enough data from the pointer that it cannot be recovered, partially of
 fully), and then regenerate the object value by faking it, and still
 end up accessing the right data, but without actually going through
 any of the the pointer that you loaded, that falls under the
 restricted heading, and you must clearly at least partially have
 used other information. In which case the standard wording wouldn't
 guarantee anything at all.

I agree in general, but believe that the pointer regeneration needs
to be done with care.  If you are adding the difference back into
the original variable ptr read from the RCU-protected pointer, I have
no problem with it.
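A sketch of the index round trip, with invented names: under the proposed "ordered wrt that object" semantics, the final access would be ordered after the pointer load regardless of the detour through the index, while under the current standard it relies on the dependency chain running through index.

```c
#include <stdatomic.h>
#include <stddef.h>

struct elem { int val; };

static struct elem table[8] = { [3] = { .val = 42 } };
static _Atomic(struct elem *) pp;  /* points into table once published */

static void publish_elem(void)
{
    atomic_store_explicit(&pp, &table[3], memory_order_release);
}

/* Derive an index from the consume-loaded pointer, then regenerate
 * the pointer from it.  The index is data-dependent on ptr, so the
 * final ->val read is, by the standard's chains, still ordered. */
static int read_via_index(void)
{
    struct elem *ptr = atomic_load_explicit(&pp, memory_order_consume);
    ptrdiff_t index;

    if (!ptr)
        return -1;
    index = ptr - table;        /* hand off an index... */
    return table[index].val;    /* ...and regenerate the pointer */
}
```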

My concern is with more elaborate cases.  For example:

struct foo g1[N];
struct bar g2[N];
struct foo *restrict p1;
struct bar *p2;
int index;

p1 = atomic_read_explicit(&gp1, memory_order_consume);
index = (p1 - g1) / 2;  /* Getting this into the BS syntax realm. */
p2 = g2 + index;

The standard as currently worded would force ordering between the read
into p1 and the read into p2.  But it does so via dependency chains,
so this is starting to feel to me like we are getting back into the
business of tracing dependency chains.

Of course, one can argue that parallel arrays are not all that useful
in the Linux kernel, and one can argue that dividing indexes by two is
beyond the pale.  (Although heaps in dense arrays really do this sort of
index arithmetic, I am having some trouble imagining why RCU protection
would be useful in that case.  Yes, I could make up something about
replacing one heap locklessly with another, but...)

I might well be in my usual overly paranoid mode, but this is the sort
of thing that makes me a bit nervous.  I would rather either have it
bulletproof safe, or to say don't do that.

Hmmm...  I don't recall seeing a use case for subtraction involving
an RCU-protected pointer and another pointer when I looked through
all the rcu_dereference() invocations in the Linux kernel.  Is there
a use case that I am missing?

  I don't see that you would need to protect anything but the first
  read, so I think you need rcu_dereference() only on the initial
  pointer access.
 
  Let me try an example:
 
  struct bar {
  int a;
  unsigned b;
  };
 
  struct foo {
  struct bar *next

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-23 Thread Paul E. McKenney
On Sun, Feb 23, 2014 at 05:35:28PM -0800, Linus Torvalds wrote:
 On Sun, Feb 23, 2014 at 5:16 PM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
 
   (a) we've said 'q' is restricted, so there is no aliasing between q
  and the pointers b/c. So the compiler is free to move those accesses
  around the q = p->next access.
 
  Ah, if I understand you, very good!
 
  My example intentionally left q -not- restricted.
 
 No, I 100% agree with that. q is *not* restricted. But p is, since
 it came from that consuming load.
 
 But q = p->next is ordered by how something can alias p->next, not by 'q'!
 
 There is no need to restrict anything but 'p' for all of this to work.

I cannot say I understand this last sentence right now from the viewpoint
of the standard, but suspending disbelief for the moment...

(And yes, given current compilers and CPUs, I agree that this should
all work in practice.  My concern is the legality, not the reality.)

 Btw, it's also worth pointing out that I do *not* in any way expect
 people to actually write the restrict keyword anywhere. So no need
 to change source code.

Understood -- in this variant, you are taking the marking from the
fact that there was an assignment from a memory_order_consume load
rather than from a keyword on the assigned-to variable's declaration.

 What you have is a situation where the pointer coming out of the
 memory_order_consume is restricted. But if you assign it to a
 non-restricted pointer, that's *fine*. That's perfectly normal C
 behavior. The restrict concept is not something that the programmer
 needs to worry about or ever even notice, it's basically just a
 promise to the compiler that if somebody has another pointer lying
 around, accesses though that other pointer do not require ordering.
 
 So it sounds like you believe that the programmer would mark things
 restrict, and I did not mean that at all.

Indeed I did believe that.

I must confess that I was looking for an easy way to express in
standardese -exactly- where the ordering guarantee did and did
not propagate.

The thing is that the vast majority of the Linux-kernel RCU code is more
than happy with the guarantee only applying to fetches via the pointer
returned from the memory_order_consume load.  There are relatively few
places where groups of structures are made visible to RCU readers via
a single rcu_assign_pointer().  I guess I need to actually count them.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-22 Thread Paul E. McKenney
On Sat, Feb 22, 2014 at 07:30:37PM +0100, Torvald Riegel wrote:
 On Thu, 2014-02-20 at 10:18 -0800, Paul E. McKenney wrote:
  On Thu, Feb 20, 2014 at 06:26:08PM +0100, Torvald Riegel wrote:
   On Wed, 2014-02-19 at 20:01 -0800, Paul E. McKenney wrote:
On Wed, Feb 19, 2014 at 04:53:49PM -0800, Linus Torvalds wrote:
 On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel trie...@redhat.com 
 wrote:
  On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote:
 
  Can you point to it? Because I can find a draft standard, and it 
  sure
  as hell does *not* contain any clarity of the model. It has a 
  *lot* of
  verbiage, but it's pretty much impossible to actually understand, 
  even
  for somebody who really understands memory ordering.
 
  http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
  This has an explanation of the model up front, and then the detailed
  formulae in Section 6.  This is from 2010, and there might have been
  smaller changes since then, but I'm not aware of any bigger ones.
 
 Ahh, this is different from what others pointed at. Same people,
 similar name, but not the same paper.
 
 I will read this version too, but from reading the other one and the
 standard in parallel and trying to make sense of it, it seems that I
 may have originally misunderstood part of the whole control dependency
 chain.
 
 The fact that the left side of ? :, && and || breaks data
 dependencies made me originally think that the standard tried very
 hard to break any control dependencies. Which I felt was insane, when
 then some of the examples literally were about the testing of the
 value of an atomic read. The data dependency matters quite a bit. The
 fact that the other Mathematical paper then very much talked about
 consume only in the sense of following a pointer made me think so even
 more.
 
 But reading it some more, I now think that the whole data dependency
 logic (which is where the special left-hand side rule of the ternary
 and logical operators come in) are basically an exception to the rule
 that sequence points end up being also meaningful for ordering (ok, so
 C11 seems to have renamed sequence points to sequenced before).
 
 So while an expression like
 
 atomic_read(p, consume) ? a : b;
 
 doesn't have a data dependency from the atomic read that forces
 serialization, writing
 
if (atomic_read(p, consume))
   a;
else
   b;
 
 the standard *does* imply that the atomic read is happens-before wrt
 a, and I'm hoping that there is no question that the control
 dependency still acts as an ordering point.

The control dependency should order subsequent stores, at least assuming
that a and b don't start off with identical stores that the compiler
could pull out of the if and merge.  The same might also be true for ?:
for all I know.  (But see below)
   
   I don't think this is quite true.  I agree that a conditional store will
   not be executed speculatively (note that if it would happen in both the
   then and the else branch, it's not conditional); so, the store in
   a; (assuming it would be a store) won't happen unless the thread can
   really observe a true value for p.  However, this is *this thread's*
   view of the world, but not guaranteed to constrain how any other thread
   sees the state.  mo_consume does not contribute to
   inter-thread-happens-before in the same way that mo_acquire does (which
   *does* put a constraint on i-t-h-b, and thus enforces a global
   constraint that all threads have to respect).
   
   Is it clear which distinction I'm trying to show here?
  
  If you are saying that the control dependencies are a result of a
  combination of the standard and the properties of the hardware that
  Linux runs on, I am with you.  (As opposed to control dependencies being
  a result solely of the standard.)
 
 I'm not quite sure I understand what you mean :)  Do you mean the
 control dependencies in the binary code, or the logical control
 dependencies in source programs?

At present, the intersection of those two sets, but only including those
control dependencies beginning with with a memory_order_consume load or a
[[carries_dependency]] function argument or return value.

Or something like that.  ;-)

  This was a deliberate decision in 2007 or so.  At that time, the
  documentation on CPU memory orderings were pretty crude, and it was
  not clear that all relevant hardware respected control dependencies.
  Back then, if you wanted an authoritative answer even to a fairly simple
  memory

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-22 Thread Paul E. McKenney
On Sat, Feb 22, 2014 at 01:53:30PM -0800, Linus Torvalds wrote:
 On Sat, Feb 22, 2014 at 10:53 AM, Torvald Riegel trie...@redhat.com wrote:
 
  Stating that (1) the standard is wrong and (2) that you think that
  mo_consume semantics are not good is two different things.
 
 I do agree. They are two independent things.
 
 I think the standard is wrong, because it's overly complex, hard to
 understand, and nigh unimplementable. As shown by the bugzilla
 example, carries a dependency encompasses things that are *not* just
 synchronizing things just through a pointer, and as a result it's
 actually very complicated, since they could have been optimized away,
 or done in non-local code that wasn't even aware of the dependency
 carrying.
 
 That said, I'm reconsidering my suggested stricter semantics, because
 for RCU we actually do want to test the resulting pointer against NULL
 _without_ any implied serialization.
 
 So I still feel that the standard as written is fragile and confusing
 (and the bugzilla entry pretty much proves that it is also practically
 unimplementable as written), but strengthening the serialization may
 be the wrong thing.
 
 Within the kernel, the RCU use for this is literally purely about
 loading a pointer, and doing either:
 
  - testing its value against NULL (without any implied synchronization at all)
 
  - using it as a pointer to an object, and expecting that any accesses
 to that object are ordered wrt the consuming load.

Agreed, by far the most frequent use is - to dereference and assignment
to store into a local variable.  The other operations where the kernel
expects ordering to be maintained are:

o   Bitwise & to strip off low-order bits.  The FIB tree does
this, for example in fib_table_lookup() in net/ipv4/fib_trie.c.
The low-order bit is used to distinguish internal nodes from
leaves -- nodes and leaves are different types of structures.
(There are a few others.)

o   Uses ?: to substitute defaults in case of NULL pointers,
but ordering must be maintained in the non-default case.
Most, perhaps all, of these could be converted to if should
?: prove problematic.

o   Addition and subtraction to adjust both pointers to and indexes
into RCU-protected arrays.  There are not that many indexes,
and they could be converted to pointers, but the addition and
subtraction looks necessary in some cases.

o   Array indexing.  The value from rcu_dereference() is used both
before and inside the [], interestingly enough.

o   Casts along with unary & and *.

That said, I did not see any code that depended on ordering through
the function-call (), boolean complement !, comparison (only ==
and !=), logical operators (&& and ||), and the *, /, and %
arithmetic operators.
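The bit-stripping case in the first bullet above might be sketched like this, with invented names in the style of the FIB trie's tagged pointers. The bitwise AND keeps the dependency because the stripped pointer is still computed from the consume-loaded value.

```c
#include <stdatomic.h>
#include <stdint.h>

struct node { int key; };

#define LEAF_BIT 1UL  /* low-order tag bit: leaf vs. internal node */

static struct node leaf = { .key = 9 };
static _Atomic uintptr_t root;  /* tagged pointer, initially 0 */

static void publish_leaf(void)
{
    atomic_store_explicit(&root, (uintptr_t)&leaf | LEAF_BIT,
                          memory_order_release);
}

/* The & strips the tag but is computed from the consume-loaded
 * value, so the following dereference stays ordered after the load. */
static int lookup(void)
{
    uintptr_t v = atomic_load_explicit(&root, memory_order_consume);

    if (!(v & LEAF_BIT))
        return -1;  /* empty tree (or an internal node) */
    return ((struct node *)(v & ~LEAF_BIT))->key;
}
```

The tag bit is available because struct node is at least 2-byte aligned; a real implementation would want a static assertion to that effect.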

 So I actually have a suggested *very* different model that people
 might find more acceptable.
 
 How about saying that the result of a atomic_read(a, mo_consume) is
 required to be a _restricted_ pointer type, and that the consume
 ordering guarantees the ordering between that atomic read and the
 accesses to the object that the pointer points to.
 
 No carries a dependency, no nothing.

In the case of arrays, the object that the pointer points to is
considered to be the full array, right?

 Now, there's two things to note in there:
 
  - the restricted pointer part means that the compiler does not need
 to worry about serialization to that object through other possible
 pointers - we have basically promised that the *only* pointer to that
 object comes from the mo_consume. So that part makes it clear that the
 consume ordering really only is valid wrt that particular pointer
 load.

That could work, though there are some cases where a multi-linked
structure is made visible using a single rcu_assign_pointer(), and
rcu_dereference() is used only for the pointer leading to that
multi-linked structure, not for the pointers among the elements
making up that structure.  One way to handle this would be to
require rcu_dereference() to be used within the structure as well
as upon first traversal to the structure.
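A sketch of that multi-linked case, with invented names: the internal links are filled in before a single release store publishes the whole group, and only the first traversal uses a consume load.

```c
#include <stdatomic.h>
#include <stddef.h>

struct child { int v; };
struct parent { struct child *c; };  /* plain pointer inside the group */

static struct child ch = { .v = 3 };
static struct parent pa;
static _Atomic(struct parent *) groot;  /* the one published pointer */

/* Fill in the internal links first, then publish the whole group
 * with a single release store (the rcu_assign_pointer() analogue). */
static void publish_group(void)
{
    pa.c = &ch;
    atomic_store_explicit(&groot, &pa, memory_order_release);
}

/* Only the first traversal is a consume load; p->c is then reached
 * through the dependency on p.  The restricted-pointer proposal
 * would require marking that inner traversal as well. */
static int read_child(void)
{
    struct parent *p = atomic_load_explicit(&groot, memory_order_consume);

    if (!p)
        return -1;
    return p->c->v;
}
```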

  - the to the object that the pointer points to makes it clear that
 you can't use the pointer to generate arbitrary other values and claim
 to serialize that way.
 
 IOW, with those alternate semantics, that gcc bugzilla example is
 utterly bogus, and a compiler can ignore it, because while it tries to
 synchronize through the dependency chain created with that p-i+i
 expression, that is completely irrelevant when you use the above rules
 instead.
 
 In the bugzilla example, the object that *(p-i+i) accesses isn't
 actually the object pointed to by the pointer, so no serialization is
 implied. And if it actually *were* to be the same object, because p
 happens to have the same value as i, then the restrict part of the
 rule pops up and the compiler can again say that there is no ordering
 guarantee, since 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-21 Thread Paul E. McKenney
On Fri, Feb 21, 2014 at 07:35:37PM +0100, Michael Matz wrote:
 Hi,
 
 On Thu, 20 Feb 2014, Linus Torvalds wrote:
 
  But I'm pretty sure that any compiler guy must *hate* that current odd
  dependency-generation part, and if I was a gcc person, seeing that
  bugzilla entry Torvald pointed at, I would personally want to
  dismember somebody with a rusty spoon..
 
 Yes.  Defining dependency chains in the way the standard currently seems 
 to do must come from people not writing compilers.  There's simply no 
 sensible way to implement it without being really conservative, because 
 the depchains can contain arbitrary constructs including stores, 
 loads and function calls but must still be observed.  
 
 And with conservative I mean everything is a source of a dependency, and 
 hence can't be removed, reordered or otherwise fiddled with, and that 
 includes code sequences where no atomic objects are anywhere in sight [1].
 In the light of that the only realistic way (meaning to not have to 
 disable optimization everywhere) to implement consume as currently 
 specified is to map it to acquire.  At which point it becomes pointless.

No, only memory_order_consume loads and [[carries_dependency]]
function arguments are sources of dependency chains.

  So I suspect there are a number of people who would be *more* than
  happy with a change to those odd dependency rules.
 
 I can't say much about your actual discussion related to semantics of 
 atomics, not my turf.  But the carries a dependency relation is not 
 usefully implementable.
 
 
 Ciao,
 Michael.
 [1] Simple example of what type of transformations would be disallowed:
 
 int getzero (int i) { return i - i; }

This needs to be as follows:

[[carries_dependency]] int getzero(int i [[carries_dependency]])
{
return i - i;
}

Otherwise dependencies won't get carried through it.

 Should be optimizable to return 0;, right?  Not with carries a 
 dependency in place:
 
 int jeez (int idx) {
   int i = atomic_load(idx, memory_order_consume); // A
   int j = getzero (i);// B
   return array[j];// C
 }
 
 As I read carries a dependency there's a dependency from A to C. 
 Now suppose we would optimize getzero in the obvious way, then inline, and 
 boom, dependency gone.  So we wouldn't be able to optimize any function 
 when we don't control all its users, for fear that it _might_ be used in 
 some dependency chain where it then matters that we possibly removed some 
 chain elements due to the transformation.  We would have to retain 'i-i' 
 before inlining, and if the function then is inlined into a context where 
 depchains don't matter, could _then_ optmize it to zero.  But that's 
 insane, especially considering that it's hard to detect if a given context 
 doesn't care for depchains, after all the depchain relation is constructed 
 exactly so that it bleeds into nearly everywhere.  So we would most of 
 the time have to assume that the ultimate context will be depchain-aware 
 and therefore disable many transformations.

Any function that does not contain a memory_order_consume load and that
doesn't have any arguments marked [[carries_dependency]] can be optimized
just as before.

 There'd be one solution to the above, we would have to invent some special 
 operands and markers that explicitely model carries-a-dep, ala this:
 
 int getzero (int i) {
   #RETURN.dep = i.dep
   return 0;
 }

The above is already handled by the [[carries_dependency]] attribute,
see above.

 int jeez (int idx) {
   # i.dep = idx.dep
   int i = atomic_load(idx, memory_order_consume); // A
   # j.dep = i.dep
   int j = getzero (i);// B
   # RETURN.dep = j.dep + array.dep
   return array[j];// C
 }
 
 Then inlining getzero would merely add another # j.dep = i.dep relation, 
 so depchains are still there but the value optimization can happen before 
 inlining.  Having to do something like that I'd find disgusting, and 
 rather rewrite consume into acquire :)  Or make the depchain relation 
 somehow realistically implementable.

I was actually OK with arithmetic cancellation breaking the dependency
chains.  Others on the committee felt otherwise, and I figured that
(1) I wouldn't be writing that kind of function anyway and (2) they
knew more about writing compilers than I.  I would still be OK saying
that things like i-i, i*0, i%1, i&0, i|~0 and so on just
break the dependency chain.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-21 Thread Paul E. McKenney
On Fri, Feb 21, 2014 at 06:28:05PM +, Peter Sewell wrote:
 On 20 February 2014 17:01, Linus Torvalds torva...@linux-foundation.org
 wrote:

[ . . . ]

   So, if you make one of two changes to your example, then I will agree
   with you.
 
  No. We're not playing games here. I'm fed up with complex examples
  that make no sense.
 
  Nobody sane writes code that does that pointer comparison, and it is
  entirely immaterial what the compiler can do behind our backs. The C
  standard semantics need to make sense to the *user* (ie programmer),
  not to a CPU and not to a compiler. The CPU and compiler are tools.
  They don't matter. Their only job is to make the code *work*, dammit.
 
  So no idiotic made-up examples that involve code that nobody will ever
  write and that have subtle issues.
 
  So the starting point is that (same example as before, but with even
  clearer naming):
 
  Initialization state:
initialized = 0;
value = 0;
 
  Consumer:
 
  return atomic_read(&initialized, consume) ? value : -1;
 
  Writer:
  value = 42;
  atomic_write(&initialized, 1, release);
 
  and because the C memory ordering standard is written in such a way
  that this is subtly buggy (and can return 0, which is *not* logically
  a valid value), then I think the C memory ordering standard is broken.
 
  That consumer memory ordering is dangerous as hell, and it is
  dangerous FOR NO GOOD REASON.
 
  The trivial fix to the standard would be to get rid of all the
  carries a dependency crap, and just say that *anything* that depends
  on it is ordered wrt it.
 
 
 There are two difficulties with this, if I understand correctly what
 you're proposing.
 
 The first is knowing where to stop.  If one includes all data and
 control dependencies, pretty much all the continuation of execution
 would depend on the consume read, so the compiler would eventually
 have to give up and insert a gratuitous barrier.  One might imagine
 somehow annotating the return_expensive_system_value() you have below
 to say stop dependency tracking at the return (thereby perhaps
 enabling the compiler to optimise the barrier that you'd need in h/w
 to order the Linus-consume-read of initialised and the non-atomic read
 of calculated, replacing it by a compiler-introduced artificial
 dependency), and indeed that's roughly what the standard's
 kill_dependency does for consumes.

One way to tell the compiler where to stop would be to place markers
in the source code saying where dependencies stop.  These markers
could be provided by the definitions of the current rcu_read_unlock()
tags in the Linux kernel (and elsewhere, for that matter).  These would
be overridden by [[carries_dependency]] tags on function arguments and
return values, which is needed to handle the possibility of nested RCU
read-side critical sections.
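C11 already provides an explicit stop marker in kill_dependency(); the suggestion above is essentially that rcu_read_unlock() could play the same role implicitly. A minimal sketch (the variable and helper names are invented):

```c
#include <stdatomic.h>

static _Atomic int gi;

static void set_gi(int x)
{
    atomic_store_explicit(&gi, x, memory_order_release);
}

/* kill_dependency() (C11 7.17.3.1) returns its argument's value but
 * ends the dependency chain rooted at the consume load, telling the
 * compiler it need not preserve the chain past this point. */
static int consume_then_kill(void)
{
    int v = atomic_load_explicit(&gi, memory_order_consume);

    return kill_dependency(v);
}
```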

 The second is the proposal in later mails to use some notion of
 semantic dependency instead of this syntactic one.  That's maybe
 attractive at first sight, but rather hard to define in a decent way
 in general.  To define whether the consume load can really affect
 some subsequent value, you need to know about all the set of possible
 executions of the program - which is exactly what we have to define.
 
 For syntactic dependencies, in contrast, you can at least tell whether
 they exist by examining the source code you have in front of you.  The
 fact that artificial dependencies like (x + y-y) are significant is
 (I guess) basically incidental at the C level - sometimes things like
 this are the best idiom to enforce ordering at the assembly level, but
 in C I imagine they won't normally arise.  If they do, it might be
 nicer to have a more informative syntax, eg (x + dependency(y)).

This was in fact one of the arguments put forward in favor of carrying
dependencies through things like y-y back in the 2007-8 timeframe.

Can't say that I am much of a fan of manually tagging all dependencies:
Machines are much better at that sort of thing than are humans.
But just out of curiosity, did you instead mean (x + dependency(y-y))
or some such?

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-21 Thread Paul E. McKenney
On Fri, Feb 21, 2014 at 10:10:54PM +, Joseph S. Myers wrote:
 On Fri, 21 Feb 2014, Paul E. McKenney wrote:
 
  This needs to be as follows:
  
  [[carries_dependency]] int getzero(int i [[carries_dependency]])
  {
  return i - i;
  }
  
  Otherwise dependencies won't get carried through it.
 
 C11 doesn't have attributes at all (and no specification regarding calls 
 and dependencies that I can see).  And the way I read the C++11 
 specification of carries_dependency is that specifying carries_dependency 
 is purely about increasing optimization of the caller: that if it isn't 
 specified, then the caller doesn't know what dependencies might be 
 carried.  Note: The carries_dependency attribute does not change the 
 meaning of the program, but may result in generation of more efficient 
 code. - end note.

Good point -- I am so used to them being in gcc that I missed that.

In which case, it seems to me that straight C11 is within its rights
to emit a memory barrier just before control passes into a function
that either it can't see or that it chose to apply dependency-breaking
optimizations to.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Paul E. McKenney
On Wed, Feb 19, 2014 at 08:43:14PM -0800, Linus Torvalds wrote:
 On Wed, Feb 19, 2014 at 8:01 PM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
 
  The control dependency should order subsequent stores, at least assuming
  that a and b don't start off with identical stores that the compiler
  could pull out of the if and merge.  The same might also be true for ?:
  for all I know.  (But see below)
 
 Stores I don't worry about so much because
 
  (a) you can't sanely move stores up in a compiler anyway
  (b) no sane CPU or compiler moves stores up, since they aren't on the critical path
 
 so a read-cmp-store is actually really hard to make anything sane
 re-order. I'm sure it can be done, and I'm sure it's stupid as hell.
 
 But that it's hard to screw up is *not* true for a load-cmp-load.
 
 So lets make this really simple: if you have a consume-cmp-read, is
 the ordering of the two reads guaranteed?

Not as far as I know.  Also, as far as I know, there is no difference
between consume and relaxed in the consume-cmp-read case.

  As long as there is an unbroken chain of -data- dependencies from the
  consume to the later access in question, and as long as that chain
  doesn't go through the excluded operations, yes.
 
 So let's make it *really* specific, and make it real code doing a real
 operation, that is actually realistic and reasonable in a threaded
 environment, and may even be in some critical code.
 
 The issue is the read-side ordering guarantee for 'a' and 'b', for this case:
 
  - Initial state:
 
a = b = 0;
 
  - Thread 1 (consumer):
 
 if (atomic_read(&a, consume))
  return b;
 /* not yet initialized */
 return -1;
 
  - Thread 2 (initializer):
 
  b = some_value_lets_say_42;
  /* We are now ready to party */
  atomic_write(&a, 1, release);
 
 and quite frankly, if there is no ordering guarantee between the read
 of a and the read of b in the consumer thread, then the C atomics
 standard is broken.
 
 Put another way: I claim that if thread 1 ever sees a return value
 other than -1 or 42, then the whole definition of atomics is broken.

The above example can have a return value of 0 if translated
straightforwardly into either ARM or Power, right?  Both of these can
speculate a read into a conditional, and both can translate a consume
load into a plain load if data dependencies remain unbroken.

So, if you make one of two changes to your example, then I will agree
with you.  The first change is to have a real data dependency between
the read of a and the second read:

 - Initial state:

   a = &c, b = 0; c = 0;

 - Thread 1 (consumer):

if (atomic_read(&a, consume))
 return *a;
/* not yet initialized */
return -1;

 - Thread 2 (initializer):

 b = some_value_lets_say_42;
 /* We are now ready to party */
 atomic_write(&a, &b, release);

The second change is to make the consume be an acquire:

 - Initial state:

   a = b = 0;

 - Thread 1 (consumer):

if (atomic_read(&a, acquire))
 return b;
/* not yet initialized */
return -1;

 - Thread 2 (initializer):

 b = some_value_lets_say_42;
 /* We are now ready to party */
 atomic_write(&a, 1, release);

In theory, you could also change the return to a store, but the example
gets a bit complicated and as far as I can tell you get into the state
where the standard does not explicitly support it, even though I have
a hard time imagining an actual implementation that fails to support it.

 Question 2: and what changes if the atomic_read() is turned into an
 acquire, and why? Does it start working?

Yep, that is the second change above.

  Neither has a data-dependency guarantee, because there is no data
  dependency from the load to either a or b.  After all, the value
  loaded got absorbed into the if condition.  However, according to
  discussions earlier in this thread, the if variant would have a
  control-dependency ordering guarantee for any stores in a and b
  (but not loads!).
 
 So exactly what part of the standard allows the loads to be
 re-ordered, and why? Quite frankly, I'd think that any sane person
 will agree that the above code snippet is realistic, and that my
 requirement that thread 1 sees either -1 or 42 is valid.

Unless I am really confused, weakly ordered systems would be just fine
returning zero in your original example.  In the case of ARM, I believe
you need either a data dependency, an ISB after the branch, or a DMB
instruction to force the ordering you want.  In the case of PowerPC, I
believe that you need either a data dependency, an isync after the branch,
an lwsync, or a sync.  I would not expect a C compiler to generate code
with any of these precautions in place.

 And if the C standards body has said that control dependencies break
 the read ordering, then I really think that the C standards committee
 has screwed up.

The control dependencies don't break the read ordering, but rather, they
are insufficient to preserve the read ordering.

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Paul E. McKenney
On Thu, Feb 20, 2014 at 12:30:32AM -0800, Paul E. McKenney wrote:
 On Wed, Feb 19, 2014 at 08:43:14PM -0800, Linus Torvalds wrote:

[ . . . ]

 So, if you make one of two changes to your example, then I will agree
 with you.  The first change is to have a real data dependency between
 the read of a and the second read:
 
  - Initial state:
 
 a = &c, b = 0; c = 0;
 
  - Thread 1 (consumer):
 
 if (atomic_read(a, consume))

And the above should be if (atomic_read(a, consume) != &c).  Sigh!!!

Thanx, Paul

  return *a;
 /* not yet initialized */
 return -1;
 
  - Thread 2 (initializer):
 
  b = some_value_lets_say_42;
  /* We are now ready to party */
  atomic_write(a, b, release);



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Paul E. McKenney
On Thu, Feb 20, 2014 at 09:01:06AM -0800, Linus Torvalds wrote:
 On Thu, Feb 20, 2014 at 12:30 AM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
 
  So lets make this really simple: if you have a consume-cmp-read, is
  the ordering of the two reads guaranteed?
 
  Not as far as I know.  Also, as far as I know, there is no difference
  between consume and relaxed in the consume-cmp-read case.
 
 Ok, quite frankly, I think that means that consume is misdesigned.
 
  The above example can have a return value of 0 if translated
  straightforwardly into either ARM or Power, right?
 
 Correct. And I think that is too subtle. It's dangerous, it makes code
 that *looks* correct work incorrectly, and it actually happens to work
 on x86 since x86 doesn't have crap-for-brains memory ordering
 semantics.
 
  So, if you make one of two changes to your example, then I will agree
  with you.
 
 No. We're not playing games here. I'm fed up with complex examples
 that make no sense.

Hey, your original example didn't do what you wanted given the current
standard.  Those two modified examples are no more complex than your
original, and are the closest approximations that I can come up with
right off-hand that provide the result you wanted.

 Nobody sane writes code that does that pointer comparison, and it is
 entirely immaterial what the compiler can do behind our backs. The C
 standard semantics need to make sense to the *user* (ie programmer),
 not to a CPU and not to a compiler. The CPU and compiler are tools.
 They don't matter. Their only job is to make the code *work*, dammit.
 
 So no idiotic made-up examples that involve code that nobody will ever
 write and that have subtle issues.

There are places in the Linux kernel that do both pointer comparisons
and dereferences.  Something like the following:

p = rcu_dereference(gp);
if (!p)
p = default_structure;
do_something_with(p->a, p->b);

In the typical case where default_structure was initialized early on,
no ordering is needed in the !p case.

 So the starting point is that (same example as before, but with even
 clearer naming):
 
 Initialization state:
   initialized = 0;
   value = 0;
 
 Consumer:
 
 return atomic_read(initialized, consume) ? value : -1;
 
 Writer:
 value = 42;
 atomic_write(initialized, 1, release);
 
 and because the C memory ordering standard is written in such a way
 that this is subtly buggy (and can return 0, which is *not* logically
 a valid value), then I think the C memory ordering standard is broken.

You really need that consume to be acquire.

 That consumer memory ordering is dangerous as hell, and it is
 dangerous FOR NO GOOD REASON.
 
 The trivial fix to the standard would be to get rid of all the
 carries a dependency crap, and just say that *anything* that depends
 on it is ordered wrt it.
 
 That just means that on alpha, consume implies an unconditional read
 barrier (well, unless the value is never used and is loaded just
 because it is also volatile), on x86, consume is the same as
 acquire which is just a plain load with ordering guarantees, and on
 ARM or power you can still avoid the extra synchronization *if* the
 value is used just for computation and for following pointers, but if
 the value is used for a comparison, there needs to be a
 synchronization barrier.

Except in the default-value case, where no barrier is required.

People really do comparisons on the pointers that they get back from
rcu_dereference(), either to check for NULL (as above) or to check for
a specific pointer.  Ordering is not required in those cases.  The
cases that do require ordering from memory_order_consume values being
used in an if statement should instead be memory_order_acquire.

 Notice? Getting rid of the stupid carries-dependency crap from the
 standard actually
  (a) simplifies the standard
  (b) means that the above obvious example *works*
  (c) does not in *any* way make for any less efficient code generation
 for the cases that consume works correctly for in the current
 mis-designed standard.
  (d) is actually a hell of a lot easier to explain to a compiler
 writer, and I can guarantee that it is simpler to implement too.
 
 Why do I claim (d) it is simpler to implement - because on ARM/power
 you can implement it *exactly* as a special acquire, with just a
 trivial peep-hole special case that follows the use chain of the
 acquire op to the consume, and then just drop the acquire bit if the
 only use is that compute-to-load chain.

I don't believe that it is quite that simple.

But yes, the compiler guys would be extremely happy to simply drop
memory_order_consume from the standard, as it is the memory order
that they most love to hate.

Getting them to agree to any sort of peep-hole optimization semantics
for memory_order_consume is likely problematic.  

 In fact, realistically, the *only* thing you need to actually care
 about for the intended use case of consume

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Paul E. McKenney
On Thu, Feb 20, 2014 at 06:26:08PM +0100, Torvald Riegel wrote:
 
 On Wed, 2014-02-19 at 20:01 -0800, Paul E. McKenney wrote:
  On Wed, Feb 19, 2014 at 04:53:49PM -0800, Linus Torvalds wrote:
   On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel trie...@redhat.com 
   wrote:
On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote:
   
Can you point to it? Because I can find a draft standard, and it sure
as hell does *not* contain any clarity of the model. It has a *lot* of
verbiage, but it's pretty much impossible to actually understand, even
for somebody who really understands memory ordering.
   
http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
This has an explanation of the model up front, and then the detailed
formulae in Section 6.  This is from 2010, and there might have been
smaller changes since then, but I'm not aware of any bigger ones.
   
   Ahh, this is different from what others pointed at. Same people,
   similar name, but not the same paper.
   
   I will read this version too, but from reading the other one and the
   standard in parallel and trying to make sense of it, it seems that I
   may have originally misunderstood part of the whole control dependency
   chain.
   
    The fact that the left side of ? :, && and || breaks data
   dependencies made me originally think that the standard tried very
   hard to break any control dependencies. Which I felt was insane, when
   then some of the examples literally were about the testing of the
   value of an atomic read. The data dependency matters quite a bit. The
   fact that the other Mathematical paper then very much talked about
   consume only in the sense of following a pointer made me think so even
   more.
   
   But reading it some more, I now think that the whole data dependency
   logic (which is where the special left-hand side rule of the ternary
   and logical operators come in) are basically an exception to the rule
   that sequence points end up being also meaningful for ordering (ok, so
   C11 seems to have renamed sequence points to sequenced before).
   
   So while an expression like
   
   atomic_read(p, consume) ? a : b;
   
   doesn't have a data dependency from the atomic read that forces
   serialization, writing
   
  if (atomic_read(p, consume))
 a;
  else
 b;
   
   the standard *does* imply that the atomic read is happens-before wrt
   a, and I'm hoping that there is no question that the control
   dependency still acts as an ordering point.
  
  The control dependency should order subsequent stores, at least assuming
  that a and b don't start off with identical stores that the compiler
  could pull out of the if and merge.  The same might also be true for ?:
  for all I know.  (But see below)
 
 I don't think this is quite true.  I agree that a conditional store will
 not be executed speculatively (note that if it would happen in both the
 then and the else branch, it's not conditional); so, the store in
 a; (assuming it would be a store) won't happen unless the thread can
 really observe a true value for p.  However, this is *this thread's*
 view of the world, but not guaranteed to constrain how any other thread
 sees the state.  mo_consume does not contribute to
 inter-thread-happens-before in the same way that mo_acquire does (which
 *does* put a constraint on i-t-h-b, and thus enforces a global
 constraint that all threads have to respect).
 
 Is it clear which distinction I'm trying to show here?

If you are saying that the control dependencies are a result of a
combination of the standard and the properties of the hardware that
Linux runs on, I am with you.  (As opposed to control dependencies being
a result solely of the standard.)

This was a deliberate decision in 2007 or so.  At that time, the
documentation on CPU memory orderings were pretty crude, and it was
not clear that all relevant hardware respected control dependencies.
Back then, if you wanted an authoritative answer even to a fairly simple
memory-ordering question, you had to find a hardware architect, and you
probably waited weeks or even months for the answer.  Thanks to lots
of work from the Cambridge guys at about the time that the standard was
finalized, we have a much better picture of what the hardware does.

  That said, in this case, you could substitute relaxed for consume and get
  the same effect.  The return value from atomic_read() gets absorbed into
  the if condition, so there is no dependency-ordered-before relationship,
  so nothing for consume to do.
  
  One caution...  The happens-before relationship requires you to trace a
  full path between the two operations of interest.  This is illustrated
  by the following example, with both x and y initially zero:
  
  T1: atomic_store_explicit(&x, 1, memory_order_relaxed);
  r1 = atomic_load_explicit(&y

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Paul E. McKenney
On Thu, Feb 20, 2014 at 06:54:06PM +0100, Torvald Riegel wrote:
 
 On Thu, 2014-02-20 at 00:30 -0800, Paul E. McKenney wrote:
  Well, all the compilers currently convert consume to acquire, so you have
  your wish there.  Of course, that also means that they generate actual
  unneeded memory-barrier instructions, which seems extremely sub-optimal
  to me.
 
 GCC doesn't currently, but it also doesn't seem to track the
 dependencies, but that's a bug:
 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448

Ah, cool!

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Paul E. McKenney
On Thu, Feb 20, 2014 at 09:34:57AM -0800, Linus Torvalds wrote:
 On Thu, Feb 20, 2014 at 9:14 AM, Torvald Riegel trie...@redhat.com wrote:
 
  So the clarification is basically to the statement that the if
  (consume(p)) a version *would* have an ordering guarantee between the
  read of p and a, but the consume(p) ? a : b would *not* have
  such an ordering guarantee. Yes?
 
  Not as I understand it.  If my reply above wasn't clear, let me know and
  I'll try to rephrase it into something that is.
 
 Yeah, so you and Paul agree. And as I mentioned in the email that
 crossed with yours, I think that means that the standard is overly
 complex, hard to understand, fragile, and all *pointlessly* so.
 
 Btw, there are many ways that use a consume as an input to a
 conditional can happen. In particular, even if the result is actually
 *used* like a pointer as far as the programmer is concerned, tricks
 like pointer compression etc can well mean that the pointer is
 actually at least partly implemented using conditionals, so that some
 paths end up being only dependent through a comparison of the pointer
 value.
 
 So I very much did *not* want to complicate the litmus test code
 snippet when Paul tried to make it more complex, but I do think that
 there are cases where code that looks like pure pointer chasing
 actually is not for some cases, and then can become basically that
 litmus test for some path.
 
 Just to give you an example: in simple list handling it is not at all
 unusual to have a special node that is discovered by comparing the
 address, not by just loading from the pointer and following the list
 itself. Examples of that would be a HEAD node in a doubly linked list
 (Linux uses this concept quite widely, it's our default list
 implementation), or it could be a place-marker (cursor entry) in the
 list for safe traversal in the presence of concurrent deletes etc.

Yep, good example.  And also an example where the comparison does not
need to create any particular ordering.

 And obviously there is the already much earlier mentioned
 compiler-induced compare, due to value speculation, that can basically
 create such sequences even where they did not originally exist in the
 source code itself.
 
 So even if you work with pointer dereferences, and thus match that
 particular consume pattern, I really don't see why anybody would think
 that hey, we can ignore any control dependencies is a good idea.
 It's a *TERRIBLE* idea.
 
 And as mentioned, it's a terrible idea with no upsides. It doesn't
 help compiler optimizations for the case it's *intended* to help,
 since those optimizations can still be done without the horribly
 broken semantics. It doesn't help compiler writers, it just confuses
 them.
 
 And it sure as hell doesn't help users.

Yep.  And in 2014 we know a lot more about the hardware, so could make a
reasonable proposal.  In contrast, back in 2007 and 2008 memory ordering
was much more of a dark art, and proposals for control dependencies
therefore didn't get very far.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Paul E. McKenney
On Thu, Feb 20, 2014 at 10:32:51AM -0800, Linus Torvalds wrote:
 On Thu, Feb 20, 2014 at 10:11 AM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
 
  You really need that consume to be acquire.
 
 So I think we now all agree that that is what the standard is saying.
 
 And I'm saying that that is wrong, that the standard is badly written,
 and should be fixed.
 
 Because before the standard is fixed, I claim that consume is
 unusable. We cannot trust it. End of story.

We get exactly those same issues with control dependencies.

The example gcc breakage was something like this:

i = atomic_load(idx, memory_order_consume);
x = array[0 + i - i];

Then gcc optimized this to:

i = atomic_load(idx, memory_order_consume);
x = array[0];

This same issue would hit control dependencies.  You are free to argue
that this is the fault of ARM and PowerPC memory ordering, but the fact
remains that your suggested change has -exactly- the same vulnerability
as memory_order_consume currently has.

 The fact that apparently gcc is currently buggy because it got the
 dependency calculations *wrong* just reinforces my point.
 
 The gcc bug Torvald pointed at is exactly because the current C
 standard is illogical unreadable CRAP. I can guarantee that what
 happened is:
 
  - the compiler saw that the result of the read was used as the left
 hand expression of the ternary ? : operator
 
  - as a result, the compiler decided that there's no dependency
 
  - the compiler didn't think about the dependency that comes from the
 result of the load *also* being used as the middle part of the ternary
 expression, because it had optimized it away, despite the standard not
 talking about that at all.
 
  - so the compiler never saw the dependency that the standard talks about

No, the dependency was in a cancelling arithmetic expression as shown
above, so that gcc optimized the dependency away.  Then the ordering
was lost on AARCH64.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448

 BECAUSE THE STANDARD LANGUAGE IS PURE AND UTTER SHIT.
 
 My suggested language never had any of these problems, because *my*
 suggested semantics are clear, logical, and don't have these kinds of
 idiotic pit-falls.
 
 Solution: Fix the f*cking C standard. No excuses, no explanations.
 Just get it fixed.

I agree that the standard needs help, but your suggested fix has the
same problems as shown in the bugzilla.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Paul E. McKenney
On Thu, Feb 20, 2014 at 07:44:32PM +0100, Torvald Riegel wrote:
 
 On Thu, 2014-02-20 at 10:11 -0800, Paul E. McKenney wrote:
  But yes, the compiler guys would be extremely happy to simply drop
  memory_order_consume from the standard, as it is the memory order
  that they most love to hate.
  
  Getting them to agree to any sort of peep-hole optimization semantics
  for memory_order_consume is likely problematic.  
 
 I wouldn't be so pessimistic about that.  If the transformations can be
 shown to be always correct in terms of the semantics specified in the
 standard, and if the performance win is sufficiently large, why not?  Of
 course, somebody has to volunteer to actually implement it :)

I guess that there is only one way to find out.  ;-)

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Paul E. McKenney
On Thu, Feb 20, 2014 at 11:45:29AM -0800, Linus Torvalds wrote:
 On Thu, Feb 20, 2014 at 10:56 AM, Paul E. McKenney
 paul...@linux.vnet.ibm.com wrote:
 
  The example gcc breakage was something like this:
 
  i = atomic_load(idx, memory_order_consume);
  x = array[0 + i - i];
 
  Then gcc optimized this to:
 
  i = atomic_load(idx, memory_order_consume);
  x = array[0];
 
  This same issue would hit control dependencies.  You are free to argue
  that this is the fault of ARM and PowerPC memory ordering, but the fact
  remains that your suggested change has -exactly- the same vulnerability
  as memory_order_consume currently has.
 
 No it does not, for two reasons, first the legalistic (and bad) reason:
 
 As I actually described it, the consume becomes an acquire by
 default. If it's not used as an address to the dependent load, then
 it's an acquire. The use going away in no way makes the acquire go
 away in my simplistic model.
 
 So the compiler would actually translate that to a load-with-acquire,
 not be able to remove the acquire, and we have end of story. The
 actual code generation would be that ld + sync + ld on powerpc, or
 ld.acq on ARM.
 
 Now, the reason I claim that reason was legalistic and bad is that
 it's actually a cop-out, and if you had made the example be something
 like this:
 
p = atomic_load(ptr, memory_order_consume);
x = array[0 + p - p];
 y = p->val;
 
 then yes, I actually think that the order of loads of 'x' and 'p' are
 not enforced by the consume. The only case that is clear is the
 order of 'y' and 'p', because that is the only one that really *USES*
 the value.
 
 The use of +p-p is syntactic bullshit. It's very obvious to even a
 slightly developmentally challenged hedgehog that +p-p doesn't have
 any actual *semantic* meaning, it's purely syntactic.
 
 And the syntactic meaning is meaningless and doesn't matter. Because I
 would just get rid of the whole dependency chain language
 ALTOGETHER.
 
 So in fact, in my world, I would consider your example to be a
 non-issue. In my world, there _is_ no dependency chain at a
 syntactic level. In my SANE world, none of that insane crap language
 exists. That language is made-up and tied to syntax exactly because it
 *cannot* be tied to semantics.
 
 In my sane world, consume has a much simpler meaning, and has no
 legalistic syntactic meaning: only real use matters. If the value can
 be optimized away, the so can the barrier, and so can the whole load.
 The value isn't consumed, so it has no meaning.
 
 So if you write
 
i = atomic_load(idx, memory_order_consume);
x = array[0+i-i];
 
 then in my world that +i-i is meaningless. It's semantic fluff, and
 while my naive explanation would have left it as an acquire (because
 it cannot be peep-holed away), I actually do believe that the compiler
 should be obviously allowed to optimize the load away entirely since
 it's meaningless, and if no use of 'i' remains, then it has no
 consumer, and so there is no dependency.
 
 Put another way: consume is not about getting a lock, it's about
 getting a *value*. Only the dependency on the *value* matters, and if
 the value is optimized away, there is no dependency.
 
 And the value itself does not have any semantics. There's nothing
 volatile about the use of the value that would mean that the
 compiler cannot re-order it or remove it entirely. There's no barrier
 carried around by the value per se. The barrier is between the load
 and use. That's the *point* of consume after all.
 
 The whole chain of dependency language is pointless. It's wrong.
 It's complicated, it is illogical, and it causes subtle problems
 exactly because it got tied to the language *syntax* rather than to
 any logical use.
 
 Don't try to re-introduce the whole issue. It was a mistake for the C
 standard to talk about dependencies in the first place, exactly
 because it results in these idiotic legalistic practices.
 
 You do realize that that whole *(q+flag-flag) example in the
 bugzilla comes from the fact that the programmer tried to *fight* the
 fact that the C standard got the control dependency wrong?
 
 In other words, the *deepest* reason for that bugzilla is that the
 programmer tried to force the logical dependency by rewriting it as a
 (fake, and easily optimizable) data dependency.
 
 In *my* world, the stupid data-vs-control dependency thing goes away,
 the test of the value itself is a use of it, and *p ? *q : 0 just
 does the right thing, there's no reason to do that q+flag-flag thing
 in the first place, and if you do, the compiler *should* just ignore
 your little games.

Linus, given that you are calling me out for pushing legalistic and bad
things, syntactic bullshit, and playing little games, I am forced
to conclude that you have never attended any sort of standards-committee
meeting.  ;-)

That said, I am fine with pushing control/data dependencies with this
general approach

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Paul E. McKenney
On Wed, Feb 19, 2014 at 11:59:08AM +0100, Torvald Riegel wrote:
 On Tue, 2014-02-18 at 14:58 -0800, Paul E. McKenney wrote:
  On Tue, Feb 18, 2014 at 10:40:15PM +0100, Torvald Riegel wrote:
   
   On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote:
On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote:
 On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel trie...@redhat.com 
 wrote:
  On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
  And exactly because I know enough, I would *really* like atomics 
  to be
  well-defined, and have very clear - and *local* - rules about how 
  they
  can be combined and optimized.
 
  Local?
 
 Yes.
 
 So I think that one of the big advantages of atomics over volatile is
 that they *can* be optimized, and as such I'm not at all against
 trying to generate much better code than for volatile accesses.
 
 But at the same time, that can go too far. For example, one of the
 things we'd want to use atomics for is page table accesses, where it
 is very important that we don't generate multiple accesses to the
 values, because parts of the values can be change *by*hardware* (ie
 accessed and dirty bits).
 
 So imagine that you have some clever global optimizer that sees that
 the program never ever actually sets the dirty bit at all in any
 thread, and then uses that kind of non-local knowledge to make
 optimization decisions. THAT WOULD BE BAD.

Might as well list other reasons why value proofs via whole-program
analysis are unreliable for the Linux kernel:

1.  As Linus said, changes from hardware.
   
    This is what volatile is for, right?  (Or the weak-volatile idea I
   mentioned).
   
   Compilers won't be able to prove something about the values of such
   variables, if marked (weak-)volatile.
  
  Yep.
  
2.  Assembly code that is not visible to the compiler.
Inline asms will -normally- let the compiler know what
memory they change, but some just use the memory tag.
Worse yet, I suspect that most compilers don't look all
that carefully at .S files.

Any number of other programs contain assembly files.
   
   Are the annotations of changed memory really a problem?  If the memory
   tag exists, isn't that supposed to mean all memory?
   
   To make a proof about a program for location X, the compiler has to
   analyze all uses of X.  Thus, as soon as X escapes into an .S file, then
   the compiler will simply not be able to prove a thing (except maybe due
   to the data-race-free requirement for non-atomics).  The attempt to
   prove something isn't unreliable, simply because a correct compiler
   won't claim to be able to prove something.
  
  I am indeed less worried about inline assembler than I am about files
  full of assembly.  Or files full of other languages.
  
   One reason that could corrupt this is that if program addresses objects
   other than through the mechanisms defined in the language.  For example,
   if one thread lays out a data structure at a constant fixed memory
   address, and another one then uses the fixed memory address to get
   access to the object with a cast (e.g., (void*)0x123).
  
  Or if the program uses gcc linker scripts to get the same effect.
  
3.  Kernel modules that have not yet been written.  Now, the
compiler could refrain from trying to prove anything about
an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there
is currently no way to communicate this information to the
compiler other than marking the variable volatile.
   
   Even if the variable is just externally accessible, then the compiler
   knows that it can't do whole-program analysis about it.
   
   It is true that whole-program analysis will not be applicable in this
   case, but it will not be unreliable.  I think that's an important
   difference.
  
  Let me make sure that I understand what you are saying.  If my program has
  extern int foo;, the compiler will refrain from doing whole-program
  analysis involving foo?
 
 Yes.  If it can't be sure to actually have the whole program available,
 it can't do whole-program analysis, right?  Things like the linker
 scripts you mention or other stuff outside of the language semantics
 complicates this somewhat, and maybe some compilers assume too much.
 There's also the point that data-race-freedom is required for
 non-atomics even if those are shared with non-C-code.
 
 But except those corner cases, a compiler sees whether something escapes
 and becomes visible/accessible to other entities.

The traditional response to except those corner cases is of course
Murphy was an optimist.  ;-)

That said, point taken -- you

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Paul E. McKenney
On Wed, Feb 19, 2014 at 06:55:51PM +0100, Torvald Riegel wrote:
 On Wed, 2014-02-19 at 07:14 -0800, Paul E. McKenney wrote:
  On Wed, Feb 19, 2014 at 11:59:08AM +0100, Torvald Riegel wrote:

[ . . . ]

   On both sides, the compiler will see that mmap() (or similar) is called,
   so that means the data escapes to something unknown, which could create
   threads and so on.  So first, it can't do whole-program analysis for
   this state anymore, and has to assume that other C11 threads are
   accessing this memory.  Next, lock-free atomics are specified to be
   address-free, meaning that they must work independent of where in
   memory the atomics are mapped (see C++ (e.g., N3690) 29.4p3; that's a
   should and non-normative, but essential IMO).  Thus, this then boils
   down to just a simple case of synchronization.  (Of course, the rest of
   the ABI has to match too for the data exchange to work.)
  
  The compiler will see mmap() on the user side, but not on the kernel
  side.  On the kernel side, something special is required.
 
 Maybe -- you'll certainly know better :)
 
 But maybe it's not that hard: For example, if the memory is in current
 code made available to userspace via calling some function with an asm
 implementation that the compiler can't analyze, then this should be
 sufficient.

The kernel code would need to explicitly tell the compiler what portions
of the kernel address space were covered by this.  I would not want the
compiler to have to work it out based on observing interactions with the
page tables.  ;-)

  Agree that address-free would be nice as shall rather than should.
  
I echo Peter's question about how one tags functions like mmap().

I will also remember this for the next time someone on the committee
discounts volatile.  ;-)

  5.  JITed code produced based on BPF: 
  https://lwn.net/Articles/437981/
 
 This might be special, or not, depending on how the JITed code gets
 access to data.  If this is via fixed addresses (e.g., (void*)0x123),
 then see above.  If this is through function calls that the compiler
 can't analyze, then this is like 4.

It could well be via the kernel reading its own symbol table, sort of
a poor-person's reflection facility.  I guess that would be for all
intents and purposes equivalent to your (void*)0x123.
   
   If it is replacing code generated by the compiler, then yes.  If the JIT
   is just filling in functions that had been undefined yet declared
   before, then the compiler will have seen the data escape through the
   function interfaces, and should be aware that there is other stuff.
  
  So one other concern would then be things things like ftrace, kprobes,
  ksplice, and so on.  These rewrite the kernel binary at runtime, though
  in very limited ways.
 
 Yes.  Nonetheless, I wouldn't see a problem if they, say, rewrite with
 C11-compatible code (and same ABI) on a function granularity (and when
 the function itself isn't executing concurrently) -- this seems to be
 similar to just having another compiler compile this particular
 function.

Well, they aren't using C11-compatible code yet.  They do patch within
functions.  And in some cases, they make staged sequences of changes to
allow the patching to happen concurrently with other CPUs executing the
code being patched.  Not sure that any of the latter is actually in the
kernel at the moment, but it has at least been prototyped and discussed.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Paul E. McKenney
On Wed, Feb 19, 2014 at 04:53:49PM -0800, Linus Torvalds wrote:
 On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel trie...@redhat.com wrote:
  On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote:
 
  Can you point to it? Because I can find a draft standard, and it sure
  as hell does *not* contain any clarity of the model. It has a *lot* of
  verbiage, but it's pretty much impossible to actually understand, even
  for somebody who really understands memory ordering.
 
  http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
  This has an explanation of the model up front, and then the detailed
  formulae in Section 6.  This is from 2010, and there might have been
  smaller changes since then, but I'm not aware of any bigger ones.
 
 Ahh, this is different from what others pointed at. Same people,
 similar name, but not the same paper.
 
 I will read this version too, but from reading the other one and the
 standard in parallel and trying to make sense of it, it seems that I
 may have originally misunderstood part of the whole control dependency
 chain.
 
 The fact that the left side of ? :, && and || breaks data
 dependencies made me originally think that the standard tried very
 hard to break any control dependencies. Which I felt was insane, when
 then some of the examples literally were about the testing of the
 value of an atomic read. The data dependency matters quite a bit. The
 fact that the other Mathematical paper then very much talked about
 consume only in the sense of following a pointer made me think so even
 more.
 
 But reading it some more, I now think that the whole data dependency
 logic (which is where the special left-hand side rule of the ternary
 and logical operators come in) are basically an exception to the rule
 that sequence points end up being also meaningful for ordering (ok, so
 C11 seems to have renamed "sequence points" to "sequenced before").
 
 So while an expression like
 
 atomic_read(p, consume) ? a : b;
 
 doesn't have a data dependency from the atomic read that forces
 serialization, writing
 
if (atomic_read(p, consume))
   a;
else
   b;
 
 the standard *does* imply that the atomic read is happens-before wrt
 a, and I'm hoping that there is no question that the control
 dependency still acts as an ordering point.

The control dependency should order subsequent stores, at least assuming
that a and b don't start off with identical stores that the compiler
could pull out of the if and merge.  The same might also be true for ?:
for all I know.  (But see below)

That said, in this case, you could substitute relaxed for consume and get
the same effect.  The return value from atomic_read() gets absorbed into
the if condition, so there is no dependency-ordered-before relationship,
so nothing for consume to do.

One caution...  The happens-before relationship requires you to trace a
full path between the two operations of interest.  This is illustrated
by the following example, with both x and y initially zero:

T1: atomic_store_explicit(x, 1, memory_order_relaxed);
r1 = atomic_load_explicit(y, memory_order_relaxed);

T2: atomic_store_explicit(y, 1, memory_order_relaxed);
r2 = atomic_load_explicit(x, memory_order_relaxed);

There is a happens-before relationship between T1's load and store,
and another happens-before relationship between T2's load and store,
but there is no happens-before relationship from T1 to T2, and none
in the other direction, either.  And you don't get to assume any
ordering based on reasoning about these two disjoint happens-before
relationships.

So it is quite possible for r1==1 && r2==1 after both threads complete.

Which should be no surprise: This misordering can happen even on x86,
which would need a full smp_mb() to prevent it.

 THAT was one of my big confusions, the discussion about control
 dependencies and the fact that the logical ops broke the data
 dependency made me believe that the standard tried to actively avoid
 the whole issue with control dependencies can break ordering
 dependencies on some CPU's due to branch prediction and memory
 re-ordering by the CPU.
 
 But after all the reading, I'm starting to think that that was never
 actually the implication at all, and the logical ops breaks the data
 dependency rule is simply an exception to the sequence point rule.
 All other sequence points still do exist, and do imply an ordering
 that matters for consume
 
 Am I now reading it right?

As long as there is an unbroken chain of -data- dependencies from the
consume to the later access in question, and as long as that chain
doesn't go through the excluded operations, yes.

 So the clarification is basically to the statement that the "if
 (consume(p)) a" version *would* have an ordering guarantee between the
 read of p and a, but the "consume(p) ? a : b" would *not* have
 such an ordering guarantee. Yes?

Neither has a data-dependency guarantee, because there is no data
dependency from the load to either a or b.  

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Paul E. McKenney
On Tue, Feb 18, 2014 at 12:12:06PM +, Peter Sewell wrote:
 Several of you have said that the standard and compiler should not
 permit speculative writes of atomics, or (effectively) that the
 compiler should preserve dependencies.  In simple examples it's easy
 to see what that means, but in general it's not so clear what the
 language should guarantee, because dependencies may go via non-atomic
 code in other compilation units, and we have to consider the extent to
 which it's desirable to limit optimisation there.
 
 For example, suppose we have, in one compilation unit:
 
 void f(int ra, int*rb) {
   if (ra==42)
 *rb=42;
   else
 *rb=42;
 }

Hello, Peter!

Nice example!

The relevant portion of Documentation/memory-barriers.txt in my -rcu tree
says the following about the control dependency in the above construct:



q = ACCESS_ONCE(a);
if (q) {
barrier();
ACCESS_ONCE(b) = p;
do_something();
} else {
barrier();
ACCESS_ONCE(b) = p;
do_something_else();
}

The initial ACCESS_ONCE() is required to prevent the compiler from
proving the value of 'a', and the pair of barrier() invocations are
required to prevent the compiler from pulling the two identical stores
to 'b' out from the legs of the if statement.



So yes, current compilers need significant help if it is necessary to
maintain dependencies in that sort of code.

Similar examples came up in the data-dependency discussions in the
standards committee, which led to the [[carries_dependency]] attribute for
C11 and C++11.  Of course, current compilers don't have this attribute,
and the current Linux kernel code doesn't have any other marking for
data dependencies passing across function boundaries.  (Maybe some time
as an assist for detecting pointer leaks out of RCU read-side critical
sections, but efforts along those lines are a bit stalled at the moment.)

More on data dependencies below...

 and in another compilation unit the bodies of two threads:
 
 // Thread 0
 r1 = x;
 f(r1,r2);
 y = r2;
 
 // Thread 1
 r3 = y;
 f(r3,r4);
 x = r4;
 
 where accesses to x and y are annotated C11 atomic
 memory_order_relaxed or Linux ACCESS_ONCE(), accesses to
 r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0.
 
 (Of course, this is an artificial example, to make the point below as
 simply as possible - in real code the branches of the conditional
 might not be syntactically identical, just equivalent after macro
 expansion and other optimisation.)
 
 In the source program there's a dependency from the read of x to the
 write of y in Thread 0, and from the read of y to the write of x on
 Thread 1.  Dependency-respecting compilation would preserve those and
 the ARM and POWER architectures both respect them, so the reads of x
 and y could not give 42.
 
 But a compiler might well optimise the (non-atomic) body of f() to
 just *rb=42, making the threads effectively
 
 // Thread 0
 r1 = x;
 y = 42;
 
 // Thread 1
 r3 = y;
 x = 42;
 
 (GCC does this at O1, O2, and O3) and the ARM and POWER architectures
 permit those two reads to see 42. That is moreover actually observable
 on current ARM hardware.

I do agree that this could happen on current compilers and hardware.

Agreed, but as Peter Zijlstra noted in this thread, this optimization
is to a control dependency, not a data dependency.

 So as far as we can see, either:
 
 1) if you can accept the latter behaviour (if the Linux codebase does
not rely on its absence), the language definition should permit it,
and current compiler optimisations can be used,
 
 or
 
 2) otherwise, the language definition should prohibit it but the
compiler would have to preserve dependencies even in compilation
units that have no mention of atomics.  It's unclear what the
(runtime and compiler development) cost of that would be in
practice - perhaps Torvald could comment?

For current compilers, we have to rely on coding conventions within
the Linux kernel in combination with non-standard extentions to gcc
and specified compiler flags to disable undesirable behavior.  I have a
start on specifying this in a document I am preparing for the standards
committee, a very early draft of which may be found here:

http://www2.rdrop.com/users/paulmck/scalability/paper/consume.2014.02.16c.pdf

Section 3 shows the results of a manual scan through the Linux kernel's
dependency chains, and Section 4.1 lists a probably incomplete (and no
doubt erroneous) list of coding standards required to make dependency
chains work on current compilers.  Any comments and suggestions are more
than welcome!

 For more context, this example is taken from a summary of 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Paul E. McKenney
On Tue, Feb 18, 2014 at 03:33:35PM +, Peter Sewell wrote:
 Hi Paul,
 
 On 18 February 2014 14:56, Paul E. McKenney paul...@linux.vnet.ibm.com 
 wrote:
  On Tue, Feb 18, 2014 at 12:12:06PM +, Peter Sewell wrote:
  Several of you have said that the standard and compiler should not
  permit speculative writes of atomics, or (effectively) that the
  compiler should preserve dependencies.  In simple examples it's easy
  to see what that means, but in general it's not so clear what the
  language should guarantee, because dependencies may go via non-atomic
  code in other compilation units, and we have to consider the extent to
  which it's desirable to limit optimisation there.
 
  For example, suppose we have, in one compilation unit:
 
  void f(int ra, int*rb) {
if (ra==42)
  *rb=42;
else
  *rb=42;
  }
 
  Hello, Peter!
 
  Nice example!
 
  The relevant portion of Documentation/memory-barriers.txt in my -rcu tree
  says the following about the control dependency in the above construct:
 
  
 
  q = ACCESS_ONCE(a);
  if (q) {
  barrier();
  ACCESS_ONCE(b) = p;
  do_something();
  } else {
  barrier();
  ACCESS_ONCE(b) = p;
  do_something_else();
  }
 
  The initial ACCESS_ONCE() is required to prevent the compiler from
  proving the value of 'a', and the pair of barrier() invocations are
  required to prevent the compiler from pulling the two identical stores
  to 'b' out from the legs of the if statement.
 
 thanks
 
  
 
  So yes, current compilers need significant help if it is necessary to
  maintain dependencies in that sort of code.
 
  Similar examples came up in the data-dependency discussions in the
  standards committee, which led to the [[carries_dependency]] attribute for
  C11 and C++11.  Of course, current compilers don't have this attribute,
  and the current Linux kernel code doesn't have any other marking for
  data dependencies passing across function boundaries.  (Maybe some time
  as an assist for detecting pointer leaks out of RCU read-side critical
  sections, but efforts along those lines are a bit stalled at the moment.)
 
  More on data dependencies below...
 
  and in another compilation unit the bodies of two threads:
 
  // Thread 0
  r1 = x;
  f(r1,r2);
  y = r2;
 
  // Thread 1
  r3 = y;
  f(r3,r4);
  x = r4;
 
  where accesses to x and y are annotated C11 atomic
  memory_order_relaxed or Linux ACCESS_ONCE(), accesses to
  r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0.
 
  (Of course, this is an artificial example, to make the point below as
  simply as possible - in real code the branches of the conditional
  might not be syntactically identical, just equivalent after macro
  expansion and other optimisation.)
 
  In the source program there's a dependency from the read of x to the
  write of y in Thread 0, and from the read of y to the write of x on
  Thread 1.  Dependency-respecting compilation would preserve those and
  the ARM and POWER architectures both respect them, so the reads of x
  and y could not give 42.
 
  But a compiler might well optimise the (non-atomic) body of f() to
  just *rb=42, making the threads effectively
 
  // Thread 0
  r1 = x;
  y = 42;
 
  // Thread 1
  r3 = y;
  x = 42;
 
  (GCC does this at O1, O2, and O3) and the ARM and POWER architectures
  permit those two reads to see 42. That is moreover actually observable
  on current ARM hardware.
 
  I do agree that this could happen on current compilers and hardware.
 
  Agreed, but as Peter Zijlstra noted in this thread, this optimization
  is to a control dependency, not a data dependency.
 
 Indeed.  In principle (again as Hans has observed) a compiler might
 well convert between the two, e.g. if operating on single-bit values,
 or where value-range analysis has shown that a variable can only
 contain one of a small set of values.   I don't know whether that
 happens in practice?  Then there are also cases where a compiler is
 very likely to remove data/address dependencies, eg if some constant C
 is #define'd to be 0 then an array access indexed by x * C will have
 the dependency on x removed.  The runtime and compiler development
 costs of preventing that are also unclear to me.
 
 Given that, whether it's reasonable to treat control and data
 dependencies differently seems to be an open question.

Here is another (admittedly fanciful and probably buggy) implementation
of f() that relies on data dependencies (according to C11 and C++11),
but which could not be relied on to preserve thosse data dependencies
given current pre-C11 compilers:

int arr[2] = { 42, 43 };
int *bigarr;

int f

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Paul E. McKenney
On Tue, Feb 18, 2014 at 04:56:40PM +0100, Torvald Riegel wrote:
 On Mon, 2014-02-17 at 19:00 -0800, Paul E. McKenney wrote:
  On Mon, Feb 17, 2014 at 12:18:21PM -0800, Linus Torvalds wrote:
   On Mon, Feb 17, 2014 at 11:55 AM, Torvald Riegel trie...@redhat.com 
   wrote:
   
Which example do you have in mind here?  Haven't we resolved all the
debated examples, or did I miss any?
   
   Well, Paul seems to still think that the standard possibly allows
   speculative writes or possibly value speculation in ways that break
   the hardware-guaranteed orderings.
  
  It is not that I know of any specific problems, but rather that I
  know I haven't looked under all the rocks.  Plus my impression from
  my few years on the committee is that the standard will be pushed to
  the limit when it comes time to add optimizations.
  
  One example that I learned about last week uses the branch-prediction
  hardware to validate value speculation.  And no, I am not at all a fan
  of value speculation, in case you were curious.  However, it is still
  an educational example.
  
  This is where you start:
  
  p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */
  do_something(p->a, p->b, p->c);
  p->d = 1;
 
 I assume that's the source code.

Yep!

  Then you leverage branch-prediction hardware as follows:
  
  p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */
  if (p == GUESS) {
  do_something(GUESS->a, GUESS->b, GUESS->c);
  GUESS->d = 1;
  } else {
  do_something(p->a, p->b, p->c);
  p->d = 1;
  }
 
 I assume that this is a potential transformation by a compiler.

Again, yep!

  The CPU's branch-prediction hardware squashes speculation in the case where
  the guess was wrong, and this prevents the speculative store to ->d from
  ever being visible.  However, the then-clause breaks dependencies, which
  means that the loads -could- be speculated, so that do_something() gets
  passed pre-initialization values.
  
  Now, I hope and expect that the wording in the standard about dependency
  ordering prohibits this sort of thing.  But I do not yet know for certain.
 
 The transformation would be incorrect.  p->a in the source code carries
 a dependency, and as you say, the transformed code wouldn't have that
 dependency any more.  So the transformed code would lose ordering
 constraints that it has in the virtual machine, so in the absence of
 other proofs of correctness based on properties not shown in the
 example, the transformed code would not result in the same behavior as
 allowed by the abstract machine.

Glad that you agree!  ;-)

 If the transformation would actually be by a programmer, then this
 wouldn't do the same as the first example because mo_consume doesn't
 work through the if statement.

Agreed.

 Are there other specified concerns that you have regarding this example?

Nope.  Just generalized paranoia.  (But just because I am paranoid
doesn't mean that there isn't a bug lurking somewhere in the standard,
the compiler, the kernel, or my own head!)

I will likely have more once I start mapping Linux kernel atomics to the
C11 standard.  One more paper past N3934 comes first, though.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Paul E. McKenney
On Tue, Feb 18, 2014 at 04:38:40PM +0100, Torvald Riegel wrote:
 On Mon, 2014-02-17 at 16:18 -0800, Linus Torvalds wrote:
  On Mon, Feb 17, 2014 at 3:41 PM, Torvald Riegel trie...@redhat.com wrote:
  
   There's an underlying problem here that's independent from the actual
   instance that you're worried about here: no sense is a ultimately a
   matter of taste/objectives/priorities as long as the respective
   specification is logically consistent.
  
  Yes. But I don't think it's independent.
  
  Exactly *because* some people will read standards without applying
  "does the resulting code generation actually make sense for the
  programmer that wrote the code", the standard has to be pretty clear.
  
  The standard often *isn't* pretty clear. It wasn't clear enough when
  it came to volatile, and yet that was a *much* simpler concept than
  atomic accesses and memory ordering.
  
  And most of the time it's not a big deal. But because the C standard
  generally tries to be very portable, and cover different machines,
  there tends to be a mindset that anything inherently unportable is
  "undefined" or "implementation defined", and then the compiler writer
  is basically given free rein to do anything they want (with
  "implementation defined" at least requiring that it is reliably the
  same thing).
 
 Yes, that's how it works in general.  And this makes sense, because all
 optimizations rely on that.  Second, you can't keep something consistent
 (eg, between compilers) if it isn't specified.  So if we want stricter
 rules, those need to be specified somewhere.
 
  And when it comes to memory ordering, *everything* is basically
  non-portable, because different CPU's very much have different rules.
 
 Well, the current set of memory orders (and the memory model as a whole)
 is portable, even though it might not allow to exploit all hardware
 properties, and thus might perform sub-optimally in some cases.
 
  I worry that that means that the standard then takes the stance that
  well, compiler re-ordering is no worse than CPU re-ordering, so we
  let the compiler do anything. And then we have to either add
  volatile to make sure the compiler doesn't do that, or use an overly
  strict memory model at the compiler level that makes it all pointless.
 
 Using volatile is not a good option, I think, because synchronization
 between threads should be orthogonal to observable output of the
 abstract machine.

Are you thinking of volatile -instead- of atomics?  My belief is that
given the current standard there will be times that we need to use
volatile -in- -addition- to atomics.

 The current memory model might not allow to exploit all hardware
 properties, I agree.
 
 But then why don't we work on how to extend it to do so?  We need to
 specify the behavior we want anyway, and this can't be independent of
 the language semantics, so it has to be conceptually integrated with the
 standard anyway.
 
  So I really really hope that the standard doesn't give compiler
  writers free hands to do anything that they can prove is equivalent
  in the virtual C machine model.
 
 It does, but it also doesn't mean this can't be extended.  So let's
 focus on whether we can find an extension.
 
  That's not how you get reliable
  results.
 
 In this general form, that's obviously a false claim.

These two sentences starkly illustrate the difference in perspective
between you two.  You are talking past each other.  Not sure how to fix
this at the moment, but what else is new?  ;-)

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Paul E. McKenney
On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote:
 On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel trie...@redhat.com wrote:
  On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
  And exactly because I know enough, I would *really* like atomics to be
  well-defined, and have very clear - and *local* - rules about how they
  can be combined and optimized.
 
  Local?
 
 Yes.
 
 So I think that one of the big advantages of atomics over volatile is
 that they *can* be optimized, and as such I'm not at all against
 trying to generate much better code than for volatile accesses.
 
 But at the same time, that can go too far. For example, one of the
 things we'd want to use atomics for is page table accesses, where it
 is very important that we don't generate multiple accesses to the
 values, because parts of the values can be change *by*hardware* (ie
 accessed and dirty bits).
 
 So imagine that you have some clever global optimizer that sees that
 the program never ever actually sets the dirty bit at all in any
 thread, and then uses that kind of non-local knowledge to make
 optimization decisions. THAT WOULD BE BAD.

Might as well list other reasons why value proofs via whole-program
analysis are unreliable for the Linux kernel:

1.  As Linus said, changes from hardware.

2.  Assembly code that is not visible to the compiler.
Inline asms will -normally- let the compiler know what
 memory they change, but some just use the "memory" tag.
Worse yet, I suspect that most compilers don't look all
that carefully at .S files.

Any number of other programs contain assembly files.

3.  Kernel modules that have not yet been written.  Now, the
compiler could refrain from trying to prove anything about
an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there
is currently no way to communicate this information to the
compiler other than marking the variable volatile.

Other programs have similar issues, e.g., via dlopen().

4.  Some drivers allow user-mode code to mmap() some of their
state.  Any changes undertaken by the user-mode code would
be invisible to the compiler.

5.  JITed code produced based on BPF: https://lwn.net/Articles/437981/

And probably other stuff as well.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Paul E. McKenney
On Tue, Feb 18, 2014 at 03:16:33PM +, Mark Batty wrote:
 Hi Paul,
 
 Thanks for the document. I'm looking forward to reading the bits about
 dependency chains in Linux.

And I am looking forward to your thoughts on those bits!

  One point of confusion for me...  Example 4 says "language must allow."
  Shouldn't that be "language is permitted to allow"?
 
 When we say "allow", we mean that the optimised execution should be
 allowed by the specification, and, implicitly, the unoptimised
 execution should remain allowed too. We want to be concrete about what
 the language specification allows, and that's why we say "must". It is
 not to disallow the unoptimised execution.

OK, got it!

Thanx, Paul

  Seems like an
  implementation is always within its rights to avoid an optimization if
  its implementation prevents it from safely detecting the oppportunity
  for that optimization.
 
 That's right.
 
 - Mark
 
 
  Or am I missing something here?
 
  Thanx, Paul
 
 



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Paul E. McKenney
On Tue, Feb 18, 2014 at 09:44:48AM -0800, Linus Torvalds wrote:
 On Tue, Feb 18, 2014 at 8:17 AM, Torvald Riegel trie...@redhat.com wrote:

[ . . . ]

  The standard is clear on what's required.  I strongly suggest reading
  the formalization of the memory model by Batty et al.
 
 Can you point to it? Because I can find a draft standard, and it sure
 as hell does *not* contain any clarity of the model. It has a *lot* of
 verbiage, but it's pretty much impossible to actually understand, even
 for somebody who really understands memory ordering.

I suspect he is thinking of the following:

Mathematizing C++ Concurrency. Mark Batty, Scott Owens, Susmit Sarkar,
Peter Sewell, and Tjark Weber.
https://www.cl.cam.ac.uk/~pes20/cpp/popl085ap-sewell.pdf

Even if you don't like the math, it contains some very good examples.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Paul E. McKenney
On Tue, Feb 18, 2014 at 06:23:47PM +, Peter Sewell wrote:
 On 18 February 2014 17:16, Paul E. McKenney paul...@linux.vnet.ibm.com 
 wrote:
  On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote:
  On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel trie...@redhat.com wrote:
   On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
   And exactly because I know enough, I would *really* like atomics to be
   well-defined, and have very clear - and *local* - rules about how they
   can be combined and optimized.
  
   Local?
 
  Yes.
 
  So I think that one of the big advantages of atomics over volatile is
  that they *can* be optimized, and as such I'm not at all against
  trying to generate much better code than for volatile accesses.
 
  But at the same time, that can go too far. For example, one of the
  things we'd want to use atomics for is page table accesses, where it
  is very important that we don't generate multiple accesses to the
  values, because parts of the values can be changed *by hardware* (ie
  accessed and dirty bits).
 
  So imagine that you have some clever global optimizer that sees that
  the program never ever actually sets the dirty bit at all in any
  thread, and then uses that kind of non-local knowledge to make
  optimization decisions. THAT WOULD BE BAD.
 
  Might as well list other reasons why value proofs via whole-program
  analysis are unreliable for the Linux kernel:
 
  1.  As Linus said, changes from hardware.
 
  2.  Assembly code that is not visible to the compiler.
  Inline asms will -normally- let the compiler know what
  memory they change, but some just use the "memory" tag.
  Worse yet, I suspect that most compilers don't look all
  that carefully at .S files.
 
  Any number of other programs contain assembly files.
 
  3.  Kernel modules that have not yet been written.  Now, the
  compiler could refrain from trying to prove anything about
  an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there
  is currently no way to communicate this information to the
  compiler other than marking the variable volatile.
 
  Other programs have similar issues, e.g., via dlopen().
 
  4.  Some drivers allow user-mode code to mmap() some of their
  state.  Any changes undertaken by the user-mode code would
  be invisible to the compiler.
 
  5.  JITed code produced based on BPF: https://lwn.net/Articles/437981/
 
  And probably other stuff as well.
 
 interesting list.  So are you saying that value-range-analysis and
 such-like (I say glibly, without really knowing what such-like
 refers to here) are fundamentally incompatible with
 the kernel code, or can you think of some way to tell the compiler a
 bound on the footprint of the unseen changes in each of those cases?

Other than the volatile keyword, no.

Well, I suppose you could also create a function that changed the
variables in question, then arrange to never call it, but in such a way
that the compiler could not prove that it was never called.  But ouch!

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Paul E. McKenney
On Tue, Feb 18, 2014 at 10:49:27AM -0800, Linus Torvalds wrote:
 On Tue, Feb 18, 2014 at 10:21 AM, Peter Sewell
 peter.sew...@cl.cam.ac.uk wrote:
 
  This is a bit more subtle, because (on ARM and POWER) removing the
  dependency and conditional branch is actually in general *not* equivalent
  in the hardware, in a concurrent context.
 
 So I agree, but I think that's a generic issue with non-local memory
 ordering, and is not at all specific to the optimization wrt that
 x?42:42 expression.
 
 If you have a value that you loaded with a non-relaxed load, and you
 pass that value off to a non-local function that you don't know what
 it does, in my opinion that implies that the compiler had better add
 the necessary serialization to say whatever that other function does,
 we guarantee the semantics of the load.
 
 So on ppc, if you do a load with consume or acquire and then call
 another function without having had something in the caller that
 serializes the load, you'd better add the lwsync or whatever before
 the call. Exactly because the function call itself otherwise basically
 breaks the visibility into ordering. You've basically turned a
 load-with-ordering-guarantees into just an integer that you passed off
 to something that doesn't know about the ordering guarantees - and you
 need that lwsync in order to still guarantee the ordering.
 
 Tough titties. That's what a CPU with weak memory ordering semantics
 gets in order to have sufficient memory ordering.

And that is in fact what C11 compilers are supposed to do if the function
doesn't have the [[carries_dependency]] attribute on the corresponding
argument or return of the non-local function.  If the function is marked
with [[carries_dependency]], then the compiler has the information needed
in both compilations to make things work correctly.

Thanx, Paul

 And I don't think it's actually a problem in practice. If you are
 doing loads with ordered semantics, you're not going to pass the
 result off willy-nilly to random functions (or you really *do* require
 the ordering, because the load that did the acquire was actually for
 a lock!
 
 So I really think that the local optimization is correct regardless.
 
Linus
 



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Paul E. McKenney
On Tue, Feb 18, 2014 at 09:43:31PM +0100, Torvald Riegel wrote:
 
 On Tue, 2014-02-18 at 12:12 +, Peter Sewell wrote:
  Several of you have said that the standard and compiler should not
  permit speculative writes of atomics, or (effectively) that the
  compiler should preserve dependencies.  In simple examples it's easy
  to see what that means, but in general it's not so clear what the
  language should guarantee, because dependencies may go via non-atomic
  code in other compilation units, and we have to consider the extent to
  which it's desirable to limit optimisation there.
 
 [...]
 
  2) otherwise, the language definition should prohibit it but the
 compiler would have to preserve dependencies even in compilation
 units that have no mention of atomics.  It's unclear what the
 (runtime and compiler development) cost of that would be in
 practice - perhaps Torvald could comment?
 
 If I'm reading the standard correctly, it requires that data
 dependencies are preserved through loads and stores, including nonatomic
 ones.  That sounds convenient because it allows programmers to use
 temporary storage.
 
 However, what happens if a dependency arrives at a store for which the
 alias set isn't completely known?  Then we either have to add a barrier
 to enforce the ordering at this point, or we have to assume that all
 other potentially aliasing memory locations would also have to start
 carrying dependencies (which might be in other functions in other
 compilation units).  Neither option is good.  The first might introduce
 barriers in places in which they might not be required (or the
 programmer has to use kill_dependency() quite often to avoid all these).
 The second is bad because points-to analysis is hard, so in practice the
 points-to set will not be precisely known for a lot of pointers.  So
 this might not just creep into other functions via calls of
 [[carries_dependency]] functions, but also through normal loads and
 stores, likely prohibiting many optimizations.

I cannot immediately think of a situation where a store carrying a
dependency into a non-trivially aliased object wouldn't be a usage
error, so perhaps emitting a barrier and a diagnostic at that point
is best.

 Furthermore, the dependency tracking can currently only be
 disabled/enabled on a function granularity (via
 [[carries_dependency]]).  Thus, if we have big functions, then
 dependency tracking may slow down a lot of code in the big function.  If
 we have small functions, there's a lot of attributes to be added.
 
 If a function may only carry a dependency but doesn't necessarily (eg,
 depending on input parameters), then the programmer has to make a
 trade-off whether he/she wants to benefit from mo_consume but slow down
 other calls due to additional barriers (ie, when this function is called
 from non-[[carries_dependency]] functions), or vice versa.  (IOW,
 because of the function granularity, other code's performance is
 affected.)
 
 If a compiler wants to implement dependency tracking just for a few
 constructs (e.g., operators - + ...) and use barriers otherwise, then
 this decision must be compatible with how all this is handled in other
 compilation units.  Thus, compiler optimizations effectively become part
 of the ABI, which doesn't seem right.
 
 I hope these examples illustrate my concerns about the implementability
 in practice of this.  It's also why I've suggested to move from an
 opt-out approach as in the current standard (ie, with kill_dependency())
 to an opt-in approach for conservative dependency tracking (e.g., with a
 preserve_dependencies(exp) call, where exp will not be optimized in a
 way that removes any dependencies).  This wouldn't help with many
 optimizations being prevented, but it should at least help programmers
 contain the problem to smaller regions of code.
 
 I'm not aware of any implementation that tries to track dependencies, so
 I can't give any real performance numbers.  This could perhaps be
 simulated, but I'm not sure whether a realistic case would be made
 without at least supporting [[carries_dependency]] properly in the
 compiler, which would be some work.

Another approach would be to use start-tracking/stop-tracking directives
that could be buried into rcu_read_lock() and rcu_read_unlock().  There
are issues with nesting and conditional use of rcu_read_lock() and
rcu_read_unlock(), but it does give you nicer granularity properties.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Paul E. McKenney
On Tue, Feb 18, 2014 at 10:40:15PM +0100, Torvald Riegel wrote:
 
 On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote:
  On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote:
   On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel trie...@redhat.com 
   wrote:
On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
And exactly because I know enough, I would *really* like atomics to be
well-defined, and have very clear - and *local* - rules about how they
can be combined and optimized.
   
Local?
   
   Yes.
   
   So I think that one of the big advantages of atomics over volatile is
   that they *can* be optimized, and as such I'm not at all against
   trying to generate much better code than for volatile accesses.
   
   But at the same time, that can go too far. For example, one of the
   things we'd want to use atomics for is page table accesses, where it
   is very important that we don't generate multiple accesses to the
   values, because parts of the values can be change *by*hardware* (ie
   accessed and dirty bits).
   
   So imagine that you have some clever global optimizer that sees that
   the program never ever actually sets the dirty bit at all in any
   thread, and then uses that kind of non-local knowledge to make
   optimization decisions. THAT WOULD BE BAD.
  
  Might as well list other reasons why value proofs via whole-program
  analysis are unreliable for the Linux kernel:
  
  1.  As Linus said, changes from hardware.
 
 This is what volatile is for, right?  (Or the weak-volatile idea I
 mentioned).
 
 Compilers won't be able to prove something about the values of such
 variables, if marked (weak-)volatile.

Yep.

  2.  Assembly code that is not visible to the compiler.
  Inline asms will -normally- let the compiler know what
  memory they change, but some just use the "memory" tag.
  Worse yet, I suspect that most compilers don't look all
  that carefully at .S files.
  
  Any number of other programs contain assembly files.
 
 Are the annotations of changed memory really a problem?  If the "memory"
 tag exists, isn't that supposed to mean all memory?
 
 To make a proof about a program for location X, the compiler has to
 analyze all uses of X.  Thus, as soon as X escapes into an .S file, then
 the compiler will simply not be able to prove a thing (except maybe due
 to the data-race-free requirement for non-atomics).  The attempt to
 prove something isn't unreliable, simply because a correct compiler
 won't claim to be able to prove something.

I am indeed less worried about inline assembler than I am about files
full of assembly.  Or files full of other languages.

 One reason that could corrupt this is that if program addresses objects
 other than through the mechanisms defined in the language.  For example,
 if one thread lays out a data structure at a constant fixed memory
 address, and another one then uses the fixed memory address to get
 access to the object with a cast (e.g., (void*)0x123).

Or if the program uses gcc linker scripts to get the same effect.

  3.  Kernel modules that have not yet been written.  Now, the
  compiler could refrain from trying to prove anything about
  an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there
  is currently no way to communicate this information to the
  compiler other than marking the variable volatile.
 
 Even if the variable is just externally accessible, then the compiler
 knows that it can't do whole-program analysis about it.
 
 It is true that whole-program analysis will not be applicable in this
 case, but it will not be unreliable.  I think that's an important
 difference.

Let me make sure that I understand what you are saying.  If my program has
"extern int foo;", the compiler will refrain from doing whole-program
analysis involving foo?  Or to ask it another way, when you say
"whole-program analysis", are you restricting that analysis to the
current translation unit?

If so, I was probably not the only person thinking that you instead meant
analysis across all translation units linked into the program.  ;-)

  Other programs have similar issues, e.g., via dlopen().
  
  4.  Some drivers allow user-mode code to mmap() some of their
  state.  Any changes undertaken by the user-mode code would
  be invisible to the compiler.
 
 A good point, but a compiler that doesn't try to (incorrectly) assume
 something about the semantics of mmap will simply see that the mmap'ed
 data will escape to stuff it can't analyze, so it will not be able to
 make a proof.
 
 This is different from, for example, malloc(), which is guaranteed to
 return fresh nonaliasing memory.

As Peter noted, this is the other end of mmap().  The -user- code sees
that there is an mmap(), but the kernel code invokes functions that
poke values into hardware registers

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-17 Thread Paul E. McKenney
On Wed, Feb 12, 2014 at 07:12:05PM +0100, Peter Zijlstra wrote:
 On Wed, Feb 12, 2014 at 09:42:09AM -0800, Paul E. McKenney wrote:
  You need volatile semantics to force the compiler to ignore any proofs
  it might otherwise attempt to construct.  Hence all the ACCESS_ONCE()
  calls in my email to Torvald.  (Hopefully I translated your example
  reasonably.)
 
 My brain gave out for today; but it did appear to have the right
 structure.

I can relate.  ;-)

 I would prefer it if C11 would not require the volatile casts. It should
 simply _never_ speculate with atomic writes, volatile or not.

I agree with not needing volatiles to prevent speculated writes.  However,
they will sometimes be needed to prevent excessive load/store combining.
The compiler doesn't have the runtime feedback mechanisms that the
hardware has, and thus will need help from the developer from time
to time.

Or maybe the Linux kernel simply waits to transition to C11 relaxed atomics
until the compiler has learned to be sufficiently conservative in its
load-store combining decisions.

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-17 Thread Paul E. McKenney
On Mon, Feb 17, 2014 at 08:55:47PM +0100, Torvald Riegel wrote:
 On Sat, 2014-02-15 at 10:49 -0800, Linus Torvalds wrote:
  On Sat, Feb 15, 2014 at 9:45 AM, Torvald Riegel trie...@redhat.com wrote:
  
   I think a major benefit of C11's memory model is that it gives a
   *precise* specification for how a compiler is allowed to optimize.
  
  Clearly it does *not*. This whole discussion is proof of that. It's
  not at all clear,
 
 It might not be an easy-to-understand specification, but as far as I'm
 aware it is precise.  The Cambridge group's formalization certainly is
 precise.  From that, one can derive (together with the usual rules for
 as-if etc.) what a compiler is allowed to do (assuming that the standard
 is indeed precise).  My replies in this discussion have been based on
 reasoning about the standard, and not secret knowledge (with the
 exception of no-out-of-thin-air, which is required in the standard's
 prose but not yet formalized).
 
 I agree that I'm using the formalization as a kind of placeholder for
 the standard's prose (which isn't all that easy to follow for me
 either), but I guess there's no way around an ISO standard using prose.
 
 If you see a case in which the standard isn't precise, please bring it
 up or open a C++ CWG issue for it.

I suggest that I go through the Linux kernel's requirements for atomics
and memory barriers and see how they map to C11 atomics.  With that done,
we would have very specific examples to go over.  Without that done, the
discussion won't converge very well.

Seem reasonable?

Thanx, Paul

  and the standard apparently is at least debatably
  allowing things that shouldn't be allowed.
 
 Which example do you have in mind here?  Haven't we resolved all the
 debated examples, or did I miss any?
 
  It's also a whole lot more
  complicated than volatile, so the likelihood of a compiler writer
  actually getting it right - even if the standard does - is lower.
 
 It's not easy, that's for sure, but none of the high-performance
 alternatives are easy either.  There are testing tools out there based
 on the formalization of the model, and we've found bugs with them.
 
 And the alternative of using something not specified by the standard is
 even worse, I think, because then you have to guess what a compiler
 might do, without having any constraints; IOW, one is resorting to "no
 sane compiler would do that", and that doesn't seem very robust
 either.
 
 



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-17 Thread Paul E. McKenney
On Mon, Feb 17, 2014 at 09:39:54PM +0100, Richard Biener wrote:
 On February 17, 2014 7:18:15 PM GMT+01:00, Paul E. McKenney 
 paul...@linux.vnet.ibm.com wrote:
 On Wed, Feb 12, 2014 at 07:12:05PM +0100, Peter Zijlstra wrote:
  On Wed, Feb 12, 2014 at 09:42:09AM -0800, Paul E. McKenney wrote:
  You need volatile semantics to force the compiler to ignore any proofs
  it might otherwise attempt to construct.  Hence all the ACCESS_ONCE()
  calls in my email to Torvald.  (Hopefully I translated your example
  reasonably.)
  
  My brain gave out for today; but it did appear to have the right
  structure.
 
 I can relate.  ;-)
 
  I would prefer it if C11 would not require the volatile casts. It should
  simply _never_ speculate with atomic writes, volatile or not.
 
 I agree with not needing volatiles to prevent speculated writes.  However,
 they will sometimes be needed to prevent excessive load/store combining.
 The compiler doesn't have the runtime feedback mechanisms that the
 hardware has, and thus will need help from the developer from time
 to time.
 
 Or maybe the Linux kernel simply waits to transition to C11 relaxed atomics
 until the compiler has learned to be sufficiently conservative in its
 load-store combining decisions.
 
 Sounds backwards. Currently the compiler does nothing to the atomics. I'm 
 sure we'll eventually add something. But if testing coverage is zero outside 
 then surely things get worse, not better with time.

Perhaps we solve this chicken-and-egg problem by creating a test suite?

Thanx, Paul


