Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics
On Tue, Jul 04, 2023 at 04:25:45PM -0400, Alan Stern wrote: > On Tue, Jul 04, 2023 at 01:19:23PM -0400, Olivier Dion wrote: [ . . . ] > > I am puzzled by this. Initialization of a shared variable does not need > > to be atomic until its publication. Could you expand on this? > > In the kernel, I believe it sometimes happens that an atomic variable > may be published before it is initialized. (If that's wrong, Paul or > Peter can correct me.) But since this doesn't apply to the situations > you're concerned with, you can forget I mentioned it. Both use cases exist. A global atomic is implicitly published at compile time. If the desired initial value is not known until multiple threads are running, then it is necessary to be careful. Hence double-checked locking and its various replacements. (Clearly, if you can determine the initial value before going multithreaded, life is simpler.) And dynamically allocated or on-stack storage is the other case, where there is a point in time when the storage is private even after multiple threads are running. Or am I missing the point? Thanx, Paul
Re: [PATCH] Adding _Dependent_ptr type qualifier in C part 1/3
And please see the following URL for an update: http://www.rdrop.com/~paulmck/submission/D%20_Dependent_ptr%20to%20simplify%20carries%20a%20dependency.pdf This adds a description of conversions, warnings, and patches. Thanx, Paul On Sun, Aug 04, 2019 at 04:11:12PM -0700, Paul E. McKenney wrote: > Good points! > > On the type-qualifier interactions, here is an initial list. Thoughts? > > Thanx, Paul > > _Alignas: Dependency-breaking optimizations would be avoided, and the > variable would be aligned as specified. > > _Atomic: Dependency-breaking optimizations would be avoided, and the > variable would be accessed using C11 atomics. > > const: This is not particularly useful for variables with static storage > duration because compile-time initialization does not require dependency > ordering, but then again, use of _Dependent_ptr on such variables is > suspect to begin with. Otherwise, the const _Dependent_ptr variable > would normally be initialized from another _Dependent_ptr variable or > from a memory_order_consume load. The variable would disallow further > stores and avoid breaking dependencies. > > extern: Dependency-breaking optimizations would be avoided, and the > variable would be usable within other translation units. This is > also an unusual addition to a _Dependent_ptr unless also accompanied by > _Thread_local because there are no known non-suspect multi-threaded-access > use cases for _Dependent_ptr. > > register: Dependency-breaking optimizations would be avoided, and the > compiler would be given a hint to keep the variable in a register. > > restrict: Dependency-breaking optimizations would be avoided, and the > compiler may assume that the pointed-to object is only accessed through > this pointer and through pointers derived from it. > > static: Dependency-breaking optimizations would be avoided, and the > variable would be static. 
This is also an unusual addition to a > _Dependent_ptr unless also accompanied by _Thread_local because there are > no known non-suspect multi-threaded-access use cases for _Dependent_ptr. > > _Thread_local: The dependency-carrying variable is thread-local, and > avoids dependency-breaking optimizations. > > volatile: All accesses would be executed as per the abstract machine, > and dependency-breaking optimizations would be avoided. > > On Fri, Aug 02, 2019 at 08:30:49PM -0600, Martin Sebor wrote: > > On 8/1/19 9:26 AM, Paul McKenney wrote: > > >Excellent point, this discussion needs to be made official. > > >Please see attached for an initial draft of a working paper. > > > > > >Thoughts? > > > > The draft is a start but it (obviously) needs a lot of work to cover > > the constraints and semantics so I'll just comment on the patch in > > a bit more detail. > > > > Since the feature is being (or will be) proposed to WG14 and could > > change I would expect it to be only available under -std=c2x and > > disabled otherwise, so that code that makes use of it does so with > > the understanding that it's only experimental. (I just noticed > > Akshat's email with a tweak to the patch. I'm not sure that issuing > > a pedantic warning and having the name in the implementation namespace > > is enough but others (the C FE maintainers) will know better.) > > > > Other than that, I would also expect to see more extensive test > > coverage, at a minimum to exercise error detection (invalid uses, > > conversions, etc.). For example, I see the code that rejects it > > but no tests for declaring, say, a _Dependent_ptr-qualified integer. > > What I don't think I see is code that rejects _Dependent_ptr-qualified > > function pointers (AFAIK, POINTER_TYPE_P evaluates to nonzero for both). > > Is that intended?
(And if so, what does it mean?) I also see that > > the new tests look for warnings but it's not clear to me that that's > > their intent or that the dg-warning directives are being used to > > "suppress" warnings for issues in the tests. For instance, this > > is common: > > > > rcu_assign_pointer (gp, p); /* { dg-warning > > "\\\[-Wincompatible-pointer-types]" } */ > > > > but neither gp nor p is a _Dependent_ptr. (I may be missing > > the purpose of the compile-only tests. E.g., the only thing > > p0190r4_fig10.c seems to be exercising besides that the _Dependent_ptr > > qualifier is accepted is a couple of warnings, one of which is the one > > above.) I would suggest to look at tests for other qualifiers for examples and model the new ones after those.
Re: [PATCH] Adding _Dependent_ptr type qualifier in C part 1/3
Good points!

On the type-qualifier interactions, here is an initial list. Thoughts?

Thanx, Paul

_Alignas: Dependency-breaking optimizations would be avoided, and the variable would be aligned as specified.

_Atomic: Dependency-breaking optimizations would be avoided, and the variable would be accessed using C11 atomics.

const: This is not particularly useful for variables with static storage duration because compile-time initialization does not require dependency ordering, but then again, use of _Dependent_ptr on such variables is suspect to begin with. Otherwise, the const _Dependent_ptr variable would normally be initialized from another _Dependent_ptr variable or from a memory_order_consume load. The variable would disallow further stores and avoid breaking dependencies.

extern: Dependency-breaking optimizations would be avoided, and the variable would be usable within other translation units. This is also an unusual addition to a _Dependent_ptr unless also accompanied by _Thread_local because there are no known non-suspect multi-threaded-access use cases for _Dependent_ptr.

register: Dependency-breaking optimizations would be avoided, and the compiler would be given a hint to keep the variable in a register.

restrict: Dependency-breaking optimizations would be avoided, and the compiler may assume that the pointed-to object is only accessed through this pointer and through pointers derived from it.

static: Dependency-breaking optimizations would be avoided, and the variable would be static. This is also an unusual addition to a _Dependent_ptr unless also accompanied by _Thread_local because there are no known non-suspect multi-threaded-access use cases for _Dependent_ptr.

_Thread_local: The dependency-carrying variable is thread-local, and avoids dependency-breaking optimizations.

volatile: All accesses would be executed as per the abstract machine, and dependency-breaking optimizations would be avoided.
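Condensed into declarations, the list above might look as follows. This is purely illustrative: _Dependent_ptr is proposed syntax that no released compiler accepts, and the qualifier placement simply mirrors the examples elsewhere in this thread. Where exactly const would bind (the pointer vs. the pointee) is itself one of the open questions for the proposal:

```c
_Alignas(16) _Dependent_ptr struct foo *p1;	/* aligned, dependency-preserving */
_Atomic _Dependent_ptr struct foo *p2;		/* accessed via C11 atomics */
const _Dependent_ptr struct foo *p3;		/* no further stores once initialized
						   from another _Dependent_ptr or a
						   memory_order_consume load */
register _Dependent_ptr struct foo *p4;		/* hint: keep in a register */
static _Thread_local _Dependent_ptr struct foo *p5; /* per-thread, static storage */
volatile _Dependent_ptr struct foo *p6;		/* abstract-machine accesses */
```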
On Fri, Aug 02, 2019 at 08:30:49PM -0600, Martin Sebor wrote: > On 8/1/19 9:26 AM, Paul McKenney wrote: > >Excellent point, this discussion needs to be made official. > >Please see attached for an initial draft of a working paper. > > > >Thoughts? > > The draft is a start but it (obviously) needs a lot of work to cover > the constraints and semantics so I'll just comment on the patch in > a bit more detail. > > Since the feature is being (or will be) proposed to WG14 and could > change I would expect it to be only available under -std=c2x and > disabled otherwise, so that code that makes use of it does so with > the understanding that it's only experimental. (I just noticed > Akshat's email with a tweak to the patch. I'm not sure that issuing > a pedantic warning and having the name in the implementation namespace > is enough but others (the C FE maintainers) will know better.) > > Other than that, I would also expect to see more extensive test > coverage, at a minimum to exercise error detection (invalid uses, > conversions, etc.). For example, I see the code that rejects it > but no tests for declaring, say, a _Dependent_ptr-qualified integer. > What I don't think I see is code that rejects _Dependent_ptr-qualified > function pointers (AFAIK, POINTER_TYPE_P evaluates to nonzero for both). > Is that intended? (And if so, what does it mean?) I also see that > the new tests look for warnings but it's not clear to me that that's > their intent or that the dg-warning directives are being used to > "suppress" warnings for issues in the tests. For instance, this > is common: > > rcu_assign_pointer (gp, p); /* { dg-warning > "\\\[-Wincompatible-pointer-types]" } */ > > but neither gp nor p is a _Dependent_ptr. (I may be missing > the purpose of the compile-only tests. E.g., the only thing > p0190r4_fig10.c seems to be exercising besides that the _Dependent_ptr > qualifier is accepted is a couple of warnings, one of which is the one > above.)
I would suggest to look at tests for other qualifiers for > examples and model the new ones after those. > > I'm also wondering how the new qualifier interacts with others like > const. Should all combinations of qualifiers be accepted and do > they affect the semantics in any interesting way? This is probably > something to cover in the proposal. > > Martin > > > > > Thanx, Paul > > > >On Tue, Jul 30, 2019 at 12:46 PM Martin Sebor wrote: > >> > >>On 7/30/19 1:13 AM, Akshat Garg wrote: > >>>Hi, > >>>This patch includes C front-end code for a type qualifier _Dependent_ptr. > >> > >>Just some very high-level comments/questions. I only followed > >>the _Dependent_ptr discussion from a distance and I'm likely > >>missing some context so the first thing I looked for in this > >>patch is documentation of the new qualifier. Unless it's > >>a proposed C2x feature that I missed I would expect to find it > >>in section 6 - Extensions to
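Martin's suggestion to exercise error detection could follow the shape of the existing qualifier tests in gcc/testsuite. A hypothetical example — file name, options, and diagnostic texts are all invented for illustration, since the front end's actual messages would differ:

```c
/* Hypothetical: gcc.dg/dependent-ptr-errors.c */
/* { dg-do compile } */
/* { dg-options "-std=c2x" } */

_Dependent_ptr int i;			/* { dg-error "non-pointer" } */
_Dependent_ptr int (*fp)(void);		/* { dg-error "function pointer" } */
_Dependent_ptr int *p;			/* valid: object pointer, no diagnostic */
```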
Re: Doubts regarding the _Dependent_ptr keyword
On Thu, Jul 04, 2019 at 01:00:18PM +0200, Richard Biener wrote: > On Wed, Jul 3, 2019 at 6:33 PM Paul E. McKenney wrote: > > > > On Wed, Jul 03, 2019 at 05:47:56PM +0200, Richard Biener wrote: > > > On July 3, 2019 5:14:58 PM GMT+02:00, "Paul E. McKenney" > > > wrote: > > > >On Wed, Jul 03, 2019 at 12:39:41AM +0530, Akshat Garg wrote: > > > >> On Tue, Jul 2, 2019 at 8:40 PM Paul E. McKenney > > > > > > > >> wrote: > > > >> > > > >> > On Tue, Jul 02, 2019 at 02:15:55PM +0100, Ramana Radhakrishnan > > > >wrote: > > > >> > > On Tue, Jul 2, 2019 at 1:38 PM Paul E. McKenney > > > > > > > >> > wrote: > > > >> > > > > > >> > > > > > > >> > > > Once a user-created non-dependent pointer is assigned to, it is > > > >OK to > > > >> > > > break the dependency. > > > >> > > > > > >> > > Ok, that's good. > > > >> > > > > > > >> > > > Or am I missing the point here? > > > >> > > > > > >> > > I was just trying to make sure we were on the same page. I wonder > > > >if > > > >> > > marking this volatile would be sufficient for prototyping. I > > > >suspect > > > >> > > we would need another flag somewhere which someone with gimple > > > >> > > knowledge might be able to help us with. > > > >> > > > > >> > I expect that marking it as volatile would do the trick. ;-) > > > >> > > > > >> > Thanx, Paul > > > >> > > > > >> So, marking this pointer as volatile will not allow the compiler to > > > >> modify/optimize the statements, the pointer is appearing in. And we > > > >don't > > > >> need to push any other code inside any of the passes. Due to this, we > > > >want > > > >> to automatically say those dependent pointers are volatile and > > > >introduce a > > > >> new flag for this. Am I getting you guys correctly? Kindly, let me > > > >know? > > > > > > > >While I suspect that this might work, it would suppress way more > > > >optimizations than would be good. 
For but one example, consider: > > > > > > > > _Dependent_ptr int *p; > > > > > > > > p = atomic_load_explicit(gp, memory_order_consume); > > > > a = p->a; > > > > b = p->b; > > > > > > > >If "p" is volatile, then the compiler will be prevented from keeping > > > >it in a register, which would not make people coding fastpaths at > > > >all happy. ;-) > > > > > > > >Still, use of volatile might be a good technique for prototyping and > > > >analysis of _Dependent_ptr. > > > > > > With this example can you quickly summarize what kind of guarantees > > > _Dependent_ptr gives and how a compiler > > > Could possibly break those? > > > > First I suppose I should fix the bug in the above code. Or one of the > > bugs, at least. :-/ > > > > struct foo { > > int a; > > int b; > > }; > > > > _Dependent_ptr struct foo *p; > > > > p = atomic_load_explicit(gp, memory_order_consume); > > a = p->a; > > b = p->b; > > > > And then let me tweak the example a bit. For the first tweak: > > > > struct foo { > > int a; > > int b; > > }; > > > > struct foo default_foo = { .a = 42, .b = 43 }; > > int *gp = &default_foo; > > > > ... > > > > _Dependent_ptr int *p; > > > > p = atomic_load_explicit(gp, memory_order_consume); > > a = p->a; > > b = p->b; > > > > Suppose that the compiler used feedback-driven optimization, and noticed > > that the value of gp was almost always &default_foo. The compiler might > > decide to transform the last three lines as follows: > > > > p = atomic_load_explicit(gp, memory_order_consume); > > if (p == &default_foo) { > > a = default_foo.a; > > b = default_foo.b; > >
Re: Doubts regarding the _Dependent_ptr keyword
On Wed, Jul 03, 2019 at 05:47:56PM +0200, Richard Biener wrote: > On July 3, 2019 5:14:58 PM GMT+02:00, "Paul E. McKenney" > wrote: > >On Wed, Jul 03, 2019 at 12:39:41AM +0530, Akshat Garg wrote: > >> On Tue, Jul 2, 2019 at 8:40 PM Paul E. McKenney > > > >> wrote: > >> > >> > On Tue, Jul 02, 2019 at 02:15:55PM +0100, Ramana Radhakrishnan > >wrote: > >> > > On Tue, Jul 2, 2019 at 1:38 PM Paul E. McKenney > > > >> > wrote: > >> > > > >> > > > > >> > > > Once a user-created non-dependent pointer is assigned to, it is > >OK to > >> > > > break the dependency. > >> > > > >> > > Ok, that's good. > >> > > > > >> > > > Or am I missing the point here? > >> > > > >> > > I was just trying to make sure we were on the same page. I wonder > >if > >> > > marking this volatile would be sufficient for prototyping. I > >suspect > >> > > we would need another flag somewhere which someone with gimple > >> > > knowledge might be able to help us with. > >> > > >> > I expect that marking it as volatile would do the trick. ;-) > >> > > >> > Thanx, Paul > >> > > >> So, marking this pointer as volatile will not allow the compiler to > >> modify/optimize the statements, the pointer is appearing in. And we > >don't > >> need to push any other code inside any of the passes. Due to this, we > >want > >> to automatically say those dependent pointers are volatile and > >introduce a > >> new flag for this. Am I getting you guys correctly? Kindly, let me > >know? > > > >While I suspect that this might work, it would suppress way more > >optimizations than would be good. For but one example, consider: > > > > _Dependent_ptr int *p; > > > > p = atomic_load_explicit(gp, memory_order_consume); > > a = p->a; > > b = p->b; > > > >If "p" is volatile, then the compiler will be prevented from keeping > >it in a register, which would not make people coding fastpaths at > >all happy. ;-) > > > >Still, use of volatile might be a good technique for prototyping and > >analysis of _Dependent_ptr. 
> > With this example can you quickly summarize what kind of guarantees > _Dependent_ptr gives and how a compiler > Could possibly break those?

First I suppose I should fix the bug in the above code. Or one of the bugs, at least. :-/

	struct foo {
		int a;
		int b;
	};

	_Dependent_ptr struct foo *p;

	p = atomic_load_explicit(gp, memory_order_consume);
	a = p->a;
	b = p->b;

And then let me tweak the example a bit. For the first tweak:

	struct foo {
		int a;
		int b;
	};

	struct foo default_foo = { .a = 42, .b = 43 };
	int *gp = &default_foo;

	...

	_Dependent_ptr int *p;

	p = atomic_load_explicit(gp, memory_order_consume);
	a = p->a;
	b = p->b;

Suppose that the compiler used feedback-driven optimization, and noticed that the value of gp was almost always &default_foo. The compiler might decide to transform the last three lines as follows:

	p = atomic_load_explicit(gp, memory_order_consume);
	if (p == &default_foo) {
		a = default_foo.a;
		b = default_foo.b;
	} else {
		a = p->a;
		b = p->b;
	}

Now, as long as the value of gp had remained &default_foo for the full duration of execution, no harm done. But suppose the following code was executing concurrently with the above transformed code:

	struct foo *q;

	q = malloc(sizeof(*q));
	assert(q);
	q->a = 1729;
	q->b = 1730;
	atomic_store_explicit(gp, q, memory_order_release);
	do_something();
	default_foo.a = 1;
	default_foo.b = 2;
	atomic_store_explicit(gp, &default_foo, memory_order_release);

In this case, if the memory_order_consume() came just after the pointer was reset to &default_foo, it is possible that the transformed code would set "a" to 42 and "b" to 43, which might not be what the guy writing the code wanted to happen. One of the purposes of _Dependent_ptr is to prevent this transformation. This transformation can also happen if the developer's code contained a comparison to &default_foo -- an ARM or PowerPC compiler backend, upon seeing two pointers containing the same bits, would likely consider the two pointers as being interchangeable.
Re: Doubts regarding the _Dependent_ptr keyword
On Wed, Jul 03, 2019 at 12:39:41AM +0530, Akshat Garg wrote: > On Tue, Jul 2, 2019 at 8:40 PM Paul E. McKenney > wrote: > > > On Tue, Jul 02, 2019 at 02:15:55PM +0100, Ramana Radhakrishnan wrote: > > > On Tue, Jul 2, 2019 at 1:38 PM Paul E. McKenney > > wrote: > > > > > > > > > > > Once a user-created non-dependent pointer is assigned to, it is OK to > > > > break the dependency. > > > > > > Ok, that's good. > > > > > > > > Or am I missing the point here? > > > > > > I was just trying to make sure we were on the same page. I wonder if > > > marking this volatile would be sufficient for prototyping. I suspect > > > we would need another flag somewhere which someone with gimple > > > knowledge might be able to help us with. > > > > I expect that marking it as volatile would do the trick. ;-) > > > > Thanx, Paul > > > So, marking this pointer as volatile will not allow the compiler to > modify/optimize the statements, the pointer is appearing in. And we don't > need to push any other code inside any of the passes. Due to this, we want > to automatically say those dependent pointers are volatile and introduce a > new flag for this. Am I getting you guys correctly? Kindly, let me know? While I suspect that this might work, it would suppress way more optimizations than would be good. For but one example, consider:

	_Dependent_ptr int *p;

	p = atomic_load_explicit(gp, memory_order_consume);
	a = p->a;
	b = p->b;

If "p" is volatile, then the compiler will be prevented from keeping it in a register, which would not make people coding fastpaths at all happy. ;-) Still, use of volatile might be a good technique for prototyping and analysis of _Dependent_ptr. Thanx, Paul > Akshat > > > > > > regards > > > Ramana > > > > > > > > > > > Thanx, Paul > > > > > > > > > Ramana > > > > > >> > > > > > >> > > > > > >> > Does this sound like a workable plan? Let me know your > > thoughts.
If this sounds good then, we can do this for all the > > optimizations that may kill the dependencies at some point. > > > > > >> > > > > > >> > -Akshat > > > > > > > > > >
Re: Doubts regarding the _Dependent_ptr keyword
On Tue, Jul 02, 2019 at 07:53:20PM +0200, Richard Biener wrote: > On July 2, 2019 5:36:08 PM GMT+02:00, Jason Merrill wrote: > >On Mon, Jul 1, 2019 at 8:59 PM Paul E. McKenney > >wrote: > >> > >> On Tue, Jul 02, 2019 at 05:58:48AM +0530, Akshat Garg wrote: > >> > On Tue, Jun 25, 2019 at 9:49 PM Akshat Garg > >wrote: > >> > > >> > > On Tue, Jun 25, 2019 at 4:04 PM Ramana Radhakrishnan < > >> > > ramana@googlemail.com> wrote: > >> > > > >> > >> On Tue, Jun 25, 2019 at 11:03 AM Akshat Garg > >wrote: > >> > >> > > >> > >> > As we have some working front-end code for _Dependent_ptr, What should > >> > >> we do next? What I understand, we can start adding the library for > >> > >> dependent_ptr and its functions for C corresponding to the ones we created > >> > >> as C++ template library. Then, after that, we can move on to generating the > >> > >> assembly code part. > >> > >> > > >> > >> > >> > >> > >> > >> I think the next step is figuring out how to model the Dependent > >> > >> pointer information in the IR and figuring out what > >optimizations to > >> > >> allow or not with that information. At this point, I suspect we > >need > >> > >> a plan on record and have the conversation upstream on the > >lists. > >> > >> > >> > >> I think we need to put down a plan on record. > >> > >> > >> > >> Ramana > >> > > > >> > > [CCing gcc mailing list] > >> > > > >> > > So, shall I start looking over the pointer optimizations only and > >see what > >> > > information we may be needed on the same examples in the IR > >itself? > >> > > > >> > > - Akshat > >> > > > >> > I have coded an example where equality comparison kills dependency > >from the > >> > document P0190R4 as shown below : > >> > > >> > 1. struct rcutest rt = {1, 2, 3}; > >> > 2. void thread0 () > >> > 3. { > >> > 4. rt.a = -42; > >> > 5. rt.b = -43; > >> > 6. rt.c = -44; > >> > 7. rcu_assign_pointer(gp, &rt); > >> > 8. } > >> > 9. > >> > 10. void thread1 () > >> > 11.
{ > >> > 12. int i = -1; > >> > 13. int j = -1; > >> > 14. _Dependent_ptr struct rcutest *p; > >> > 15. > >> > 16. p = rcu_dereference(gp); > >> > 17. j = p->a; > >> > 18. if (p == &rt) > >> > 19. i = p->b; /* Dependency breaking point */ > >> > 20. else if (p) > >> > 21. i = p->c; > >> > 22. assert(i<0); > >> > 23. assert(j<0); > >> > 24. } > >> > The gimple unoptimized code produced for lines 17-24 is shown below > >> > > >> > 1. if (p_16 == &rt) > >> > 2. goto ; [INV] > >> > 3. else > >> > 4. goto ; [INV] > >> > 5. > >> > 6. : > >> > 7. i_19 = p_16->b; > >> > 8. goto ; [INV] > >> > 9. > >> > 10. : > >> > 11. if (p_16 != 0B) > >> > 12. goto ; [INV] > >> > 13. else > >> > 14. goto ; [INV] > >> > 15. > >> > 16. : > >> > 17. i_18 = p_16->c; > >> > 18. > >> > 19. : > >> > 20. # i_7 = PHI > >> > 21. _3 = i_7 < 0; > >> > 22. _4 = (int) _3; > >> > 23. assert (_4); > >> > 24. _5 = j_17 < 0; > >> > 25. _6 = (int) _5; > >> > 26. assert (_6); > >> > 27. return; > >> > > >> > The optimized code after -O1 is applied for the same lines is shown > >below : > >> > > >> > 1. if (_2 == &rt) > >> > 2. goto ; [30.00%] > >> > 3. else > >> > 4. goto ; [70.00%] > >> > 5. > >> > 6. [local count: 322122547]: > >> > 7. i_12 = rt.b; > >> > 8. goto ; [100.00%] > >> > 9. > >> > 10. [local count: 751619277]: > >> > 11. if (_1 != 0) > >> > 12. goto ; [50.00%] > >> > 13. else > >> > 14. goto ;
Re: Doubts regarding the _Dependent_ptr keyword
On Tue, Jul 02, 2019 at 02:15:55PM +0100, Ramana Radhakrishnan wrote: > On Tue, Jul 2, 2019 at 1:38 PM Paul E. McKenney wrote: > > > > > Once a user-created non-dependent pointer is assigned to, it is OK to > > break the dependency. > > Ok, that's good. > > > > Or am I missing the point here? > > I was just trying to make sure we were on the same page. I wonder if > marking this volatile would be sufficient for prototyping. I suspect > we would need another flag somewhere which someone with gimple > knowledge might be able to help us with. I expect that marking it as volatile would do the trick. ;-) Thanx, Paul > regards > Ramana > > > > > Thanx, Paul > > > > > Ramana > > > >> > > > >> > > > >> > Does this sound like a workable plan? Let me know your > > > >> > thoughts. If this sounds good then, we can do this for all the > > > >> > optimizations that may kill the dependencies at some point. > > > >> > > > > >> > -Akshat > > > > > >
Re: Doubts regarding the _Dependent_ptr keyword
On Tue, Jul 02, 2019 at 12:01:00PM +0100, Ramana Radhakrishnan wrote: > >> > >> It's worth figuring out what passes are doing this - however the worry > >> I have is that every pass now needs to be handling this case with > >> respect to pointer attributes. Is there some place that you are > >> storing said information and what is the transitive nature of > >> assignments with these attributes ? > >> > >> regards > >> Ramana > > > > I don't understand what information to store, can you explain? I was > > thinking of putting some code inside the pass, which breaks the dependency > > chains, which will avoid this type of conversions when it finds out that > > the pointer is _Dependent_ptr qualified otherwise don't do anything. Can > > you let me know what I am missing on? > > > > That the pointer has an attribute that it's a dependent pointer. How > do you expect the later passes to detect it has a dependent_ptr > attribute attached to it ? > > > In transitive nature, are you talking about the dependent pointer being > > assigned to some non-dependent pointer and after further assigned to some > > other pointer? > > Yes. What's the expected behaviour as per the document ? Once a user-created non-dependent pointer is assigned to, it is OK to break the dependency. Or am I missing the point here? Thanx, Paul > Ramana > >> > >> > >> > Does this sound like a workable plan? Let me know your thoughts. > >> > If this sounds good then, we can do this for all the optimizations that > >> > may kill the dependencies at some point. > >> > > >> > -Akshat >
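The rule Paul states — assignment to a user-created non-dependent pointer may break the dependency — can be sketched as follows, again in the proposed (non-standard) _Dependent_ptr syntax used elsewhere in the thread:

```c
_Dependent_ptr struct foo *p;
struct foo *q;	/* ordinary pointer: does not carry a dependency */

p = rcu_dereference(gp);	/* p carries a dependency from the load */
a = p->a;			/* dependency-ordered after the load */

q = p;				/* transitivity ends at this assignment ... */
b = q->b;			/* ... so the compiler may break the
				   dependency for accesses through q */
```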
Re: Doubts regarding the _Dependent_ptr keyword
On Tue, Jul 02, 2019 at 05:58:48AM +0530, Akshat Garg wrote: > On Tue, Jun 25, 2019 at 9:49 PM Akshat Garg wrote: > > > On Tue, Jun 25, 2019 at 4:04 PM Ramana Radhakrishnan < > > ramana@googlemail.com> wrote: > > > >> On Tue, Jun 25, 2019 at 11:03 AM Akshat Garg wrote: > >> > > >> > As we have some working front-end code for _Dependent_ptr, What should > >> we do next? What I understand, we can start adding the library for > >> dependent_ptr and its functions for C corresponding to the ones we created > >> as C++ template library. Then, after that, we can move on to generating the > >> assembly code part. > >> > > >> > >> > >> I think the next step is figuring out how to model the Dependent > >> pointer information in the IR and figuring out what optimizations to > >> allow or not with that information. At this point , I suspect we need > >> a plan on record and have the conversation upstream on the lists. > >> > >> I think we need to put down a plan on record. > >> > >> Ramana > > > > [CCing gcc mailing list] > > > > So, shall I start looking over the pointer optimizations only and see what > > information we may be needed on the same examples in the IR itself? > > > > - Akshat > > > I have coded an example where equality comparison kills dependency from the > document P0190R4 as shown below : > > 1. struct rcutest rt = {1, 2, 3}; > 2. void thread0 () > 3. { > 4. rt.a = -42; > 5. rt.b = -43; > 6. rt.c = -44; > 7. rcu_assign_pointer(gp, ); > 8. } > 9. > 10. void thread1 () > 11. { > 12.int i = -1; > 13.int j = -1; > 14._Dependent_ptr struct rcutest *p; > 15. > 16.p = rcu_dereference(gp); > 17.j = p->a; > 18. if (p == ) > 19.i = p->b; /*Dependency breaking point*/ > 20. else if(p) > 21. i = p->c; > 22. assert(i<0); > 23. assert(j<0); > 24. } > The gimple unoptimized code produced for lines 17-24 is shown below > > 1. if (p_16 == ) > 2. goto ; [INV] > 3. else > 4.goto ; [INV] > 5. > 6. : > 7. i_19 = p_16->b; > 8. goto ; [INV] > 9. > 10. : > 11. 
if (p_16 != 0B) > 12. goto ; [INV] > 13. else > 14. goto ; [INV] > 15. > 16. : > 17. i_18 = p_16->c; > 18. > 19. : > 20. # i_7 = PHI > 21. _3 = i_7 < 0; > 22. _4 = (int) _3; > 23. assert (_4); > 24. _5 = j_17 < 0; > 25. _6 = (int) _5; > 26. assert (_6); > 27. return; > > The optimized code after -O1 is applied for the same lines is shown below : > > 1. if (_2 == &rt) > 2. goto ; [30.00%] > 3. else > 4. goto ; [70.00%] > 5. > 6. [local count: 322122547]: > 7. i_12 = rt.b; > 8. goto ; [100.00%] > 9. > 10. [local count: 751619277]: > 11. if (_1 != 0) > 12. goto ; [50.00%] > 13. else > 14. goto ; [50.00%] > 15. > 16. [local count: 375809638]: > 17. i_11 = MEM[(dependent_ptr struct rcutest *)_2].c; > 18. > 19. [local count: 1073741824]: > 20. # i_7 = PHI > 21. _3 = i_7 < 0; > 22. _4 = (int) _3; > 23. assert (_4); > 24. _5 = j_10 < 0; > 25. _6 = (int) _5; > 26. assert (_6); > 27. return; Good show on tracing this through! > Statement 19 in the program gets converted from i_19 = p_16->b; in line 7 > in unoptimized code to i_12 = rt.b; in line 7 in optimized code which > breaks the dependency chain. We need to figure out the pass that does that > and put some handling code in there for the _dependent_ptr qualified > pointers. Passing simply -fipa-pure-const, -fguess-branch-probability or > any other option alone does not produce the optimized code that breaks the > dependency. But applying -O1, i.e., allowing all the optimizations does so. > As passes are applied in a certain order, we need to figure out up to what > pass the code remains the same and after which pass the dependency no > longer holds. So, we need to check the translated code after every pass. > > Does this sound like a workable plan? Let me know your thoughts. If > this sounds good then, we can do this for all the optimizations that may > kill the dependencies at some point. I don't know of a better plan. My usual question... Is there some way to script the checking of the translated code at the end of each pass?
Thanx, Paul
Re: Importance of transformations that turn data dependencies into control dependencies?
On Tue, Mar 01, 2016 at 05:55:07PM +0100, Michael Matz wrote: > Hi, > > On Tue, 1 Mar 2016, Richard Biener wrote: > > > > What about the example I gave above? Is it unrealistic for compilers > > > do ever do something like this, or is it just unlikely to gain much > > > performance, or is it just that GCC does not do this today? > > > > GCC does not do this today with the exception of value-profiling. GCC > > in other cases does not establish equivalences but only relations (a < > > b, etc.) that are not a problem as far as I can see because those do not > > allow to change expressions using a to use b. > > Made up example using relations: > > int32 a, b; > a = (b >> 31) & 1; > > --> > > if (b < 0) > a = 1; > else > a = 0; > > data-dep to control-dep and only relations :) (I think this is taken care > of by Paul's wording, ignoring the fact that these aren't pointers anyway > and hence don't carry a dependency through them, only onto them at max) Agreed, these are not pointers, and therefore do not carry dependencies. Thanx, Paul
Re: [isocpp-parallel] Proposal for new memory_order_consume definition
On Mon, Feb 29, 2016 at 07:17:55PM +0100, Michael Matz wrote: > Hi, > > On Sat, 27 Feb 2016, Paul E. McKenney wrote: > > > But we do already have something very similar with signed integer > > overflow. If the compiler can see a way to generate faster code that > > does not handle the overflow case, then the semantics suddenly change > > from twos-complement arithmetic to something very strange. The standard > > does not specify all the ways that the implementation might deduce that > > faster code can be generated by ignoring the overflow case, it instead > > simply says that signed integer overflow invoked undefined behavior. > > > > And if that is a problem, you use unsigned integers instead of signed > > integers. > > > > So it seems that we should be able to do something very similar here. > > For this case the important piece of information to convey one or the other > meaning in source code is the _type_ of involved entities, not annotations > on the operations. signed type -> undefined overflow, unsigned type -> > modulo arithmetic; easy, and it nicely carries automatically through > operation chains (and pointers) without any annotations. > > I feel much of the complexity in the memory order specifications, also > with your recent (much better) wording to explain dependency chains, would > be much easier if the 'carries-dependency' would be encoded into the types > of operands. For purpose of example, let's call the marker "blaeh" (not > atomic to not confuse with existing use :) ): > > int foo; > blaeh int global; > int *somep; > blaeh int *blaehp; > f () { > blaehp = &foo; // might be okay, adds restrictions on accesses through > // blaehp, but not through 'foo' directly > blaehp = &global; > if (somep == blaehp) > { > /* Even though the value is equal ... */ > ... *blaehp ... /* ... a compiler can't rewrite this into *somep */ > } > } > > A "carries-dependency" on some operation (e.g.
a call) would be added by > using a properly typed pointer at those arguments (or return type) where > it matters. You can't give a blaeh pointer to something only accepting > non-blaeh pointers (without cast). > > Pointer addition and similar transformations involving a blaeh pointer and > some integer would still give a blaeh pointer, and hence by default also > solve the problem of cancellations. > > Such marking via types would not solve all problems in an optimal way if > you had two overlapping but independent dependency chains (all of them > would collapse to one chain and hence be made dependent, which still is > conservatively correct). > > OTOH introducing new type qualifiers is a much larger undertaking, so I > can understand why one wants to avoid this. I think it'd ultimately be > clearer, though. As has been stated in this thread, we do need the unmarked variant. For the marked variant, there are quite a few possible solutions with varying advantages and disadvantages:

o	Attribute already exists, but is not carried by the type system. Could be enforced by external tools.

o	Storage class could be added with fewer effects on the type system, but the reaction to this suggestion in October was not all that positive.

o	Non-type keywords for objects have been suggested, might be worth revisiting.

o	Adding to the type system allows type enforcement on the one hand, but makes it harder to write code that can be used for both RCU-protected and not-RCU-protected data structures. (This sort of thing is not uncommon in the Linux kernel.)

There are probably others, but those are the ones I recall at the moment. Thanx, Paul
Re: [isocpp-parallel] Proposal for new memory_order_consume definition
On Sat, Feb 27, 2016 at 11:16:51AM -0800, Linus Torvalds wrote: > On Feb 27, 2016 09:06, "Paul E. McKenney" <paul...@linux.vnet.ibm.com> > wrote: > > > > > > But we do already have something very similar with signed integer > > overflow. If the compiler can see a way to generate faster code that > > does not handle the overflow case, then the semantics suddenly change > > from twos-complement arithmetic to something very strange. The standard > > does not specify all the ways that the implementation might deduce that > > faster code can be generated by ignoring the overflow case, it instead > > simply says that signed integer overflow invokes undefined behavior. > > > > And if that is a problem, you use unsigned integers instead of signed > > integers. > > Actually, in the case of the Linux kernel we just tell the compiler to > not be an ass. We use > > -fno-strict-overflow That is the one! > or something. I forget the exact compiler flag needed for "the standard is > a broken piece of shit and made things undefined for very bad reasons". > > See also the idiotic standard C alias rules. Same deal. For which we use -fno-strict-aliasing. > So no, standards aren't that important. When the standards screw up, the > right answer is not to turn the other cheek. Agreed, hence my current (perhaps quixotic and insane) attempt to get the standard to do something useful for dependency ordering. But if that doesn't work, yes, a fallback position is to get the relevant compilers to provide flags to avoid problematic behavior, similar to -fno-strict-overflow. Thanx, Paul > And undefined behavior is pretty much *always* a sign of "the standard is > wrong". > > Linus
Re: [isocpp-parallel] Proposal for new memory_order_consume definition
On Thu, Feb 25, 2016 at 04:46:50PM -0800, Hans Boehm wrote: > If carries_dependency affects semantics, then it should not be an attribute. I am not picky about the form of the marking. > The original design, or at least my understanding of it, was that it not > have semantics; it was only a suggestion to the compiler that it should > preserve dependencies instead of inserting a fence at the call site. > Dependency-based ordering would be preserved in either case. But I think > we're moving away from that view towards something that doesn't quietly add > fences. Yes, we do need to allow typical implementations to avoid quiet fence addition. > I do not think we can quite get away with defining a dependency in a way > that is unconditionally preserved by existing compilers, and thus I think > that we do probably need annotations along the dependency path. I just > don't see a way to otherwise deal with the case in which a compiler infers > an equivalent pointer and dereferences that instead of the original. This > can happen under so many (unlikely but) hard-to-define conditions that it > seems undefinable in an implementation-independent manner. "If the > implementation is able then " is, in my opinion, not > acceptable standards text. Hmmm... But we do already have something very similar with signed integer overflow. If the compiler can see a way to generate faster code that does not handle the overflow case, then the semantics suddenly change from twos-complement arithmetic to something very strange. The standard does not specify all the ways that the implementation might deduce that faster code can be generated by ignoring the overflow case, it instead simply says that signed integer overflow invokes undefined behavior. And if that is a problem, you use unsigned integers instead of signed integers. So it seems that we should be able to do something very similar here. 
If you don't use marking, and the compiler deduces that a given pointer that carries a given dependency is equal to some other pointer not carrying that same dependency, there is no dependency ordering. And, just as with the signed-integer-overflow case, if that is a problem for you, you can mark the pointers that you intend to carry dependencies. In both the signed-integer-overflow and pointer-value-deduction cases, most use cases don't need to care. In the integer case, this is because most use cases have small integer values that don't overflow. In the pointer case, this is because when the data structure is composed of lots of heap-allocated data items, the compiler really cannot deduce anything. Other safe pointer use cases involve statically allocated data items whose contents are compile-time constants (thus avoiding the need for any sort of ordering) and sentinel data items (as in the Linux kernel's circular linked lists) where there is no dereferencing. > Thus I see no way to both avoid adding syntax to functions that preserve > dependencies and continue to allow existing transformations that remove > dependencies we care about, e.g. due to equality comparisons. We can > hopefully ensure that without annotations compilers break things with very > low probability, so that there is a reasonable path forward for existing > code relying on dependency ordering (which currently also breaks with very > low probability unless you understand what the compiler is doing). But I > don't see a way for the standard to guarantee correctness without the added > syntax (or added optimization constraints that effectively assume all > functions were annotated). Your second sentence ("We can hopefully ensure...") does give me hope that we might be able to reach agreement. The intent of P0190R0 is to define a subset of operations where dependencies will be carried. Note that P0190R0 does call out comparisons as potentially unsafe. 
Thanx, Paul > On Sat, Feb 20, 2016 at 11:53 AM, Paul E. McKenney < > paul...@linux.vnet.ibm.com> wrote: > > > On Fri, Feb 19, 2016 at 09:15:16PM -0500, Tony V E wrote: > > > There's at least one easy answer in there: > > > > > > > If implementations must support annotation, what form should that > > > annotation take? P0190R0 recommends the [[carries_dependency]] > > > attribute, but I am not picky as long as it can be (1) applied > > > to all relevant pointer-like objects and (2) used in C as well > > > as C++. ;-) > > > > > > If an implementation must support it, then it is not an annotation but a > > keyword. So no [[]] > > > > I would be good with that approach, especially if the WG14 continues > > to stay away from annotations. > > > > For whatever it is worth, the introduction of intrinsics for comparisons > > that avoid breaking dependencies enables the annotation to remain optional.
Re: Importance of transformations that turn data dependencies into control dependencies?
On Fri, Feb 26, 2016 at 11:49:29AM +0100, Richard Biener wrote: > On Thu, Feb 25, 2016 at 6:33 PM, Torvald Riegel wrote: > > On Wed, 2016-02-24 at 13:14 +0100, Richard Biener wrote: > >> On Tue, Feb 23, 2016 at 8:38 PM, Torvald Riegel wrote: > >> > I'd like to know, based on the GCC experience, how important we consider > >> > optimizations that may turn data dependencies of pointers into control > >> > dependencies. I'm thinking about all optimizations or transformations > >> > that guess that a pointer might have a specific value, and then create > >> > (specialized) code that assumes this value that is only executed if the > >> > pointer actually has this value. For example: > >> > > >> > int d[2] = {23, compute_something()}; > >> > > >> > int compute(int v) { > >> > if (likely(v == 23)) return 23; > >> > else ; > >> > } > >> > > >> > int bar() { > >> > int *p = ptr.load(memory_order_consume); > >> > size_t reveal_that_p_is_in_d = p - d[0]; > >> > return compute(*p); > >> > } > >> > > >> > Could be transformed to (after inlining compute(), and specializing for > >> > the likely path): > >> > > >> > int bar() { > >> > int *p = ptr.load(memory_order_consume); > >> > if (p == d) return 23; > >> > else ; > >> > } > >> > >> Note that if a user writes > >> > >> if (p == d) > >>{ > >> ... do lots of stuff via p ... > >>} > >> > >> GCC might rewrite accesses to p as accesses to d and thus expose > >> those opportunities. Is that a transform that isn't valid then or is > >> the code written by the user (establishing the equivalency) to blame? > > > > In the context of this memory_order_consume proposal, this transform > > would be valid because the program has already "revealed" what value p > > has after the branch has been taken. > > > >> There's a PR where this kind of equivalence leads to unexpected (wrong?) > >> points-to results for example. 
> >> > >> > Other potential examples that come to mind are de-virtualization, or > >> > feedback-directed optimizations that have observed at runtime that a > >> > certain pointer is likely to be always equal to some other pointer (e.g., > >> > if p is almost always d[0], and specializing for that). > >> > >> That's the cases that are quite important in practice. > > > > Could you quantify this somehow, even if it's a very rough estimate? > > I'm asking because if it's significant and widely used, then this would > > require users or compiler implementors to make a difficult trade-off > > (i.e., do you want mo_consume performance or performance through those > > other optimizations?). > > Probably resolved by your followup on how the transform is safe anyway. > > >> > Also, it would be interesting to me to know how often we may turn data > >> > dependencies into control dependencies in cases where this doesn't > >> > affect performance significantly. > >> > >> I suppose we try to avoid that but can we ever know for sure? Like > >> speculative devirtualization does this (with the intent that it _does_ > >> matter, > >> of course). > >> > >> I suppose establishing non-dependence isn't an issue, like with the > >> vectorizer adding runtime dependence checks and applying versioning > >> to get a vectorized and a non-vectorized loop (in case there are > >> dependences)? > > > > I'm not sure I understand you correctly. Do you have a brief example, > > perhaps? For mo_consume and its data dependencies, if there might be a > > dependence, the compiler would have to preserve it; but I guess that > > both a vectorized loop and one that accesses each element separately > > would preserve dependences because it's doing those accesses, and they > > depend on the input data. > > OTOH, perhaps HW vector instructions don't get the ordering guarantees > > from data dependences -- Paul, do you know of any such cases? 
> > A brief example would be for > > void foo (int *a, int *b, int n) > { > for (int i = 0; i < n; ++i) > a[i] = b[i]; > } > > which we can vectorize like > > if (a + n < b || b + n < a) > { > vectorized loop > } > else > { > not vectorized loop > } > > note how we're not establishing equivalences between pointers but > non-dependence vs. possible dependence. Thank you for the clarification, I will check. If I am not too confused, such a loop might want to use x86 non-temporal stores, which require special handling even with acquire and release, but that is presumably already handled. Thanx, Paul > >> > The background for this question is Paul McKenney's recently updated > >> > proposal for a different memory_order_consume specification: > >> > http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0190r0.pdf > >> > > >> > In a nutshell, this requires a compiler to either prove that a pointer > >> > value is not carrying a dependency (simplified, its value somehow > >> > originates from a
Re: [isocpp-parallel] Proposal for new memory_order_consume definition
On Fri, Feb 19, 2016 at 09:15:16PM -0500, Tony V E wrote: > There's at least one easy answer in there: > > > If implementations must support annotation, what form should that > annotation take? P0190R0 recommends the [[carries_dependency]] > attribute, but I am not picky as long as it can be (1) applied > to all relevant pointer-like objects and (2) used in C as well > as C++. ;-) > > If an implementation must support it, then it is not an annotation but a > keyword. So no [[]] I would be good with that approach, especially if the WG14 continues to stay away from annotations. For whatever it is worth, the introduction of intrinsics for comparisons that avoid breaking dependencies enables the annotation to remain optional. Thanx, Paul > Sent from my BlackBerry portable Babbage Device > Original Message > From: Paul E. McKenney > Sent: Thursday, February 18, 2016 4:58 AM > To: paral...@lists.isocpp.org; linux-ker...@vger.kernel.org; > linux-a...@vger.kernel.org; gcc@gcc.gnu.org; llvm-...@lists.llvm.org > Reply To: paral...@lists.isocpp.org > Cc: pet...@infradead.org; j.algl...@ucl.ac.uk; will.dea...@arm.com; > dhowe...@redhat.com; ramana.radhakrish...@arm.com; luc.maran...@inria.fr; > a...@linux-foundation.org; peter.sew...@cl.cam.ac.uk; > torva...@linux-foundation.org; mi...@kernel.org > Subject: [isocpp-parallel] Proposal for new memory_order_consume definition > > Hello! > > A proposal (quaintly identified as P0190R0) for a new memory_order_consume > definition may be found here: > > http://www2.rdrop.com/users/paulmck/submission/consume.2016.02.10b.pdf > > As requested at the October C++ Standards Committee meeting, this > is a follow-on to P0098R1 that picks one alternative and describes > it in detail. This approach focuses on existing practice, with the > goal of supporting existing code with existing compilers. 
In the last > clang/LLVM patch I saw for basic support of this change, you could count > the changed lines and still have lots of fingers and toes left over. > Those who have been following this story will recognize that this is > a very happy contrast to work that would be required to implement the > definition in the current standard. > > I expect that P0190R0 will be discussed at the upcoming C++ Standards > Committee meeting taking place the week of February 29th. Points of > discussion are likely to include: > > o May memory_order_consume dependency ordering be used in > unannotated code? I believe that this must be the case, > especially given that this is our experience base. P0190R0 > therefore recommends this approach. > > o If memory_order_consume dependency ordering can be used in > unannotated code, must implementations support annotation? > I believe that annotation support should be required, at the very > least for formal verification, which can be quite difficult to > carry out on unannotated code. In addition, it seems likely > that annotations can enable much better diagnostics. P0190R0 > therefore recommends this approach. > > o If implementations must support annotation, what form should that > annotation take? P0190R0 recommends the [[carries_dependency]] > attribute, but I am not picky as long as it can be (1) applied > to all relevant pointer-like objects and (2) used in C as well > as C++. ;-) > > o If memory_order_consume dependency ordering can be used in > unannotated code, how best to define the situations where > the compiler can determine the exact value of the pointer in > question? (In current defacto implementations, this can > defeat dependency ordering. Interestingly enough, this case > is not present in the Linux kernel, but still needs to be > defined.) 
> > Options include: > > o Provide new intrinsics that carry out the > comparisons, but guarantee to preserve dependencies, > as recommended by P0190R0 (std::pointer_cmp_eq_dep(), > std::pointer_cmp_ne_dep(), std::pointer_cmp_gt_dep(), > std::pointer_cmp_ge_dep(), std::pointer_cmp_lt_dep(), > and std::pointer_cmp_le_dep()). > > o State that -any- comparison involving an unannotated > pointer loses the dependency. > > o How is the common idiom of marking pointers by setting low-order > bits to be supported when those pointers carry dependencies? > At the moment, I believe that setting bits in pointers results in > undefined behavior even without dependency ordering, so P0190R0 > kicks this particular can down the road. One option that > has been suggested is to provide intrinsics for this purpose. > (Sorry, but I forget who suggested this.) >
Re: [c++std-parallel-2008] Re: Compilers and RCU readers: Once more unto the breach!
r choices, regardless of my opinions.) That marking could be the [[carries_dependency]] attribute (at least assuming that C adds attributes), a type modifier (which might or might not go over well in Core), a variable modifier (which I don't understand well enough to have an opinion), or a storage class as suggested in 7.10. I am not all that worried about exactly what sort of marking we use, but I am convinced that marking is required. So, given that we need to mark things, what sort of marking would work best from your perspective? Thanx, Paul > On Tue, Sep 22, 2015 at 10:00 AM, Paul E. McKenney < > paul...@linux.vnet.ibm.com> wrote: > > > On Mon, Jul 13, 2015 at 05:44:59PM -0700, Paul E. McKenney wrote: > > > On Tue, May 19, 2015 at 05:55:10PM -0700, Paul E. McKenney wrote: > > > > Hello! > > > > > > > > Following up on last year's discussion ( > > https://lwn.net/Articles/586838/, > > > > https://lwn.net/Articles/588300/), I believe that we have a > > solution. If > > > > I am wrong, I am sure you all will let me know, and in great detail. > > ;-) > > > > > > > > The key simplification is to "just say no" to RCU-protected array > > indexes: > > > > https://lkml.org/lkml/2015/5/12/827, as was suggested by several > > people. > > > > This simplification means that rcu_dereference (AKA > > memory_order_consume) > > > > need only return pointers. This in turn avoids things like (x-x), > > > > (x*0), and (x%1) because if "x" is a pointer, these expressions either > > > > return non-pointers or are compilation errors. With a very few > > exceptions, > > > > dependency chains can lead -to- non-pointers, but cannot pass -through- > > > > them. 
> > > > > > > > The result is that dependencies are carried only by operations for > > > > which the compiler cannot easily optimize away those dependencies, > > > > these operations including simple assignment, integer offset (including > > > > indexing), dereferencing, casts, passing as a function argument, return > > > > values from functions and so on. A complete list with commentary > > starts > > > > on page 28 of: > > > > > > > > http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf > > > > > > And an update is available here: > > > > > > http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf > > > > > > Among other things, this update addresses the points about equality > > > comparisons introduced by the compiler, and outlines some of the > > > issues head-/tail-marked alternatives face with respect to abstraction. > > > The updated "Restricted Dependency Chains" section starts on page 28. > > > > > > Thoughts? > > > > This is updated based on many offline discussions. Section 7.9 has > > been updated accordingly, but still starts on page 28. This approach is > > again intended for large existing RCU code bases such as the Linux kernel. > > > > A new Section 7.10 starting on page 35 describes a possible longer-term > > solution for new-to-RCU code bases. This approach adds a storage class > > that indicates that the marked object carries dependencies. This allows > > better diagnostics (and arguably better code self-documentation) as well > > as providing something that formal-verification tools can handle while > > avoiding the need for compilers to trace dependency chains. > > > > http://www.rdrop.com/users/paulmck/submission/consume.2015.09.22a.pdf > > > > Thoughts? > > > > Thanx, Paul > > > >
Re: Compilers and RCU readers: Once more unto the breach!
On Mon, Jul 13, 2015 at 05:44:59PM -0700, Paul E. McKenney wrote: > On Tue, May 19, 2015 at 05:55:10PM -0700, Paul E. McKenney wrote: > > Hello! > > > > Following up on last year's discussion (https://lwn.net/Articles/586838/, > > https://lwn.net/Articles/588300/), I believe that we have a solution. If > > I am wrong, I am sure you all will let me know, and in great detail. ;-) > > > > The key simplification is to "just say no" to RCU-protected array indexes: > > https://lkml.org/lkml/2015/5/12/827, as was suggested by several people. > > This simplification means that rcu_dereference (AKA memory_order_consume) > > need only return pointers. This in turn avoids things like (x-x), > > (x*0), and (x%1) because if "x" is a pointer, these expressions either > > return non-pointers or are compilation errors. With a very few exceptions, > > dependency chains can lead -to- non-pointers, but cannot pass -through- > > them. > > > > The result is that dependencies are carried only by operations for > > which the compiler cannot easily optimize away those dependencies, > > these operations including simple assignment, integer offset (including > > indexing), dereferencing, casts, passing as a function argument, return > > values from functions and so on. A complete list with commentary starts > > on page 28 of: > > > > http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf > > And an update is available here: > > http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf > > Among other things, this update addresses the points about equality > comparisons introduced by the compiler, and outlines some of the > issues head-/tail-marked alternatives face with respect to abstraction. > The updated "Restricted Dependency Chains" section starts on page 28. > > Thoughts? This is updated based on many offline discussions. Section 7.9 has been updated accordingly, but still starts on page 28. 
This approach is again intended for large existing RCU code bases such as the Linux kernel. A new Section 7.10 starting on page 35 describes a possible longer-term solution for new-to-RCU code bases. This approach adds a storage class that indicates that the marked object carries dependencies. This allows better diagnostics (and arguably better code self-documentation) as well as providing something that formal-verification tools can handle while avoiding the need for compilers to trace dependency chains. http://www.rdrop.com/users/paulmck/submission/consume.2015.09.22a.pdf Thoughts? Thanx, Paul
Re: Compilers and RCU readers: Once more unto the breach!
On Tue, May 19, 2015 at 05:55:10PM -0700, Paul E. McKenney wrote: Hello! Following up on last year's discussion (https://lwn.net/Articles/586838/, https://lwn.net/Articles/588300/), I believe that we have a solution. If I am wrong, I am sure you all will let me know, and in great detail. ;-) The key simplification is to "just say no" to RCU-protected array indexes: https://lkml.org/lkml/2015/5/12/827, as was suggested by several people. This simplification means that rcu_dereference (AKA memory_order_consume) need only return pointers. This in turn avoids things like (x-x), (x*0), and (x%1) because if x is a pointer, these expressions either return non-pointers or are compilation errors. With a very few exceptions, dependency chains can lead -to- non-pointers, but cannot pass -through- them. The result is that dependencies are carried only by operations for which the compiler cannot easily optimize away those dependencies, these operations including simple assignment, integer offset (including indexing), dereferencing, casts, passing as a function argument, return values from functions and so on. A complete list with commentary starts on page 28 of: http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf And an update is available here: http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf Among other things, this update addresses the points about equality comparisons introduced by the compiler, and outlines some of the issues head-/tail-marked alternatives face with respect to abstraction. The updated Restricted Dependency Chains section starts on page 28. Thoughts? Thanx, Paul
Re: [c++std-parallel-1651] Re: Compilers and RCU readers: Once more unto the breach!
On Tue, May 26, 2015 at 07:08:35PM +0200, Torvald Riegel wrote: On Tue, 2015-05-19 at 17:55 -0700, Paul E. McKenney wrote: http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf I have been discussing Section 7.9 with Paul during the last week. While I think that 7.9 helps narrow down the problem somewhat, I'm still concerned that it effectively requires compilers to either track dependencies or conservatively prevent optimizations like value speculation and specialization based on that. Neither is a good thing for a compiler. I do believe that we can find some middle ground. 7.9 adds requirements that dependency chains stop if the program itself informs the compiler about the value of something in the dependency chain (e.g., as shown in Figure 33). Also, if a correct program that does not have undefined behavior must use a particular value, this is also considered as informing the compiler about that value. For example:

int arr[2];
int* x = foo.load(mo_consume);
if (x > arr)     // implies same object/array, so x is in arr[]
  int r1 = *x;   // compiler knows x == arr + 1

The program, assuming no undefined behavior, first tells the compiler that x should be within arr, and then the comparison tells the compiler that x != arr, so x == arr + 1 must hold because there are just two elements in the array. The updated version of Section 7.9 says that if undefined behavior allows the compiler to deduce the exact pointer value, as in the case you show above, the dependency chain is broken. Having these requirements is generally good, but we don't yet know how to specify this properly. For example, I suppose we'd need to also say that the compiler cannot assume to know anything about a value returned from an mo_consume load; otherwise, nothing prevents a compiler from using knowledge about the stores that the mo_consume load can read from (see Section 7.2). I expect that the Linux kernel's rcu_dereference() primitives would map to volatile memory_order_consume loads for this reason. 
Also, a compiler is still required to not do value-speculation or optimizations based on that. For example, this program:

void op(type *p) { foo /= p->a; bar = p->b; }
void bar() { pointer = ppp.load(mo_consume); op(pointer); }

... must not be transformed into this program, even if the compiler knows that global_var->a == 1:

void op(type *p) { /* unchanged */ }
void bar()
{
  pointer = ppp.load(mo_consume);
  if (pointer != global_var) {
    op(pointer);
  } else { // specialization for global_var
    // compiler knows global_var->a == 1;
    // compiler uses global_var directly, inlines, optimizes:
    bar = global_var->b;
  }
}

The compiler could optimize out the division if pointer == global_var but it must not access field b directly through global_var. This would be pretty awkward; the compiler may work based on an optimized expression in the specialization (i.e., create code that assumes global_var instead of pointer) but it would still have to carry around and use the non-optimized expression. Exactly how valuable is this sort of optimization in real code? And how likely is the compiler to actually be able to apply it? (I nevertheless will take another pass through the Linux kernel looking for global variables being added to RCU-protected linked structures. Putting a global into an RCU-protected structure seems more likely than an RCU-protected pointer into a two-element array.) This wouldn't be as bad if it were easily constrained to code sequences that really need the dependencies. However, 7.9 does not effectively contain dependencies to only the code that really needs them, IMO. Unless the compiler proves otherwise, it has to assume that a load from a pointer carries a dependency. Proving that is often very hard because it requires points-to analysis; 7.9 restricts this to intra-thread analysis but that is still nontrivial. Michael Matz had a similar concern (in terms of what it results in). 
Again, I will be looking through the Linux kernel for vulnerabilities to this sort of transformation. However, I am having a very hard time seeing how the compiler is going to know that much about the vast majority of the Linux-kernel use cases. The linked structures are allocated on the heap, not in arrays or in globals. Given that mo_consume is useful but a very specialized feature, I wouldn't be optimistic that 7.9 would actually be supported by many compilers. The trade-off between having to track dependencies or having to disallow optimizations is a bad one to make. The simple way out for a compiler would be to just emit mo_acquire instead of mo_consume and be done with all -- and this might be the most practical decision overall, or the default general-purpose implementation. At least I haven't heard any compiler implementer say that they think it's obviously worth implementing. I believe that we need
Re: Compilers and RCU readers: Once more unto the breach!
On Fri, May 22, 2015 at 08:43:44AM +0200, Ingo Molnar wrote: * Linus Torvalds torva...@linux-foundation.org wrote: (a) the official rules are completely pointless, and make sense only because the standard is written for some random abstract machine that doesn't actually exist. Presuming the intent of the abstract machine specification is to avoid being seen as biased towards any specific machine (politics), maybe write this as: (a) the official rules are written for a somewhat weird and complex union of all known and theoretically possible CPU architectures that exist or which might exist in the future, which machine does not actually exist in practice, but which allows a single abstract set of rules to apply to all machines. These rules are complex, but if applied to a specific machine they become considerably simpler. Here's a few examples: ... ? (Assuming it's a goal of this standard to be human parseable to more than a few dozen people on the planet.) Should something based on Section 7.9 go in, then I would need to add a more developer-friendly explanation in Documentation/RCU, no two ways about it! ;-) Thanx, Paul
Re: Compilers and RCU readers: Once more unto the breach!
On Fri, May 22, 2015 at 06:43:32AM -0400, Richard Kenner wrote: (Assuming it's a goal of this standard to be human parseable to more than a few dozen people on the planet.) Unfortunately, that's rarely a goal of most standards. ;-) My experience does match Richard's, sad to say. There has been some vigorous discussion in another thread involving undefined behavior and value narrowing, which has resulted in some useful changes to Section 7.9. The updated document is here: http://www2.rdrop.com/users/paulmck/RCU/consume.2015.05.22a.pdf Thanx, Paul
Re: Compilers and RCU readers: Once more unto the breach!
On Fri, May 22, 2015 at 06:30:29PM +0100, Will Deacon wrote: Hi Paul, On Thu, May 21, 2015 at 09:02:12PM +0100, Paul E. McKenney wrote: On Thu, May 21, 2015 at 08:24:22PM +0100, Will Deacon wrote: On Wed, May 20, 2015 at 07:16:06PM +0100, Paul E. McKenney wrote: On to #5:

    r1 = atomic_load_explicit(x, memory_order_consume);
    if (r1 == 42)
        atomic_store_explicit(y, r1, memory_order_relaxed);

    r2 = atomic_load_explicit(y, memory_order_consume);
    if (r2 == 42)
        atomic_store_explicit(x, 42, memory_order_relaxed);

The first thread's accesses are dependency ordered. The second thread's ordering is in a corner case that memory-barriers.txt does not cover. You are supposed to start control dependencies with READ_ONCE_CTRL(), not a memory_order_consume load (AKA rcu_dereference and friends). However, Alpha would have a full barrier as part of the memory_order_consume load, and the rest of the processors would (one way or another) respect the control dependency. And the compiler would have some fun trying to break it. But this is interesting because the first thread is ordered whilst the second is not, so doesn't that effectively forbid the compiler from constant-folding values if it can't prove that there is no dependency chain? You lost me on this one. Are you suggesting that the compiler speculate the second thread's atomic store? That would be very bad regardless of dependency chains. So what constant-folding optimization are you thinking of here? If the above example is not amenable to such an optimization, could you please give me an example where constant folding would apply in a way that is sensitive to dependency chains? Unless I'm missing something, I can't see what would prevent a compiler from looking at the code in thread1 and transforming it into the code in thread2 (i.e. constant folding r1 with 42 given that the taken branch must mean that r1 == 42).
However, such an optimisation breaks the dependency chain, which means that a compiler needs to walk backwards to see if there is a dependency chain extending to r1. Indeed! Which is one reason that (1) integers are not allowed in dependency chains with a very few extremely constrained exceptions and (2) sequences of comparisons and/or undefined-behavior considerations that allow the compiler to exactly determine the pointer value break the dependency chain. So the current Linux memory model would allow (r1 == 42 && r2 == 42), but I don't know of any hardware/compiler combination that would allow it. And no, I am -not- going to update memory-barriers.txt for this litmus test, its theoretical interest notwithstanding! ;-) Of course, I'm not asking for that at all! I'm just trying to see how your proposal holds up with the example. Whew! ;-) Thanx, Paul
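Point (2) above -- a comparison that lets the compiler exactly determine the pointer value breaks the dependency chain -- can be illustrated with a hedged C11 sketch. All identifiers here are mine, not from the thread.

```c
#include <stdatomic.h>
#include <assert.h>

/* Hedged sketch: after the equality comparison succeeds, the compiler
 * knows p == &default_obj and may substitute the constant address, so
 * the load on that branch carries no address dependency.  On the other
 * branch the dependency chain from the consume load remains intact. */
struct obj { int val; };

static struct obj default_obj = { 42 };
static _Atomic(struct obj *) gp = &default_obj;

int reader(void)
{
	struct obj *p = atomic_load_explicit(&gp, memory_order_consume);

	if (p == &default_obj)
		return default_obj.val;	/* chain broken: constant address */
	return p->val;			/* chain intact: address depends on p */
}
```

Single-threaded the two branches are indistinguishable; the difference only matters for the ordering guarantees a concurrent reader would get on weakly ordered hardware.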
Re: [c++std-parallel-1632] Re: Compilers and RCU readers: Once more unto the breach!
On Thu, May 21, 2015 at 06:17:43PM +0200, Michael Matz wrote: Hi, On Thu, 21 May 2015, Paul E. McKenney wrote: The point is -exactly- to codify the current state of affairs. Ah, I see, so it's not yet about creating a more useful (for compilers, that is) model. There are several approaches being considered for that as well, but we do need to codify current usage.

    char * fancy_assign (char *in) { return in; }
    ...
    char *x, *y;
    x = atomic_load_explicit(p, memory_order_consume);
    y = fancy_assign (x);
    atomic_store_explicit(q, y, memory_order_relaxed);

So, is there, or is there not a dependency carried from x to y in your proposed model (and which rule in your document states so)? Clearly, without any other language the compiler would have to assume that there is (because the equivalent 'y = x' assignment would carry the dependency). The dependency is not carried, though this is due to the current set of rules not covering atomic loads and stores, which I need to fix. Okay, so with the current regime(s), the dependency carries ... Yes, that is the intent.

o Rule 14 says that if a value is part of a dependency chain and is used as the actual parameter of a function call, then the dependency chain extends to the corresponding formal parameter, namely in of fancy_assign().

o Rule 15 says that if a value is part of a dependency chain and is returned from a function, then the dependency chain extends to the returned value in the calling function.

o And you are right. I need to make the first and second rules cover the relaxed atomic operations, or at least atomic loads and stores. Not that this is an issue for existing Linux-kernel code. But given such a change, the new version of rule 2 would extend the dependency chain to cover the atomic_store_explicit().

... (if this detail would be fixed). Okay, that's quite awful ...
If it has to assume this, then the whole model is not going to work very well, as usual with models that assume a certain less-optimal fact (carries-dep is less optimal for code generation purposes than not-carries-dep) unless very specific circumstances say it can be ignored. Although that is a good general rule of thumb, I do not believe that it applies to this situation, with the exception that I do indeed assume that no one is insane enough to do value-speculation optimizations for non-NULL values on loads from pointers. So what am I missing here? ... because you are then missing that if carries-dep can flow through function calls from arguments to return values by default, the compiler has to assume this in fact always happens when it can't see the function body, or can't analyze it. In effect that's making the whole "carries-dep stops at these and those uses" a useless exercise, because a malicious user (malicious in the sense of abusing the model to show that it's hindering optimizations), i.e. me, can hide all such carries-dep-stopping effects inside a function, et voila, the dependency carries through. So for a slightly more simple example:

    extern void *foo (void *);   // body not available
    x = load
    y = foo (x);
    store (y)

the compiler has to assume that there's a dep-chain from x to y; always. Yes, the compiler does have to make this assumption. And the intent behind the rules is to ensure that this assumption does not get in the way of reasonable optimizations. So although I am sure that you are as busy as the rest of us, I really do need you to go through the rules in detail before you get too much more excited about this. What's worse, it also has to assume a carries-dep for this:

    extern void foo (void *in, void **out1, void **out2);
    x = load
    foo (x, o1, o2);
    store (o1);
    store (o2);

Now the compiler has to assume that the body of 'foo' is just mean enough to make the dep-chain carry from in to *out1 or *out2 (i.e. it has to assume that for both).
This extends to _all_ memory accessible from foo's body, i.e. generally all global and all local address-taken variables, so as soon as you have a function call into which a dep-chain value flows you're creating a dep-chain extension from that value to each and every global piece of memory, because the compiler cannot assume that the black box called foo is not mean. This could conceivably be stopped by making normal stores not carry the dependency; then only the return value might be infected; but I don't see that in your rules, as a normal store is just an assignment in your model and hence rules 1 and 2 apply (that is, carries-dep flows through all assignments, incl. loads and stores). Basically whenever you can construct black boxes for the compiler, you have to limit their effects on such transitive relations like carries-dep by default
Re: Compilers and RCU readers: Once more unto the breach!
On Thu, May 21, 2015 at 08:24:22PM +0100, Will Deacon wrote: On Wed, May 20, 2015 at 07:16:06PM +0100, Paul E. McKenney wrote: On Wed, May 20, 2015 at 04:46:17PM +0100, Will Deacon wrote: On Wed, May 20, 2015 at 01:15:22PM +0100, Paul E. McKenney wrote: Indeed, something like this does -not- carry a dependency from the memory_order_consume load to q:

    char *p, *q;
    p = atomic_load_explicit(gp, memory_order_consume);
    q = gq + (intptr_t)p - (intptr_t)p;

If this was compiled with -O0, ARM and Power might well carry a dependency, but given any optimization, the assembly language would have no hint of any such dependency. So I am not seeing any particular danger. The above is a welcome relaxation over C11, since ARM doesn't even give you ordering based off false data dependencies. My concern is more to do with how this can be specified precisely without prohibiting honest compiler and hardware optimisations. That last is the challenge. I believe that I am pretty close, but I am sure that additional adjustment will be required. Especially given that we also need the memory model to be amenable to formal analysis. Well, there's still the whole thin-air problem which unfortunately doesn't go away with your proposal... (I was hoping that differentiating between true and false dependencies would solve that, but your set of rules isn't broad enough and I don't blame you at all for that!). Out of interest, how do you tackle examples (4) and (5) of (assuming the reads are promoted to consume loads)? http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html My understanding is that you permit both outcomes (I appreciate you're not directly tackling out-of-thin-air, but treatment of dependencies is heavily related). Thanks for taking the time to walk these two examples through. ;-) ;-) ;-) Let's see...
#4 is as follows, given promotion to memory_order_consume and (I am guessing) memory_order_relaxed:

    r1 = atomic_load_explicit(x, memory_order_consume);
    if (r1 == 42)
        atomic_store_explicit(y, r1, memory_order_relaxed);
    --
    r2 = atomic_load_explicit(y, memory_order_consume);
    if (r2 == 42)
        atomic_store_explicit(x, 42, memory_order_relaxed);
    else
        atomic_store_explicit(x, 42, memory_order_relaxed);

The second thread does not have a proper control dependency, even with the memory_order_consume load, because both branches assign the same value to x. This means that the compiler is within its rights to optimize this into the following:

    r1 = atomic_load_explicit(x, memory_order_consume);
    if (r1 == 42)
        atomic_store_explicit(y, r1, memory_order_relaxed);
    --
    r2 = atomic_load_explicit(y, memory_order_consume);
    atomic_store_explicit(x, 42, memory_order_relaxed);

There is no dependency between the second thread's pair of statements, so both the compiler and the CPU are within their rights to optimize further as follows:

    r1 = atomic_load_explicit(x, memory_order_consume);
    if (r1 == 42)
        atomic_store_explicit(y, r1, memory_order_relaxed);
    --
    atomic_store_explicit(x, 42, memory_order_relaxed);
    r2 = atomic_load_explicit(y, memory_order_consume);

If the compiler makes this final optimization, even mythical SC hardware is within its rights to end up with (r1 == 42 && r2 == 42). Which is fine, as far as I am concerned. Or at least something that can be lived with. Agreed. On to #5:

    r1 = atomic_load_explicit(x, memory_order_consume);
    if (r1 == 42)
        atomic_store_explicit(y, r1, memory_order_relaxed);

    r2 = atomic_load_explicit(y, memory_order_consume);
    if (r2 == 42)
        atomic_store_explicit(x, 42, memory_order_relaxed);

The first thread's accesses are dependency ordered. The second thread's ordering is in a corner case that memory-barriers.txt does not cover. You are supposed to start control dependencies with READ_ONCE_CTRL(), not a memory_order_consume load (AKA rcu_dereference and friends).
However, Alpha would have a full barrier as part of the memory_order_consume load, and the rest of the processors would (one way or another) respect the control dependency. And the compiler would have some fun trying to break it. But this is interesting because the first thread is ordered whilst the second is not, so doesn't that effectively forbid the compiler from constant-folding values if it can't prove that there is no dependency chain? You lost me on this one. Are you suggesting that the compiler speculate the second thread's atomic store? That would
Re: [c++std-parallel-1632] Re: Compilers and RCU readers: Once more unto the breach!
On Thu, May 21, 2015 at 04:22:38PM +0200, Michael Matz wrote: Hi, On Wed, 20 May 2015, Paul E. McKenney wrote: I'm not sure... you'd require the compiler to perform static analysis of loops to determine the state of the machine when they exit (if they exit!) in order to show whether or not a dependency is carried to subsequent operations. If it can't prove otherwise, it would have to assume that a dependency *is* carried, and it's not clear to me how it would use this information to restrict any subsequent dependency removing optimisations. It'd just convert consume to acquire. It should not need to, actually. [with GCC hat, and having only lightly read your document] Understood. Then you need to provide language or at least informal reasons why the compiler is allowed to not do that. Without that a compiler would have to be conservative, if it can't _prove_ that a dependency chain is stopped, then it has to assume it hasn't. For instance I can't really make out easily what your document says about the following simple situation (well, actually I have difficulties to differ between what you're proposing as the good-new model of this all, and what you're merely describing as different current states of affair): The point is -exactly- to codify the current state of affairs. I expect a follow-on effort to specify some sort of marking regimen, as noted in the last paragraph of 7.9 and as discussed with Torvald Riegel. However, given that there are not yet any implementations or practical experience with such markings, I suspect that some time will be required to hammer out a good marking scheme. char * fancy_assign (char *in) { return in; } ... char *x, *y; x = atomic_load_explicit(p, memory_order_consume); y = fancy_assign (x); atomic_store_explicit(q, y, memory_order_relaxed); So, is there, or is there not a dependency carried from x to y in your proposed model (and which rule in your document states so)? 
Clearly, without any other language the compiler would have to assume that there is (because the equivalent 'y = x' assignment would carry the dependency). The dependency is not carried, though this is due to the current set of rules not covering atomic loads and stores, which I need to fix. Here is the sequence of events: o A memory_order_consume load heads a dependency chain. o Rule 2 says that if a value is part of a dependency chain and is used as the right-hand side of an assignment operator, the expression extends the chain to cover the assignment. And I switched to numbered bullet items here: http://www2.rdrop.com/users/paulmck/RCU/consume.2015.05.21a.pdf o Rule 14 says that if a value is part of a dependency chain and is used as the actual parameter of a function call, then the dependency chain extends to the corresponding formal parameter, namely in of fancy_assign(). o Rule 15 says that if a value is part of a dependency chain and is returned from a function, then the dependency chain extends to the returned value in the calling function. o And you are right. I need to make the first and second rules cover the relaxed atomic operations, or at least atomic loads and stores. Not that this is an issue for existing Linux-kernel code. But given such a change, the new version of rule 2 would extend the dependency chain to cover the atomic_store_explicit(). If it has to assume this, then the whole model is not going to work very well, as usual with models that assume a certain less-optimal fact (carries-dep is less optimal for code generation purposes that not-carries-dep) unless very specific circumstances say it can be ignored. Although that is a good general rule of thumb, I do not believe that it applies to this situation, with the exception that I do indeed assume that no one is insane enough to do value-speculation optimizations for non-NULL values on loads from pointers. So what am I missing here? 
Do you have a specific example where the compiler would need to suppress a production-quality optimization? Thanx, Paul
Re: Compilers and RCU readers: Once more unto the breach!
On Thu, May 21, 2015 at 01:42:11PM -0700, Linus Torvalds wrote: On Thu, May 21, 2015 at 1:02 PM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: The compiler can (and does) speculate non-atomic non-volatile writes in some cases, but I do not believe that it is permitted to speculate either volatile or atomic writes. I do *not* believe that a compiler is ever allowed to speculate *any* writes - volatile or not - unless the compiler can prove that the end result is either single-threaded, or the write in question is guaranteed to only be visible in that thread (ie local stack variable etc). Quite frankly, I'd be much happier if the C standard just said so outright. So would I. ;-) The usual example is the following, where x is non-volatile and non-atomic:

    if (a)
        x = 42;
    else
        x = 7;

The current rules allow this to be transformed as follows:

    x = 7;
    if (a)
        x = 42;

So the variable x takes on the value 7 momentarily when it otherwise would not have. At least C11 now prohibits drive-by speculative writes, where the compiler writes a larger size and then fixes things up afterwards. Also, I do think that the whole consume read should be explained better to compiler writers. Right now the language (including very much in the restricted dependency model) is described in very abstract terms. Yet those abstract terms are actually very subtle and complex, and very opaque to a compiler writer. If I was a compiler writer, I'd absolutely detest that definition. It's very far removed from my problem space as a compiler writer, and nothing in the language *explains* the odd and subtle abstract rules. It smells ad-hoc to me. Now, I actually understand the point of those odd and abstract rules, but to a compiler writer that doesn't understand the background, the whole section reads as this is really painful for me to track all those dependencies and what kills them. So I would very much suggest that there would be language that *explains* this.
Basically, tell the compiler writer:

(a) the official rules are completely pointless, and make sense only because the standard is written for some random abstract machine that doesn't actually exist.

(b) in *real life*, the whole and only point of the rules is to make sure that the compiler doesn't turn a data dependency into a control dependency, which on ARM and POWERPC does not honor causal memory ordering

(c) on x86, since *all* memory accesses are causal, all the magical dependency rules are just pointless anyway, and what it really means is that you cannot re-order accesses with value speculation.

(d) the *actual* relevant rule for a compiler writer is very simple: the compiler must not do value speculation on a consume load, and the abstract machine rules are written so that any other sane optimization is legal.

(e) if the compiler writer really thinks they want to do value speculation, they have to turn the consume load into an acquire load. And you have to do that anyway on architectures like alpha that aren't causal even for data dependencies.

I am certainly more than willing to add this sort of wording to the document. I personally think the whole abstract machine model of the C language is a mistake. It would be much better to talk about things in terms of actual code generation and actual issues. Make all the problems much more concrete, with actual examples of how memory ordering matters on different architectures. 99% of all the problems with the whole consume memory ordering comes not from anything relevant to a compiler writer. All of it comes from trying to define the issue in the wrong terms. I certainly cannot deny that there seems to be significant confusion in the discussion thus far. Thanx, Paul
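The forbidden transformation in (b) and (d) can be shown side by side in a hedged C11 sketch (all identifiers are mine): value speculation replaces the address dependency of a consume load with a control dependency, which ARM and Power do not order.

```c
#include <stdatomic.h>
#include <assert.h>

/* Hedged sketch: the same reader before and after value speculation. */
struct node { int val; };

static struct node guess = { 1 };
static _Atomic(struct node *) head = &guess;

int read_unspeculated(void)
{
	struct node *p = atomic_load_explicit(&head, memory_order_consume);

	return p->val;		/* address dependency: ordered on ARM/Power */
}

int read_speculated(void)	/* the transformation the rules forbid */
{
	struct node *p = atomic_load_explicit(&head, memory_order_consume);

	if (p == &guess)
		return guess.val;	/* load no longer depends on p:
					   only a control dependency left */
	return p->val;
}
```

Both functions compute the same value; the speculated form is "faster" only because it trades away the ordering that a concurrent reader on a weakly ordered machine was relying on, which is exactly why rule (d) bans it for consume loads.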
Re: [c++std-parallel-1614] Re: Compilers and RCU readers: Once more unto the breach!
On Wed, May 20, 2015 at 11:03:00AM +0200, Richard Biener wrote: On Wed, May 20, 2015 at 9:34 AM, Jens Maurer jens.mau...@gmx.net wrote: On 05/20/2015 04:34 AM, Paul E. McKenney wrote: On Tue, May 19, 2015 at 06:57:02PM -0700, Linus Torvalds wrote: - the you can add/subtract integral values still opens you up to language lawyers claiming (char *)ptr - (intptr_t)ptr preserving the dependency, which it clearly doesn't. But language-lawyering it does, since all those operations (cast to pointer, cast to integer, subtracting an integer) claim to be dependency-preserving operations. [...] There are some stranger examples, such as (char *)ptr - ((intptr_t)ptr)/7, but in that case, if the resulting pointer happens by chance to reference valid memory, I believe a dependency would still be carried. [...] From a language lawyer standpoint, pointer arithmetic is only valid within an array. These examples seem to go beyond the bounds of the array and therefore have undefined behavior. C++ standard section 5.7 paragraph 4 If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined. C99 and C11 identical phrasing in 6.5.6 paragraph 8 Of course you can try to circumvent that by doing (char*)((intptr_t)ptr - (intptr_t)ptr + (intptr_t)ptr) (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65752 for extra fun). Which (IMHO) gets you into the standard language that only makes conversion of the exact same integer back to a pointer well-defined(?) I am feeling good about leaving the restriction and calling out the two paragraphs in a footnote, then. ;-) Thanx, Paul
Re: [c++std-parallel-1616] Re: Compilers and RCU readers: Once more unto the breach!
On Wed, May 20, 2015 at 09:34:10AM +0200, Jens Maurer wrote: On 05/20/2015 04:34 AM, Paul E. McKenney wrote: On Tue, May 19, 2015 at 06:57:02PM -0700, Linus Torvalds wrote: - the you can add/subtract integral values still opens you up to language lawyers claiming (char *)ptr - (intptr_t)ptr preserving the dependency, which it clearly doesn't. But language-lawyering it does, since all those operations (cast to pointer, cast to integer, subtracting an integer) claim to be dependency-preserving operations. [...] There are some stranger examples, such as (char *)ptr - ((intptr_t)ptr)/7, but in that case, if the resulting pointer happens by chance to reference valid memory, I believe a dependency would still be carried. [...] From a language lawyer standpoint, pointer arithmetic is only valid within an array. These examples seem to go beyond the bounds of the array and therefore have undefined behavior. C++ standard section 5.7 paragraph 4: If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined. C99 and C11 have identical phrasing in 6.5.6 paragraph 8. Even better! I added a footnote calling out these two paragraphs. Thanx, Paul
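The 6.5.6p8 rule quoted above can be illustrated with a small, well-defined counterpart (identifiers are mine): pointer arithmetic is defined only while the result stays within the array object or one past its end, which is why (char *)ptr - (intptr_t)ptr, landing far outside any such array, is undefined and need not preserve a dependency.

```c
#include <stddef.h>
#include <assert.h>

/* Hedged sketch of C11 6.5.6p8: arithmetic within the array, including
 * the one-past-the-end pointer, is well defined; anything beyond that
 * (such as subtracting the pointer's own integer value) is not. */
static int a[4] = { 1, 2, 3, 4 };

ptrdiff_t in_bounds_span(void)
{
	int *first = &a[0];
	int *past  = &a[4];	/* one past the end: still a valid pointer */

	return past - first;	/* well defined: 4 */
}
```

Forming &a[5], by contrast, would already be undefined behavior even if the pointer were never dereferenced.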
Re: Compilers and RCU readers: Once more unto the breach!
On Wed, May 20, 2015 at 12:47:45PM +0100, Will Deacon wrote: Hi Paul, On Wed, May 20, 2015 at 03:41:48AM +0100, Paul E. McKenney wrote: On Tue, May 19, 2015 at 07:10:12PM -0700, Linus Torvalds wrote: On Tue, May 19, 2015 at 6:57 PM, Linus Torvalds torva...@linux-foundation.org wrote: So I think you're better off just saying that operations designed to drop significant bits break the dependency chain, and give things like 1 and (char *)ptr-(uintptr_t)ptr as examples of such. Making that just an extension of your existing 0 language would seem to be natural. Works for me! I added the following bullet to the list of things that break dependencies: If a pointer is part of a dependency chain, and if the values added to or subtracted from that pointer cancel the pointer value so as to allow the compiler to precisely determine the resulting value, then the resulting value will not be part of any dependency chain. For example, if p is part of a dependency chain, then ((char *)p-(uintptr_t)p)+65536 will not be. Seem reasonable? Whilst I understand what you're saying (the ARM architecture makes these sorts of distinctions when calling out dependency-based ordering), it feels like we're dangerously close to defining the difference between a true and a false dependency. If we want to do this in the context of the C language specification, you run into issues because you need to evaluate the program in order to determine data values in order to determine the nature of the dependency. Indeed, something like this does -not- carry a dependency from the memory_order_consume load to q:

    char *p, *q;
    p = atomic_load_explicit(gp, memory_order_consume);
    q = gq + (intptr_t)p - (intptr_t)p;

If this was compiled with -O0, ARM and Power might well carry a dependency, but given any optimization, the assembly language would have no hint of any such dependency. So I am not seeing any particular danger.
You tackle this above by saying to allow the compiler to precisely determine the resulting value, but I can't see how that can be cleanly fitted into something like the C language specification. I am sure that there will be significant rework from where this document is to language appropriate for the standard. Which is why I am glad that Jens is taking an interest in this, as he is particularly good at producing standards language. Even if it can, then we'd need to reword the ?: treatment that you currently have: If a pointer is part of a dependency chain, and that pointer appears in the entry of a ?: expression selected by the condition, then the chain extends to the result. which I think requires the state of the condition to be known statically if we only want to extend the chain from the selected expression. In the general case, wouldn't a compiler have to assume that the chain is extended from both? In practice, yes, if the compiler cannot determine which expression is selected, it must arrange for the dependency to be carried from either, depending on the run-time value of the condition. But you would have to work pretty hard to create code that did not carry the dependencies as required, no? Additionally, what about the following code? char *x = y ? z : z; Does that extend a dependency chain from z to x? If so, I can imagine a CPU breaking that in practice. I am not seeing this. I would expect the compiler to optimize to something like this: char *x = z; How does this avoid carrying the dependency? Or are you saying that ARM loses the dependency via a store to memory and a later reload? That would be a bit surprising... Humans will understand, and compiler writers won't care. They will either depend on hardware semantics anyway (and argue that your language is tight enough that they don't need to do anything special) or they will turn the consume into an acquire (on platforms that have too weak hardware). Agreed.
Plus Core Working Group will hammer out the exact wording, should this approach meet their approval. For the avoidance of doubt, I'm completely behind any attempts to tackle this problem, but I anticipate an uphill struggle getting this text into the C standard. Is your intention to change the carries-a-dependency relation to encompass this change? I completely agree that this won't be easy, but this is the task at hand. And yes, the intent is to change carries-a-dependency, given that the current wording isn't helping anything. ;-) Thanx, Paul
Re: Compilers and RCU readers: Once more unto the breach!
On Wed, May 20, 2015 at 02:18:37PM +0100, David Howells wrote: Paul E. McKenney paul...@linux.vnet.ibm.com wrote: Additionally, what about the following code? char *x = y ? z : z; Does that extend a dependency chain from z to x? If so, I can imagine a CPU breaking that in practice. I am not seeing this. I would expect the compiler to optimize to something like this: char *x = z; Why? What if y has a potential side-effect (say it makes a function call)? I was thinking of y as a simple variable, but if it is something more complex, then the compiler could do this, right? char *x; y; x = z; Thanx, Paul
Re: Compilers and RCU readers: Once more unto the breach!
On Wed, May 20, 2015 at 02:44:30PM +0100, Ramana Radhakrishnan wrote: On 20/05/15 14:37, David Howells wrote: Paul E. McKenney paul...@linux.vnet.ibm.com wrote: I was thinking of y as a simple variable, but if it is something more complex, then the compiler could do this, right? char *x; y; x = z; Yeah. I presume it has to maintain the ordering, though. The scheduler for e.g. is free to reorder if it can prove there is no dependence (or indeed side-effects for y) between insns produced for y and `x = z'. So for example, if y is independent of z, the compiler can do the following: char *x; x = z; y; But the dependency ordering is still maintained from z to x, so this is not a problem. Or am I missing something subtle here? Thanx, Paul
Re: [c++std-parallel-1624] Re: Compilers and RCU readers: Once more unto the breach!
On Wed, May 20, 2015 at 02:37:05PM +0100, David Howells wrote: Paul E. McKenney paul...@linux.vnet.ibm.com wrote: I was thinking of y as a simple variable, but if it is something more complex, then the compiler could do this, right? char *x; y; x = z; Yeah. I presume it has to maintain the ordering, though. Agreed. Unless of course y writes to x or some such. Given that there is already code in the Linux kernel relying on dependencies being carried through stores to local variables, this should not be a problem. Or am I missing something? Thanx, Paul
Re: Compilers and RCU readers: Once more unto the breach!
On Wed, May 20, 2015 at 04:46:17PM +0100, Will Deacon wrote: On Wed, May 20, 2015 at 01:15:22PM +0100, Paul E. McKenney wrote: On Wed, May 20, 2015 at 12:47:45PM +0100, Will Deacon wrote: On Wed, May 20, 2015 at 03:41:48AM +0100, Paul E. McKenney wrote: If a pointer is part of a dependency chain, and if the values added to or subtracted from that pointer cancel the pointer value so as to allow the compiler to precisely determine the resulting value, then the resulting value will not be part of any dependency chain. For example, if p is part of a dependency chain, then ((char *)p-(uintptr_t)p)+65536 will not be. Seem reasonable? Whilst I understand what you're saying (the ARM architecture makes these sorts of distinctions when calling out dependency-based ordering), it feels like we're dangerously close to defining the difference between a true and a false dependency. If we want to do this in the context of the C language specification, you run into issues because you need to evaluate the program in order to determine data values in order to determine the nature of the dependency. Indeed, something like this does -not- carry a dependency from the memory_order_consume load to q:

    char *p, *q;
    p = atomic_load_explicit(gp, memory_order_consume);
    q = gq + (intptr_t)p - (intptr_t)p;

If this was compiled with -O0, ARM and Power might well carry a dependency, but given any optimization, the assembly language would have no hint of any such dependency. So I am not seeing any particular danger. The above is a welcome relaxation over C11, since ARM doesn't even give you ordering based off false data dependencies. My concern is more to do with how this can be specified precisely without prohibiting honest compiler and hardware optimisations. That last is the challenge. I believe that I am pretty close, but I am sure that additional adjustment will be required. Especially given that we also need the memory model to be amenable to formal analysis.
Out of interest, how do you tackle examples (4) and (5) of (assuming the reads are promoted to consume loads)? http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html My understanding is that you permit both outcomes (I appreciate you're not directly tackling out-of-thin-air, but treatment of dependencies is heavily related). Let's see... #4 is as follows, given promotion to memory_order_consume and (I am guessing) memory_order_relaxed:

    r1 = atomic_load_explicit(x, memory_order_consume);
    if (r1 == 42)
        atomic_store_explicit(y, r1, memory_order_relaxed);
    --
    r2 = atomic_load_explicit(y, memory_order_consume);
    if (r2 == 42)
        atomic_store_explicit(x, 42, memory_order_relaxed);
    else
        atomic_store_explicit(x, 42, memory_order_relaxed);

The second thread does not have a proper control dependency, even with the memory_order_consume load, because both branches assign the same value to x. This means that the compiler is within its rights to optimize this into the following:

    r1 = atomic_load_explicit(x, memory_order_consume);
    if (r1 == 42)
        atomic_store_explicit(y, r1, memory_order_relaxed);
    --
    r2 = atomic_load_explicit(y, memory_order_consume);
    atomic_store_explicit(x, 42, memory_order_relaxed);

There is no dependency between the second thread's pair of statements, so both the compiler and the CPU are within their rights to optimize further as follows:

    r1 = atomic_load_explicit(x, memory_order_consume);
    if (r1 == 42)
        atomic_store_explicit(y, r1, memory_order_relaxed);
    --
    atomic_store_explicit(x, 42, memory_order_relaxed);
    r2 = atomic_load_explicit(y, memory_order_consume);

If the compiler makes this final optimization, even mythical SC hardware is within its rights to end up with (r1 == 42 && r2 == 42). Which is fine, as far as I am concerned. Or at least something that can be lived with.
On to #5: r1 = atomic_load_explicit(x, memory_order_consume); if (r1 == 42) atomic_store_explicit(y, r1, memory_order_relaxed); r2 = atomic_load_explicit(y, memory_order_consume); if (r2 == 42) atomic_store_explicit(x, 42, memory_order_relaxed); The first thread's accesses are dependency ordered. The second thread's ordering is in a corner case that memory-barriers.txt does not cover. You are supposed to start control dependencies with READ_ONCE_CTRL(), not a memory_order_consume load (AKA rcu_dereference and friends). However, Alpha would have a full barrier as part
Re: [c++std-parallel-1632] Re: Compilers and RCU readers: Once more unto the breach!
On Wed, May 20, 2015 at 04:54:51PM +0100, Andrew Haley wrote: On 05/20/2015 04:46 PM, Will Deacon wrote: I'm not sure... you'd require the compiler to perform static analysis of loops to determine the state of the machine when they exit (if they exit!) in order to show whether or not a dependency is carried to subsequent operations. If it can't prove otherwise, it would have to assume that a dependency *is* carried, and it's not clear to me how it would use this information to restrict any subsequent dependency removing optimisations. It'd just convert consume to acquire. It should not need to, actually. Thanx, Paul
Re: Compilers and RCU readers: Once more unto the breach!
On Wed, May 20, 2015 at 03:15:48PM +0100, Ramana Radhakrishnan wrote: On 20/05/15 15:03, Paul E. McKenney wrote: On Wed, May 20, 2015 at 02:44:30PM +0100, Ramana Radhakrishnan wrote: On 20/05/15 14:37, David Howells wrote: Paul E. McKenney paul...@linux.vnet.ibm.com wrote: I was thinking of y as a simple variable, but if it is something more complex, then the compiler could do this, right? char *x; y; x = z; Yeah. I presume it has to maintain the ordering, though. The scheduler for e.g. is free to reorder if it can prove there is no dependence (or indeed side-effects for y) between insns produced for y and `x = z'. So for example, if y is independent of z, the compiler can do the following: char *x; x = z; y; But the dependency ordering is still maintained from z to x, so this is not a problem. Well, reads if any of x (assuming x was initialized elsewhere) would need to happen before x got assigned to z. Agreed, there needs to be a memory_order_consume load up there somewhere. (AKA rcu_dereference().) I understood the original maintain the ordering as between the statements `x = z' and `y'. Ah, I was assuming between x and z. David, what was your intent? ;-) Or am I missing something subtle here? No, it sounds like we are on the same page here. Whew! ;-) Thanx, Paul
Re: Compilers and RCU readers: Once more unto the breach!
On Tue, May 19, 2015 at 06:57:02PM -0700, Linus Torvalds wrote: On Tue, May 19, 2015 at 5:55 PM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf From a very quick read-through, the restricted dependency chain in 7.9 seems to be reasonable, and essentially covers what hardware gives us anyway, making compiler writers happy. I would clarify the language somewhat: - it says that the result of a cast of a pointer is a dependency. You need to make an exception for casting to bool, methinks (because that's effectively just a test-against-NULL, which you later describe as terminating the dependency). Maybe get rid of the "any type" wording, and just limit it to casts to types of size intptr_t, ie ones that don't drop significant bits. All the other rules talk about [u]intptr_t anyway. Excellent point! I now say: If a pointer is part of a dependency chain, then casting it (either explicitly or implicitly) to any pointer-sized type extends the chain to the result. If this approach works out, the people in the Core Working Group will come up with alternative language-lawyer-proof wording, but this informal version will hopefully do for the moment. - you clarify that the trivial & 0 and | ~0 kill the dependency chain, but if you really want to be a stickler, you might want to extend it to a few more cases. Things like & 1 (to extract a tag from the low bit of a tagged pointer) should likely also drop the dependency, since a compiler would commonly end up using the end result as a conditional even if the code was written to then use casting etc to look like a dereference. Ah, how about the following? If a value of type intptr_t or uintptr_t is part of a dependency chain, and if that value is one of the operands to an & or | infix operator whose result has too few or too many bits set, then the resulting value will not be part of any dependency chain.
For example, on a 64-bit system, if p is part of a dependency chain, then (p & 0x7) provides just the tag bits, and normally cannot even be legally dereferenced. Similarly, (p | ~0) normally cannot be legally dereferenced. - the "you can add/subtract integral values" rule still opens you up to language lawyers claiming (char *)ptr - (intptr_t)ptr preserving the dependency, which it clearly doesn't. But language-lawyering it does, since all those operations (cast to pointer, cast to integer, subtracting an integer) claim to be dependency-preserving operations. My thought was that the result of (char *)ptr - (intptr_t)ptr is a NULL pointer in most environments, and dereferencing a NULL pointer is undefined behavior. So it becomes irrelevant whether or not the NULL pointer carries a dependency. There are some stranger examples, such as (char *)ptr - ((intptr_t)ptr)/7, but in that case, if the resulting pointer happens by chance to reference valid memory, I believe a dependency would still be carried. Of course, if you are producing code like that, I am guessing that dependencies are the very least of your concerns. However, I will give this some more thought. So I think you want to limit the logical operators to things that don't mask off too many bits, and you should probably limit the add/subtract operations some way (maybe specify that the integer value you add/subtract cannot be related to the pointer). But I think limiting it to mostly pointer ops (and a _few_ integer operations to do offsets and remove tag bits) is otherwise a good approach. Glad you mostly like it! ;-) Thanx, Paul
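As a concrete illustration (my sketch, not code from the thread), masking with a small constant keeps only the tag bits and discards all of the significant address bits, which is why such a result is neither dereferenceable nor, under the proposed rules, dependency-carrying:

```c
#include <stdint.h>
#include <assert.h>

/* Extracting tag bits from a tagged pointer: the & 0x7 mask ANDs off
 * every significant address bit, so the result is just the tag
 * (0..7) and cannot legally be dereferenced. */
static uintptr_t tag_bits(void *p)
{
	return (uintptr_t)p & 0x7;
}
```

Because the compiler knows the result fits in three bits, it is free to turn later uses of it into conditionals, exactly the transformation Linus describes.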
Re: Compilers and RCU readers: Once more unto the breach!
On Tue, May 19, 2015 at 07:10:12PM -0700, Linus Torvalds wrote: On Tue, May 19, 2015 at 6:57 PM, Linus Torvalds torva...@linux-foundation.org wrote: - the "you can add/subtract integral values" rule still opens you up to language lawyers claiming (char *)ptr - (intptr_t)ptr preserving the dependency, which it clearly doesn't. But language-lawyering it does, since all those operations (cast to pointer, cast to integer, subtracting an integer) claim to be dependency-preserving operations. So I think you want to limit the logical operators to things that don't mask off too many bits, and you should probably limit the add/subtract operations some way (maybe specify that the integer value you add/subtract cannot be related to the pointer). Actually, "not related" doesn't work. For some buddy allocator thing, you very much might want some of the bits to be related. Good point, you could do the buddy-allocator computations with add and subtract instead of exclusive OR. So I think you're better off just saying that operations designed to drop significant bits break the dependency chain, and give things like & 1 and (char *)ptr-(uintptr_t)ptr as examples of such. Making that just an extension of your existing & 0 language would seem to be natural. Works for me! I added the following bullet to the list of things that break dependencies: If a pointer is part of a dependency chain, and if the values added to or subtracted from that pointer cancel the pointer value so as to allow the compiler to precisely determine the resulting value, then the resulting value will not be part of any dependency chain. For example, if p is part of a dependency chain, then ((char *)p-(uintptr_t)p)+65536 will not be. Seem reasonable? Humans will understand, and compiler writers won't care.
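The cancellation case in the new bullet can be modeled as follows. This is my sketch, done with integer arithmetic rather than the literal (char *)p-(uintptr_t)p expression, to sidestep the pointer-arithmetic undefined behavior while showing the same cancellation:

```c
#include <stdint.h>
#include <assert.h>

/* The result syntactically mentions p, but the subtraction cancels
 * p exactly, so the compiler can compute 65536 without ever looking
 * at the loaded pointer value -- the dependency is gone. */
static uintptr_t cancelled(void *p)
{
	uintptr_t v = (uintptr_t)p;	/* the cast alone still carries the chain */
	return (v - v) + 65536;		/* cancellation breaks it */
}
```

Since cancelled() returns the same value for every input, any ordering the source appears to request through p cannot survive into the generated code.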
They will either depend on hardware semantics anyway (and argue that your language is tight enough that they don't need to do anything special) or they will turn the consume into an acquire (on platforms that have too weak hardware). Agreed. Plus Core Working Group will hammer out the exact wording, should this approach meet their approval. Thanx, Paul
Compilers and RCU readers: Once more unto the breach!
Hello! Following up on last year's discussion (https://lwn.net/Articles/586838/, https://lwn.net/Articles/588300/), I believe that we have a solution. If I am wrong, I am sure you all will let me know, and in great detail. ;-) The key simplification is to just say no to RCU-protected array indexes: https://lkml.org/lkml/2015/5/12/827, as was suggested by several people. This simplification means that rcu_dereference (AKA memory_order_consume) need only return pointers. This in turn avoids things like (x-x), (x*0), and (x%1), because if x is a pointer, these expressions either return non-pointers or are compilation errors. With a very few exceptions, dependency chains can lead -to- non-pointers, but cannot pass -through- them. The result is that dependencies are carried only by operations for which the compiler cannot easily optimize away those dependencies, these operations including simple assignment, integer offset (including indexing), dereferencing, casts, passing as a function argument, return values from functions, and so on. A complete list with commentary starts on page 28 of: http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf Dependency chains are broken if a pointer compares equal to some other pointer not part of the same dependency chain, if too many bits are ORed onto or ANDed off of an intptr_t or uintptr_t, or if the dependency is explicitly killed (which should now strictly speaking never be necessary, but which might allow better diagnostics). These are set out in more detail on page 30 of the above PDF. This covers all the uses in the Linux kernel that I am aware of without any source-code changes (other than to the rcu_dereference() primitives themselves) and should also work for compilers and standards. Thoughts? Thanx, Paul
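In C11 terms, the intended usage pattern is roughly the following sketch (the names gp, struct foo, and reader() are mine, not from the proposal): the chain runs from the consume load through assignment and dereference, and the NULL comparison is harmless because the NULL branch never dereferences the pointer.

```c
#include <stdatomic.h>
#include <stddef.h>

struct foo {
	int a;
};

static _Atomic(struct foo *) gp;	/* hypothetical published pointer */

/* The dependency chain runs from the memory_order_consume load
 * through simple assignment and dereference -- operations the
 * compiler cannot easily optimize away. */
static int reader(void)
{
	struct foo *p = atomic_load_explicit(&gp, memory_order_consume);

	if (p == NULL)
		return -1;	/* no dereference on this branch */
	return p->a;		/* ordered after the load by the chain */
}
```

Note that current compilers implement memory_order_consume by promoting it to memory_order_acquire, so this sketch is correct either way; the proposal is about letting the dependency itself provide the ordering.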
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Mar 07, 2014 at 06:45:57PM +0100, Torvald Riegel wrote: On Wed, 2014-03-05 at 10:01 -0800, Paul E. McKenney wrote: On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote: On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote: On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: +o Do not use the results from the boolean && and || when + dereferencing. For example, the following (rather improbable) + code is buggy: + + int a[2]; + int index; + int force_zero_index = 1; + + ... + + r1 = rcu_dereference(i1); + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ + + The reason this is buggy is that && and || are often compiled + using branches. While weak-memory machines such as ARM or PowerPC + do order stores after such branches, they can speculate loads, + which can result in misordering bugs. + +o Do not use the results from relational operators (==, !=, + >, >=, <, or <=) when dereferencing. For example, + the following (quite strange) code is buggy: + + int a[2]; + int index; + int flip_index = 0; + + ... + + r1 = rcu_dereference(i1); + r2 = a[r1 != flip_index]; /* BUGGY!!! */ + + As before, the reason this is buggy is that relational operators + are often compiled using branches.
And as before, although + weak-memory machines such as ARM or PowerPC do order stores + after such branches, but can speculate loads, which can again + result in misordering bugs. Those two would be allowed by the wording I have recently proposed, AFAICS. r1 != flip_index would result in two possible values (unless there are further constraints due to the type of r1 and the values that flip_index can have). And I am OK with the value_dep_preserving type providing more/better guarantees than we get by default from current compilers. One question, though. Suppose that the code did not want a value dependency to be tracked through a comparison operator. What does the developer do in that case? (The reason I ask is that I have not yet found a use case in the Linux kernel that expects a value dependency to be tracked through a comparison.) Hmm. I suppose use an explicit cast to non-vdp before or after the comparison? That should work well assuming that things like if, while, and ?: conditions are happy to take a vdp. I currently don't see a reason why that should be disallowed. If we have allowed an implicit conversion to non-vdp, I believe that should follow. I am a bit nervous about a silent implicit conversion from vdp to non-vdp in the general case. Why are you nervous about it? If someone expects the vdp to propagate into some function that might be compiled with aggressive optimizations that break this expectation, it would be good for that someone to know about it. Ah! I am assuming that the compiler is -not- emitting memory barriers at vdp-to-non-vdp transitions. In that case, warnings are even more important -- without the warnings, it is a real pain chasing these unnecessary memory barriers out of the code. So we are -not- in the business of emitting memory barriers on vdp-to-non-vdp transitions, right? However, when the result is being used by a conditional, the silent implicit conversion makes a lot of sense. 
Is that distinction something that the compiler can handle easily? I think so. I'm not a language lawyer, but we have other such conversions in the standard (e.g., int to boolean, between int and float) and I currently don't see a fundamental difference to those. But we'll have to ask
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Mar 07, 2014 at 07:33:25PM +0100, Torvald Riegel wrote: On Wed, 2014-03-05 at 10:15 -0800, Paul E. McKenney wrote: On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote: On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote: On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote: On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA) On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: +oDo not use the results from the boolean and || when + dereferencing. For example, the following (rather improbable) + code is buggy: + + int a[2]; + int index; + int force_zero_index = 1; + + ... + + r1 = rcu_dereference(i1) + r2 = a[r1 force_zero_index]; /* BUGGY!!! */ + + The reason this is buggy is that and || are often compiled + using branches. While weak-memory machines such as ARM or PowerPC + do order stores after such branches, they can speculate loads, + which can result in misordering bugs. + +oDo not use the results from relational operators (==, !=, + , =, , or =) when dereferencing. For example, + the following (quite strange) code is buggy: + + int a[2]; + int index; + int flip_index = 0; + + ... + + r1 = rcu_dereference(i1) + r2 = a[r1 != flip_index]; /* BUGGY!!! */ + + As before, the reason this is buggy is that relational operators + are often compiled using branches. And as before, although + weak-memory machines such as ARM or PowerPC do order stores + after such branches, but can speculate loads, which can again + result in misordering bugs. Those two would be allowed by the wording I have recently proposed, AFAICS. 
r1 != flip_index would result in two possible values (unless there are further constraints due to the type of r1 and the values that flip_index can have). And I am OK with the value_dep_preserving type providing more/better guarantees than we get by default from current compilers. One question, though. Suppose that the code did not want a value dependency to be tracked through a comparison operator. What does the developer do in that case? (The reason I ask is that I have not yet found a use case in the Linux kernel that expects a value dependency to be tracked through a comparison.) Hmm. I suppose use an explicit cast to non-vdp before or after the comparison? That should work well assuming that things like if, while, and ?: conditions are happy to take a vdp. This assumes that p->a only returns vdp if field a is declared vdp, otherwise we have vdps running wild through the program. ;-) The other thing that can happen is that a vdp can get handed off to another synchronization mechanism, for example, to reference counting: p = atomic_load_explicit(gp, memory_order_consume); if (do_something_with(p->a)) { /* fast path protected by RCU. */ return 0; } if (atomic_inc_not_zero(&p->refcnt)) { /* slow path protected by reference counting. */ return do_something_else_with((struct foo *)p); /* CHANGE */ } /* Needed slow path, but raced with deletion. */ return -EAGAIN; I am guessing that the cast ends the vdp. Is that the case? And here is a more elaborate example from the Linux kernel: struct md_rdev value_dep_preserving *rdev; /* CHANGE */ rdev = rcu_dereference(conf->mirrors[disk].rdev); if (r1_bio->bios[disk] == IO_BLOCKED || rdev == NULL || test_bit(Unmerged, &rdev->flags) || test_bit(Faulty, &rdev->flags)) continue; The fact that the rdev == NULL returns vdp does not force
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote: xagsmtp3.20140305162928.8...@uk1vsc.vnet.ibm.com X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP3 at UK1VSC) On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote: On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA) On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: +o Do not use the results from the boolean and || when + dereferencing. For example, the following (rather improbable) + code is buggy: + + int a[2]; + int index; + int force_zero_index = 1; + + ... + + r1 = rcu_dereference(i1) + r2 = a[r1 force_zero_index]; /* BUGGY!!! */ + + The reason this is buggy is that and || are often compiled + using branches. While weak-memory machines such as ARM or PowerPC + do order stores after such branches, they can speculate loads, + which can result in misordering bugs. + +o Do not use the results from relational operators (==, !=, + , =, , or =) when dereferencing. For example, + the following (quite strange) code is buggy: + + int a[2]; + int index; + int flip_index = 0; + + ... + + r1 = rcu_dereference(i1) + r2 = a[r1 != flip_index]; /* BUGGY!!! */ + + As before, the reason this is buggy is that relational operators + are often compiled using branches. And as before, although + weak-memory machines such as ARM or PowerPC do order stores + after such branches, but can speculate loads, which can again + result in misordering bugs. Those two would be allowed by the wording I have recently proposed, AFAICS. 
r1 != flip_index would result in two possible values (unless there are further constraints due to the type of r1 and the values that flip_index can have). And I am OK with the value_dep_preserving type providing more/better guarantees than we get by default from current compilers. One question, though. Suppose that the code did not want a value dependency to be tracked through a comparison operator. What does the developer do in that case? (The reason I ask is that I have not yet found a use case in the Linux kernel that expects a value dependency to be tracked through a comparison.) Hmm. I suppose use an explicit cast to non-vdp before or after the comparison? That should work well assuming that things like if, while, and ?: conditions are happy to take a vdp. I currently don't see a reason why that should be disallowed. If we have allowed an implicit conversion to non-vdp, I believe that should follow. I am a bit nervous about a silent implicit conversion from vdp to non-vdp in the general case. However, when the result is being used by a conditional, the silent implicit conversion makes a lot of sense. Is that distinction something that the compiler can handle easily? On the other hand, silent implicit conversion from non-vdp to vdp is very useful for common code that can be invoked both by RCU readers and by updaters. ?: could be somewhat special, in that the type depends on the 2nd and 3rd operand. Thus, vdp x = non-vdp ? vdp : vdp; should be allowed, whereas vdp x = non-vdp ? non-vdp : vdp; probably should be disallowed if we don't provide for implicit casts from non-vdp to vdp. Actually, from the Linux-kernel code that I am seeing, we want to be able to silently convert from non-vdp to vdp in order to permit common code that is invoked from both RCU readers (vdp) and updaters (often non-vdp). This common code must be compiled conservatively to allow vdp, but should be just find with non-vdp. Going through the combinations... 0. vdp x = vdp ? 
vdp : vdp; /* OK, matches. */ 1. vdp x = vdp ? vdp : non-vdp; /* Silent conversion. */ 2. vdp x = vdp ? non-vdp : vdp; /* Silent conversion. */ 3. vdp x = vdp ? non-vdp : non-vdp; /* Silent conversion. */ 4. vdp x = non-vdp ? vdp : vdp; /* OK, matches. */ 5. vdp x = non-vdp ? vdp : non-vdp; /* Silent conversion. */ 6. vdp x = non-vdp ? non-vdp : vdp; /* Silent conversion. */ 7. vdp x = non-vdp ? non-vdp : non-vdp; /* Silent conversion. */ 8. non-vdp x = vdp ? vdp : vdp
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote: On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote: On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote: On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA) On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: +oDo not use the results from the boolean and || when + dereferencing. For example, the following (rather improbable) + code is buggy: + + int a[2]; + int index; + int force_zero_index = 1; + + ... + + r1 = rcu_dereference(i1) + r2 = a[r1 force_zero_index]; /* BUGGY!!! */ + + The reason this is buggy is that and || are often compiled + using branches. While weak-memory machines such as ARM or PowerPC + do order stores after such branches, they can speculate loads, + which can result in misordering bugs. + +oDo not use the results from relational operators (==, !=, + , =, , or =) when dereferencing. For example, + the following (quite strange) code is buggy: + + int a[2]; + int index; + int flip_index = 0; + + ... + + r1 = rcu_dereference(i1) + r2 = a[r1 != flip_index]; /* BUGGY!!! */ + + As before, the reason this is buggy is that relational operators + are often compiled using branches. And as before, although + weak-memory machines such as ARM or PowerPC do order stores + after such branches, but can speculate loads, which can again + result in misordering bugs. Those two would be allowed by the wording I have recently proposed, AFAICS. r1 != flip_index would result in two possible values (unless there are further constraints due to the type of r1 and the values that flip_index can have). 
And I am OK with the value_dep_preserving type providing more/better guarantees than we get by default from current compilers. One question, though. Suppose that the code did not want a value dependency to be tracked through a comparison operator. What does the developer do in that case? (The reason I ask is that I have not yet found a use case in the Linux kernel that expects a value dependency to be tracked through a comparison.) Hmm. I suppose use an explicit cast to non-vdp before or after the comparison? That should work well assuming that things like if, while, and ?: conditions are happy to take a vdp. This assumes that p-a only returns vdp if field a is declared vdp, otherwise we have vdps running wild through the program. ;-) The other thing that can happen is that a vdp can get handed off to another synchronization mechanism, for example, to reference counting: p = atomic_load_explicit(gp, memory_order_consume); if (do_something_with(p-a)) { /* fast path protected by RCU. */ return 0; } if (atomic_inc_not_zero(p-refcnt) { /* slow path protected by reference counting. */ return do_something_else_with((struct foo *)p); /* CHANGE */ } /* Needed slow path, but raced with deletion. */ return -EAGAIN; I am guessing that the cast ends the vdp. Is that the case? And here is a more elaborate example from the Linux kernel: struct md_rdev value_dep_preserving *rdev; /* CHANGE */ rdev = rcu_dereference(conf-mirrors[disk].rdev); if (r1_bio-bios[disk] == IO_BLOCKED || rdev == NULL || test_bit(Unmerged, rdev-flags) || test_bit(Faulty, rdev-flags)) continue; The fact that the rdev == NULL returns vdp does not force the || operators to be evaluated arithmetically because the entire function is an if condition, correct? That's a good question, and one that as far as I understand currently, essentially boils down to whether we want to have tight restrictions on which operations are still vdp. 
If we look at the different combinations, then it seems we can't decide on whether we have a value-dependency just due to a vdp type: * non-vdp || vdp: vdp iff non-vdp == false * vdp || non-vdp: vdp iff non-vdp == false? * vdp || vdp: always vdp? (and dependency on both?) I'm not sure it makes sense to try to not make
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA) On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: +o Do not use the results from the boolean and || when + dereferencing. For example, the following (rather improbable) + code is buggy: + + int a[2]; + int index; + int force_zero_index = 1; + + ... + + r1 = rcu_dereference(i1) + r2 = a[r1 force_zero_index]; /* BUGGY!!! */ + + The reason this is buggy is that and || are often compiled + using branches. While weak-memory machines such as ARM or PowerPC + do order stores after such branches, they can speculate loads, + which can result in misordering bugs. + +o Do not use the results from relational operators (==, !=, + , =, , or =) when dereferencing. For example, + the following (quite strange) code is buggy: + + int a[2]; + int index; + int flip_index = 0; + + ... + + r1 = rcu_dereference(i1) + r2 = a[r1 != flip_index]; /* BUGGY!!! */ + + As before, the reason this is buggy is that relational operators + are often compiled using branches. And as before, although + weak-memory machines such as ARM or PowerPC do order stores + after such branches, but can speculate loads, which can again + result in misordering bugs. Those two would be allowed by the wording I have recently proposed, AFAICS. r1 != flip_index would result in two possible values (unless there are further constraints due to the type of r1 and the values that flip_index can have). And I am OK with the value_dep_preserving type providing more/better guarantees than we get by default from current compilers. One question, though. 
Suppose that the code did not want a value dependency to be tracked through a comparison operator. What does the developer do in that case? (The reason I ask is that I have not yet found a use case in the Linux kernel that expects a value dependency to be tracked through a comparison.) Hmm. I suppose use an explicit cast to non-vdp before or after the comparison? That should work well assuming that things like if, while, and ?: conditions are happy to take a vdp. This assumes that p->a only returns vdp if field a is declared vdp, otherwise we have vdps running wild through the program. ;-) The other thing that can happen is that a vdp can get handed off to another synchronization mechanism, for example, to reference counting: p = atomic_load_explicit(gp, memory_order_consume); if (do_something_with(p->a)) { /* fast path protected by RCU. */ return 0; } if (atomic_inc_not_zero(&p->refcnt)) { /* slow path protected by reference counting. */ return do_something_else_with((struct foo *)p); /* CHANGE */ } /* Needed slow path, but raced with deletion. */ return -EAGAIN; I am guessing that the cast ends the vdp. Is that the case? Thanx, Paul
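For reference, the kernel's atomic_inc_not_zero() used in the handoff above can be sketched in C11 atomics as follows. This is a simplified model of mine, not the kernel's implementation: increment a reference count unless it has already dropped to zero, and report failure (a race with deletion) otherwise.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of atomic_inc_not_zero(): take a reference only if the
 * object is still live.  Returning false means the count was
 * already zero, i.e. the object raced with deletion. */
static bool atomic_inc_not_zero(atomic_int *refcnt)
{
	int old = atomic_load_explicit(refcnt, memory_order_relaxed);

	do {
		if (old == 0)
			return false;	/* already dead: do not resurrect */
	} while (!atomic_compare_exchange_weak_explicit(
			refcnt, &old, old + 1,
			memory_order_acquire, memory_order_relaxed));
	return true;
}
```

The acquire ordering on success is what lets the caller switch from RCU protection to reference-count protection, which is exactly the point where the vdp question above arises.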
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote: On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: xagsmtp2.20140303204700.3...@vmsdvma.vnet.ibm.com X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA) On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: xagsmtp2.20140303190831.9...@uk1vsc.vnet.ibm.com X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: +oDo not use the results from the boolean and || when + dereferencing. For example, the following (rather improbable) + code is buggy: + + int a[2]; + int index; + int force_zero_index = 1; + + ... + + r1 = rcu_dereference(i1) + r2 = a[r1 force_zero_index]; /* BUGGY!!! */ + + The reason this is buggy is that and || are often compiled + using branches. While weak-memory machines such as ARM or PowerPC + do order stores after such branches, they can speculate loads, + which can result in misordering bugs. + +oDo not use the results from relational operators (==, !=, + , =, , or =) when dereferencing. For example, + the following (quite strange) code is buggy: + + int a[2]; + int index; + int flip_index = 0; + + ... + + r1 = rcu_dereference(i1) + r2 = a[r1 != flip_index]; /* BUGGY!!! */ + + As before, the reason this is buggy is that relational operators + are often compiled using branches. And as before, although + weak-memory machines such as ARM or PowerPC do order stores + after such branches, but can speculate loads, which can again + result in misordering bugs. Those two would be allowed by the wording I have recently proposed, AFAICS. r1 != flip_index would result in two possible values (unless there are further constraints due to the type of r1 and the values that flip_index can have). And I am OK with the value_dep_preserving type providing more/better guarantees than we get by default from current compilers. One question, though. 
Suppose that the code did not want a value dependency to be tracked through a comparison operator. What does the developer do in that case? (The reason I ask is that I have not yet found a use case in the Linux kernel that expects a value dependency to be tracked through a comparison.) Hmm. I suppose use an explicit cast to non-vdp before or after the comparison? That should work well assuming that things like if, while, and ?: conditions are happy to take a vdp. This assumes that p-a only returns vdp if field a is declared vdp, otherwise we have vdps running wild through the program. ;-) The other thing that can happen is that a vdp can get handed off to another synchronization mechanism, for example, to reference counting: p = atomic_load_explicit(gp, memory_order_consume); if (do_something_with(p-a)) { /* fast path protected by RCU. */ return 0; } if (atomic_inc_not_zero(p-refcnt) { /* slow path protected by reference counting. */ return do_something_else_with((struct foo *)p); /* CHANGE */ } /* Needed slow path, but raced with deletion. */ return -EAGAIN; I am guessing that the cast ends the vdp. Is that the case? And here is a more elaborate example from the Linux kernel: struct md_rdev value_dep_preserving *rdev; /* CHANGE */ rdev = rcu_dereference(conf-mirrors[disk].rdev); if (r1_bio-bios[disk] == IO_BLOCKED || rdev == NULL || test_bit(Unmerged, rdev-flags) || test_bit(Faulty, rdev-flags)) continue; The fact that the rdev == NULL returns vdp does not force the || operators to be evaluated arithmetically because the entire function is an if condition, correct? Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: +o Do not use the results from the boolean && and || when + dereferencing. For example, the following (rather improbable) + code is buggy: + + int a[2]; + int index; + int force_zero_index = 1; + + ... + + r1 = rcu_dereference(i1); + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ + + The reason this is buggy is that && and || are often compiled + using branches. While weak-memory machines such as ARM or PowerPC + do order stores after such branches, they can speculate loads, + which can result in misordering bugs. + +o Do not use the results from relational operators (==, !=, + >, >=, <, or <=) when dereferencing. For example, + the following (quite strange) code is buggy: + + int a[2]; + int index; + int flip_index = 0; + + ... + + r1 = rcu_dereference(i1); + r2 = a[r1 != flip_index]; /* BUGGY!!! */ + + As before, the reason this is buggy is that relational operators + are often compiled using branches. And as before, although + weak-memory machines such as ARM or PowerPC do order stores + after such branches, they can speculate loads, which can again + result in misordering bugs. Those two would be allowed by the wording I have recently proposed, AFAICS. r1 != flip_index would result in two possible values (unless there are further constraints due to the type of r1 and the values that flip_index can have). And I am OK with the value_dep_preserving type providing more/better guarantees than we get by default from current compilers. One question, though. Suppose that the code did not want a value dependency to be tracked through a comparison operator. What does the developer do in that case? (The reason I ask is that I have not yet found a use case in the Linux kernel that expects a value dependency to be tracked through a comparison.)
I don't think the wording is flawed. We could raise the requirement of having more than one value left for r1 to having more than N with N > 1 values left, but the fundamental problem remains in that a compiler could try to generate a (big) switch statement. Instead, I think that this indicates that the value_dep_preserving type modifier would be useful: It would tell the compiler that it shouldn't transform this into a branch in this case, yet allow that optimization for all other code. Understood! BTW, my current task is generating examples using the value_dep_preserving type for RCU-protected array indexes. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sun, Mar 02, 2014 at 04:05:52AM -0600, Peter Sewell wrote: On 1 March 2014 08:03, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote: Hi Paul, On 28 February 2014 18:50, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: 3. The comparison was against another RCU-protected pointer, where that other pointer was properly fetched using one of the RCU primitives. Here it doesn't matter which pointer you use. At least as long as the rcu_assign_pointer() for that other pointer happened after the last update to the pointed-to structure. I am a bit nervous about #3. Any thoughts on it? I think that it might be worth pointing out as an example, and saying that code like p = atomic_read(consume); X; q = atomic_read(consume); Y; if (p == q) data = p->val; then the access of p->val is constrained to be data-dependent on *either* p or q, but you can't really tell which, since the compiler can decide that the values are interchangeable. I cannot for the life of me come up with a situation where this would matter, though. If X contains a fence, then that fence will be a stronger ordering than anything the consume through p would guarantee anyway. And if X does *not* contain a fence, then the atomic reads of p and q are unordered *anyway*, so then whether the ordering to the access through p is through p or q is kind of irrelevant. No? I can make a contrived litmus test for it, but you are right, the only time you can see it happen is when X has no barriers, in which case you don't have any ordering anyway -- both the compiler and the CPU can reorder the loads into p and q, and the read from p->val can, as you say, come from either pointer.
For whatever it is worth, here is the litmus test: T1: p = kmalloc(...); if (p == NULL) deal_with_it(); p->a = 42; /* Each field in its own cache line. */ p->b = 43; p->c = 44; atomic_store_explicit(&gp1, p, memory_order_release); p->b = 143; p->c = 144; atomic_store_explicit(&gp2, p, memory_order_release); T2: p = atomic_load_explicit(&gp2, memory_order_consume); r1 = p->b; /* Guaranteed to get 143. */ q = atomic_load_explicit(&gp1, memory_order_consume); if (p == q) { /* The compiler decides that q->c is same as p->c. */ r2 = p->c; /* Could get 44 on a weakly ordered system. */ } The loads from gp1 and gp2 are, as you say, unordered, so you get what you get. And publishing a structure via one RCU-protected pointer, updating it, then publishing it via another pointer seems to me to be asking for trouble anyway. If you really want to do something like that and still see consistency across all the fields in the structure, please put a lock in the structure and use it to guard updates and accesses to those fields. And here is a patch documenting the restrictions for the current Linux kernel. The rules change a bit due to rcu_dereference() acting a bit differently than atomic_load_explicit(p, memory_order_consume). Thoughts? That might serve as informal documentation for Linux kernel programmers about the bounds on the optimisations that you expect compilers to do for common-case RCU code - and I guess that's what you intend it to be for. But I don't see how one can make it precise enough to serve as a language definition, so that compiler people could confidently say yes, we respect that, which I guess is what you really need. As a useful criterion, we should aim for something precise enough that in a verified-compiler context you can mathematically prove that the compiler will satisfy it (even though that won't happen anytime soon for GCC), and that analysis tool authors can actually know what they're working with.
All this stuff about you should avoid cancellation, and avoid masking with just a small number of bits is just too vague. Understood, and yes, this is intended to document current compiler behavior for the Linux kernel community. It would not make sense to show it to the C11 or C++11 communities, except perhaps as an informational piece on current practice. The basic problem is that the compiler may be doing sophisticated reasoning with a bunch of non-local knowledge that it's deduced from the code, neither of which are well-understood, and here we have to identify some envelope
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sun, Mar 02, 2014 at 11:44:52PM +0000, Peter Sewell wrote: On 2 March 2014 23:20, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Sun, Mar 02, 2014 at 04:05:52AM -0600, Peter Sewell wrote: On 1 March 2014 08:03, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote: Hi Paul, On 28 February 2014 18:50, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: 3. The comparison was against another RCU-protected pointer, where that other pointer was properly fetched using one of the RCU primitives. Here it doesn't matter which pointer you use. At least as long as the rcu_assign_pointer() for that other pointer happened after the last update to the pointed-to structure. I am a bit nervous about #3. Any thoughts on it? I think that it might be worth pointing out as an example, and saying that code like p = atomic_read(consume); X; q = atomic_read(consume); Y; if (p == q) data = p->val; then the access of p->val is constrained to be data-dependent on *either* p or q, but you can't really tell which, since the compiler can decide that the values are interchangeable. I cannot for the life of me come up with a situation where this would matter, though. If X contains a fence, then that fence will be a stronger ordering than anything the consume through p would guarantee anyway. And if X does *not* contain a fence, then the atomic reads of p and q are unordered *anyway*, so then whether the ordering to the access through p is through p or q is kind of irrelevant. No?
I can make a contrived litmus test for it, but you are right, the only time you can see it happen is when X has no barriers, in which case you don't have any ordering anyway -- both the compiler and the CPU can reorder the loads into p and q, and the read from p->val can, as you say, come from either pointer. For whatever it is worth, here is the litmus test: T1: p = kmalloc(...); if (p == NULL) deal_with_it(); p->a = 42; /* Each field in its own cache line. */ p->b = 43; p->c = 44; atomic_store_explicit(&gp1, p, memory_order_release); p->b = 143; p->c = 144; atomic_store_explicit(&gp2, p, memory_order_release); T2: p = atomic_load_explicit(&gp2, memory_order_consume); r1 = p->b; /* Guaranteed to get 143. */ q = atomic_load_explicit(&gp1, memory_order_consume); if (p == q) { /* The compiler decides that q->c is same as p->c. */ r2 = p->c; /* Could get 44 on a weakly ordered system. */ } The loads from gp1 and gp2 are, as you say, unordered, so you get what you get. And publishing a structure via one RCU-protected pointer, updating it, then publishing it via another pointer seems to me to be asking for trouble anyway. If you really want to do something like that and still see consistency across all the fields in the structure, please put a lock in the structure and use it to guard updates and accesses to those fields. And here is a patch documenting the restrictions for the current Linux kernel. The rules change a bit due to rcu_dereference() acting a bit differently than atomic_load_explicit(p, memory_order_consume). Thoughts? That might serve as informal documentation for Linux kernel programmers about the bounds on the optimisations that you expect compilers to do for common-case RCU code - and I guess that's what you intend it to be for. But I don't see how one can make it precise enough to serve as a language definition, so that compiler people could confidently say yes, we respect that, which I guess is what you really need.
As a useful criterion, we should aim for something precise enough that in a verified-compiler context you can mathematically prove that the compiler will satisfy it (even though that won't happen anytime soon for GCC), and that analysis tool authors can actually know what they're working with. All this stuff about you should avoid cancellation, and avoid masking with just a small number of bits is just too vague. Understood, and yes, this is intended to document current compiler behavior for the Linux kernel community. It would not make sense to show it to the C11 or C++11 communities
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote: Hi Paul, On 28 February 2014 18:50, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: 3. The comparison was against another RCU-protected pointer, where that other pointer was properly fetched using one of the RCU primitives. Here it doesn't matter which pointer you use. At least as long as the rcu_assign_pointer() for that other pointer happened after the last update to the pointed-to structure. I am a bit nervous about #3. Any thoughts on it? I think that it might be worth pointing out as an example, and saying that code like p = atomic_read(consume); X; q = atomic_read(consume); Y; if (p == q) data = p->val; then the access of p->val is constrained to be data-dependent on *either* p or q, but you can't really tell which, since the compiler can decide that the values are interchangeable. I cannot for the life of me come up with a situation where this would matter, though. If X contains a fence, then that fence will be a stronger ordering than anything the consume through p would guarantee anyway. And if X does *not* contain a fence, then the atomic reads of p and q are unordered *anyway*, so then whether the ordering to the access through p is through p or q is kind of irrelevant. No? I can make a contrived litmus test for it, but you are right, the only time you can see it happen is when X has no barriers, in which case you don't have any ordering anyway -- both the compiler and the CPU can reorder the loads into p and q, and the read from p->val can, as you say, come from either pointer. For whatever it is worth, here is the litmus test: T1: p = kmalloc(...); if (p == NULL) deal_with_it(); p->a = 42; /* Each field in its own cache line. */ p->b = 43; p->c = 44; atomic_store_explicit(&gp1, p, memory_order_release); p->b = 143; p->c = 144; atomic_store_explicit(&gp2, p, memory_order_release); T2: p = atomic_load_explicit(&gp2, memory_order_consume); r1 = p->b; /* Guaranteed to get 143. */ q = atomic_load_explicit(&gp1, memory_order_consume); if (p == q) { /* The compiler decides that q->c is same as p->c. */ r2 = p->c; /* Could get 44 on a weakly ordered system. */ } The loads from gp1 and gp2 are, as you say, unordered, so you get what you get. And publishing a structure via one RCU-protected pointer, updating it, then publishing it via another pointer seems to me to be asking for trouble anyway. If you really want to do something like that and still see consistency across all the fields in the structure, please put a lock in the structure and use it to guard updates and accesses to those fields. And here is a patch documenting the restrictions for the current Linux kernel. The rules change a bit due to rcu_dereference() acting a bit differently than atomic_load_explicit(p, memory_order_consume). Thoughts? That might serve as informal documentation for Linux kernel programmers about the bounds on the optimisations that you expect compilers to do for common-case RCU code - and I guess that's what you intend it to be for. But I don't see how one can make it precise enough to serve as a language definition, so that compiler people could confidently say yes, we respect that, which I guess is what you really need. As a useful criterion, we should aim for something precise enough that in a verified-compiler context you can mathematically prove that the compiler will satisfy it (even though that won't happen anytime soon for GCC), and that analysis tool authors can actually know what they're working with. All this stuff about you should avoid cancellation, and avoid masking with just a small number of bits is just too vague.
Understood, and yes, this is intended to document current compiler behavior for the Linux kernel community. It would not make sense to show it to the C11 or C++11 communities, except perhaps as an informational piece on current practice. The basic problem is that the compiler may be doing sophisticated reasoning with a bunch of non-local knowledge that it's deduced from the code, neither of which are well-understood, and here we have to identify some envelope, expressive enough for RCU idioms, in which that reasoning doesn't allow data/address dependencies to be removed (and hence the hardware guarantee about them will be maintained at the source level). The C11 syntactic notion of dependency, whatever its faults
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: 3. The comparison was against another RCU-protected pointer, where that other pointer was properly fetched using one of the RCU primitives. Here it doesn't matter which pointer you use. At least as long as the rcu_assign_pointer() for that other pointer happened after the last update to the pointed-to structure. I am a bit nervous about #3. Any thoughts on it? I think that it might be worth pointing out as an example, and saying that code like p = atomic_read(consume); X; q = atomic_read(consume); Y; if (p == q) data = p->val; then the access of p->val is constrained to be data-dependent on *either* p or q, but you can't really tell which, since the compiler can decide that the values are interchangeable. I cannot for the life of me come up with a situation where this would matter, though. If X contains a fence, then that fence will be a stronger ordering than anything the consume through p would guarantee anyway. And if X does *not* contain a fence, then the atomic reads of p and q are unordered *anyway*, so then whether the ordering to the access through p is through p or q is kind of irrelevant. No? I can make a contrived litmus test for it, but you are right, the only time you can see it happen is when X has no barriers, in which case you don't have any ordering anyway -- both the compiler and the CPU can reorder the loads into p and q, and the read from p->val can, as you say, come from either pointer. For whatever it is worth, here is the litmus test: T1: p = kmalloc(...); if (p == NULL) deal_with_it(); p->a = 42; /* Each field in its own cache line. */ p->b = 43; p->c = 44; atomic_store_explicit(&gp1, p, memory_order_release); p->b = 143; p->c = 144; atomic_store_explicit(&gp2, p, memory_order_release); T2: p = atomic_load_explicit(&gp2, memory_order_consume); r1 = p->b; /* Guaranteed to get 143. */ q = atomic_load_explicit(&gp1, memory_order_consume); if (p == q) { /* The compiler decides that q->c is same as p->c. */ r2 = p->c; /* Could get 44 on a weakly ordered system. */ } The loads from gp1 and gp2 are, as you say, unordered, so you get what you get. And publishing a structure via one RCU-protected pointer, updating it, then publishing it via another pointer seems to me to be asking for trouble anyway. If you really want to do something like that and still see consistency across all the fields in the structure, please put a lock in the structure and use it to guard updates and accesses to those fields. And here is a patch documenting the restrictions for the current Linux kernel. The rules change a bit due to rcu_dereference() acting a bit differently than atomic_load_explicit(p, memory_order_consume). Thoughts? Thanx, Paul documentation: Record rcu_dereference() value mishandling Recent LKML discussions (see http://lwn.net/Articles/586838/ and http://lwn.net/Articles/588300/ for the LWN writeups) brought out some ways of misusing the return value from rcu_dereference() that are not necessarily completely intuitive. This commit therefore documents what can and cannot safely be done with these values. Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX index fa57139f50bf..f773a264ae02 100644 --- a/Documentation/RCU/00-INDEX +++ b/Documentation/RCU/00-INDEX @@ -12,6 +12,8 @@ lockdep-splat.txt - RCU Lockdep splats explained.
NMI-RCU.txt - Using RCU to Protect Dynamic NMI Handlers +rcu_dereference.txt + - Proper care and feeding of return values from rcu_dereference() rcubarrier.txt - RCU and Unloadable Modules rculist_nulls.txt diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt index 9d10d1db16a5..877947130ebe 100644 --- a/Documentation/RCU/checklist.txt +++ b/Documentation/RCU/checklist.txt @@ -114,12 +114,16 @@ over a rather long period of time, but improvements are always welcome! http://www.openvms.compaq.com/wizard/wiz_2637.html The rcu_dereference() primitive is also an excellent - documentation aid, letting the person reading the code - know exactly which pointers are protected by RCU. + documentation aid, letting the person reading the + code know exactly which pointers are protected by RCU. Please note
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 27, 2014 at 04:37:33PM +0100, Torvald Riegel wrote: On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote: On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: Good points. How about the following replacements? 3. Adding or subtracting an integer to/from a chained pointer results in another chained pointer in that same pointer chain. The results of addition and subtraction operations that cancel the chained pointer's value (for example, p - (long)p where p is a pointer to char) are implementation defined. 4. Bitwise operators (&, |, ^, and I suppose also ~) applied to a chained pointer and an integer for the purposes of alignment and pointer translation results in another chained pointer in that same pointer chain. Other uses of bitwise operators on chained pointers (for example, p | ~0) are implementation defined. Quite frankly, I think all of this language that is about the actual operations is irrelevant and wrong. It's not going to help compiler writers, and it sure isn't going to help users that read this. Why not just talk about value chains and that any operations that restrict the value range severely end up breaking the chain. There is no point in listing the operations individually, because every single operation *can* restrict things. Listing individual operations and dependencies is just fundamentally wrong. [...] The *only* thing that matters for all of them is whether they are value-preserving, or whether they drop so much information that the compiler might decide to use a control dependency instead. That's true for every single one of them. Similarly, actual true control dependencies that limit the problem space sufficiently that the actual pointer value no longer has significant information in it (see the above example) are also things that remove information to the point that only a control dependency remains.
Even when the value itself is not modified in any way at all. I agree that just considering syntactic properties of the program seems to be insufficient. Making it instead depend on whether there is a semantic dependency due to a value being necessary to compute a result seems better. However, whether a value is necessary might not be obvious, and I understand Paul's argument that he does not want to have to reason about all potential compiler optimizations. Thus, I believe we need to specify when a value is necessary. I have a suggestion for a somewhat different formulation of the feature that you seem to have in mind, which I'll discuss below. Excuse the verbosity of the following, but I'd rather like to avoid misunderstandings than save a few words. Thank you very much for putting this forward! I must confess that I was stuck, and my earlier attempt now enshrined in the C11 and C++11 standards is quite clearly way bogus. One possible saving grace: From discussions at the standards committee meeting a few weeks ago, there is some chance that the committee will be willing to do a rip-and-replace on the current memory_order_consume wording, without provisions for backwards compatibility with the current bogosity. What we'd like to capture is that a value originating from a mo_consume load is necessary for a computation (e.g., it cannot be replaced with value predictions and/or control dependencies); if that's the case in the program, we can reasonably assume that a compiler implementation will transform this into a data dependency, which will then lead to ordering guarantees by the HW. However, we need to specify when a value is necessary. We could say that this is implementation-defined, and use a set of litmus tests (e.g., like those discussed in the thread) to roughly carve out what a programmer could expect.
This may even be practical for a project like the Linux kernel that follows strict project-internal rules and pays a lot of attention to what the particular implementations of compilers expected to compile the kernel are doing. However, I think this approach would be too vague for the standard and for many other programs/projects. I agree that a number of other projects would have more need for this than might the kernel. Please understand that this is in no way denigrating the intelligence of other projects' members. It is just that many of them have only recently started seriously thinking about concurrency. In contrast, the Linux kernel community has been doing concurrency since the mid-1990s. Projects with less experience with concurrency will probably need more help, from the compiler and from elsewhere as well. Your proposal looks quite promising at first glance. But rather than try and comment on it immediately, I am going to take a number
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 27, 2014 at 09:01:40AM -0800, Linus Torvalds wrote: On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel trie...@redhat.com wrote: I agree that just considering syntactic properties of the program seems to be insufficient. Making it instead depend on whether there is a semantic dependency due to a value being necessary to compute a result seems better. However, whether a value is necessary might not be obvious, and I understand Paul's argument that he does not want to have to reason about all potential compiler optimizations. Thus, I believe we need to specify when a value is necessary. I suspect it's hard to really strictly define, but at the same time I actually think that compiler writers (and users, for that matter) have little problem understanding the concept and intent. I do think that listing operations might be useful to give good examples of what is a necessary value, and - perhaps more importantly - what can break the value from being necessary. Especially the gotchas. I have a suggestion for a somewhat different formulation of the feature that you seem to have in mind, which I'll discuss below. Excuse the verbosity of the following, but I'd rather like to avoid misunderstandings than save a few words. Ok, I'm going to cut most of the verbiage since it's long and I'm not commenting on most of it. But Based on these thoughts, we could specify the new mo_consume guarantees roughly as follows: An evaluation E (in an execution) has a value dependency to an atomic and mo_consume load L (in an execution) iff: * L's type holds more than one value (ruling out constants etc.), * L is sequenced-before E, * L's result is used by the abstract machine to compute E, * E is value-dependency-preserving code (defined below), and * at the time of execution of E, L can possibly have returned at least two different values under the assumption that L itself could have returned any value allowed by L's type. 
If a memory access A's targeted memory location has a value dependency on a mo_consume load L, and an action X inter-thread-happens-before L, then X happens-before A. I think this mostly works. Regarding the latter, we make a fresh start at each mo_consume load (ie, we assume we know nothing -- L could have returned any possible value); I believe this is easier to reason about than other scopes like function granularities (what happens on inlining?), or translation units. It should also be simple to implement for compilers, and would hopefully not constrain optimization too much. [...] Paul's litmus test would work, because we guarantee to the programmer that it can assume that the mo_consume load would return any value allowed by the type; effectively, this forbids the compiler analysis Paul thought about: So realistically, since with the new wording we can ignore the silly cases (ie p-p) and we can ignore the trivial-to-optimize compiler cases (if (p == variable) .. use p), and you would forbid the global value range optimization case that Paul brought up, what remains would seem to be just really subtle compiler transformations of data dependencies to control dependencies. FWIW, I am looking through the kernel for instances of your first if (p == variable) .. use p litmus test. All the ones I have found thus far are OK for one of the following reasons: 1. The comparison was against NULL, so you don't get to dereference the pointer anyway. About 80% are in this category. 2. The comparison was against another pointer, but there were no dereferences afterwards. Here is an example of what these can look like: list_for_each_entry_rcu(p, head, next) if (p == variable) return; /* p goes out of scope. */ 3. The comparison was against another RCU-protected pointer, where that other pointer was properly fetched using one of the RCU primitives. Here it doesn't matter which pointer you use.
At least as long as the rcu_assign_pointer() for that other pointer happened after the last update to the pointed-to structure. I am a bit nervous about #3. Any thoughts on it? Some other reasons why it would be OK to dereference after a comparison: 4. The pointed-to data is constant: (a) It was initialized at boot time, (b) the update-side lock is held, (c) we are running in a kthread and the data was initialized before the kthread was created, (d) we are running in a module, and the data was initialized during or before module-init time for that module. And many more besides, involving pretty much every kernel primitive that makes something run later. 5. All subsequent dereferences
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 27, 2014 at 09:50:21AM -0800, Paul E. McKenney wrote: On Thu, Feb 27, 2014 at 04:37:33PM +0100, Torvald Riegel wrote: On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote: On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: Good points. How about the following replacements? 3. Adding or subtracting an integer to/from a chained pointer results in another chained pointer in that same pointer chain. The results of addition and subtraction operations that cancel the chained pointer's value (for example, p - (long)p where p is a pointer to char) are implementation defined. 4. Bitwise operators (&, |, ^, and I suppose also ~) applied to a chained pointer and an integer for the purposes of alignment and pointer translation results in another chained pointer in that same pointer chain. Other uses of bitwise operators on chained pointers (for example, p | ~0) are implementation defined. Quite frankly, I think all of this language that is about the actual operations is irrelevant and wrong. It's not going to help compiler writers, and it sure isn't going to help users that read this. Why not just talk about value chains and that any operations that restrict the value range severely end up breaking the chain. There is no point in listing the operations individually, because every single operation *can* restrict things. Listing individual operations and dependencies is just fundamentally wrong. [...] The *only* thing that matters for all of them is whether they are value-preserving, or whether they drop so much information that the compiler might decide to use a control dependency instead. That's true for every single one of them.
Similarly, actual true control dependencies that limit the problem space sufficiently that the actual pointer value no longer has significant information in it (see the above example) are also things that remove information to the point that only a control dependency remains. Even when the value itself is not modified in any way at all. I agree that just considering syntactic properties of the program seems to be insufficient. Making it instead depend on whether there is a semantic dependency due to a value being necessary to compute a result seems better. However, whether a value is necessary might not be obvious, and I understand Paul's argument that he does not want to have to reason about all potential compiler optimizations. Thus, I believe we need to specify when a value is necessary. I have a suggestion for a somewhat different formulation of the feature that you seem to have in mind, which I'll discuss below. Excuse the verbosity of the following, but I'd rather like to avoid misunderstandings than save a few words. Thank you very much for putting this forward! I must confess that I was stuck, and my earlier attempt now enshrined in the C11 and C++11 standards is quite clearly way bogus. One possible saving grace: From discussions at the standards committee meeting a few weeks ago, there is some chance that the committee will be willing to do a rip-and-replace on the current memory_order_consume wording, without provisions for backwards compatibility with the current bogosity. What we'd like to capture is that a value originating from a mo_consume load is necessary for a computation (e.g., it cannot be replaced with value predictions and/or control dependencies); if that's the case in the program, we can reasonably assume that a compiler implementation will transform this into a data dependency, which will then lead to ordering guarantees by the HW. However, we need to specify when a value is necessary.
We could say that this is implementation-defined, and use a set of litmus tests (e.g., like those discussed in the thread) to roughly carve out what a programmer could expect. This may even be practical for a project like the Linux kernel that follows strict project-internal rules and pays a lot of attention to what the particular implementations of compilers expected to compile the kernel are doing. However, I think this approach would be too vague for the standard and for many other programs/projects. I agree that a number of other projects would have more need for this than might the kernel. Please understand that this is in no way denigrating the intelligence of other projects' members. It is just that many of them have only recently started seriously thinking about concurrency. In contrast, the Linux kernel community has been doing concurrency since the mid-1990s. Projects with less experience with concurrency will probably need more help, from the compiler
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: 3. The comparison was against another RCU-protected pointer, where that other pointer was properly fetched using one of the RCU primitives. Here it doesn't matter which pointer you use. At least as long as the rcu_assign_pointer() for that other pointer happened after the last update to the pointed-to structure. I am a bit nervous about #3. Any thoughts on it? I think that it might be worth pointing out as an example, and saying that code like p = atomic_read(consume); X; q = atomic_read(consume); Y; if (p == q) data = p->val; then the access of p->val is constrained to be data-dependent on *either* p or q, but you can't really tell which, since the compiler can decide that the values are interchangeable. I cannot for the life of me come up with a situation where this would matter, though. If X contains a fence, then that fence will be a stronger ordering than anything the consume through p would guarantee anyway. And if X does *not* contain a fence, then the atomic reads of p and q are unordered *anyway*, so then whether the ordering to the access through p is through p or q is kind of irrelevant. No? I can make a contrived litmus test for it, but you are right, the only time you can see it happen is when X has no barriers, in which case you don't have any ordering anyway -- both the compiler and the CPU can reorder the loads into p and q, and the read from p->val can, as you say, come from either pointer. For whatever it is worth, here is the litmus test: T1: p = kmalloc(...); if (p == NULL) deal_with_it(); p->a = 42; /* Each field in its own cache line. */ p->b = 43; p->c = 44; atomic_store_explicit(&gp1, p, memory_order_release); p->b = 143; p->c = 144; atomic_store_explicit(&gp2, p, memory_order_release); T2: p = atomic_load_explicit(&gp2, memory_order_consume); r1 = p->b; /* Guaranteed to get 143. 
*/ q = atomic_load_explicit(&gp1, memory_order_consume); if (p == q) { /* The compiler decides that q->c is same as p->c. */ r2 = p->c; /* Could get 44 on weakly ordered system. */ } The loads from gp1 and gp2 are, as you say, unordered, so you get what you get. And publishing a structure via one RCU-protected pointer, updating it, then publishing it via another pointer seems to me to be asking for trouble anyway. If you really want to do something like that and still see consistency across all the fields in the structure, please put a lock in the structure and use it to guard updates and accesses to those fields. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, Feb 26, 2014 at 02:04:30PM +0100, Torvald Riegel wrote: On Fri, 2014-02-21 at 11:13 -0800, Paul E. McKenney wrote: On Fri, Feb 21, 2014 at 07:35:37PM +0100, Michael Matz wrote: Hi, On Thu, 20 Feb 2014, Linus Torvalds wrote: But I'm pretty sure that any compiler guy must *hate* that current odd dependency-generation part, and if I was a gcc person, seeing that bugzilla entry Torvald pointed at, I would personally want to dismember somebody with a rusty spoon.. Yes. Defining dependency chains in the way the standard currently seems to do must come from people not writing compilers. There's simply no sensible way to implement it without being really conservative, because the depchains can contain arbitrary constructs including stores, loads and function calls but must still be observed. And with conservative I mean everything is a source of a dependency, and hence can't be removed, reordered or otherwise fiddled with, and that includes code sequences where no atomic objects are anywhere in sight [1]. In the light of that the only realistic way (meaning to not have to disable optimization everywhere) to implement consume as currently specified is to map it to acquire. At which point it becomes pointless. No, only memory_order_consume loads and [[carries_dependency]] function arguments are sources of dependency chains. However, that is, given how the standard specifies things, just one of the possible ways for how an implementation can handle this. Treating [[carries_dependency]] as a necessary annotation to make exploiting mo_consume work in practice is possible, but it's not required by the standard. 
Also, dependencies are specified to flow through loads and stores (restricted to scalar objects and bitfields), so any load that might load from a dependency-carrying store can also be a source (and that doesn't seem to be restricted by [[carries_dependency]]). OK, this last is clearly totally unacceptable. :-/ Leaving aside the option of dropping the whole thing for the moment, the only thing that suggests itself is having all dependencies die at a specific point in the code, corresponding to the rcu_read_unlock(). But as far as I can see, that absolutely requires necessary parameter and return marking in order to correctly handle nested RCU read-side critical sections in different functions. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 24, 2014 at 10:05:52PM -0800, Linus Torvalds wrote: On Mon, Feb 24, 2014 at 3:35 PM, Linus Torvalds torva...@linux-foundation.org wrote: Litmus test 1: p = atomic_read(pp, consume); if (p == &variable) return p->val; is *NOT* ordered Btw, don't get me wrong. I don't _like_ it not being ordered, and I actually did spend some time thinking about my earlier proposal on strengthening the 'consume' ordering. Understood. I have for the last several years been 100% convinced that the Intel memory ordering is the right thing, and that people who like weak memory ordering are wrong and should try to avoid reproducing if at all possible. But given that we have memory orderings like power and ARM, I don't actually see a sane way to get a good strong ordering. You can teach compilers about cases like the above when they actually see all the code and they could poison the value chain etc. But it would be fairly painful, and once you cross object files (or even just functions in the same compilation unit, for that matter), it goes from painful to just ridiculously not worth it. And I have indeed seen a post or two from you favoring stronger memory ordering over the past few years. ;-) So I think the C semantics should mirror what the hardware gives us - and do so even in the face of reasonable optimizations - not try to do something else that requires compilers to treat consume very differently. I am sure that a great many people would jump for joy at the chance to drop any and all RCU-related verbiage from the C11 and C++11 standards. (I know, you aren't necessarily advocating this, but given what you say above, I cannot think what verbiage that would remain.) The thing that makes me very nervous is how much the definition of reasonable optimization has changed. For example, before the 2.6.10 Linux kernel, we didn't even apply volatile semantics to fetches of RCU-protected pointers -- and as far as I know, never needed to. 
But since then, there have been several cases where the compiler happily hoisted a normal load out of a surprisingly large loop. Hardware advances can come into play as well. For example, my very first RCU work back in the early 90s was on a parallel system whose CPUs had no branch-prediction hardware (80386 or 80486, I don't remember which). Now people talk about compilers using branch prediction hardware to implement value-speculation optimizations. Five or ten years from now, who knows what crazy optimizations might be considered to be completely reasonable? Are ARM and Power really the bad boys here? Or are they instead playing the role of the canary in the coal mine? If people made me king of the world, I'd outlaw weak memory ordering. You can re-order as much as you want in hardware with speculation etc, but you should always *check* your speculation and make it *look* like you did everything in order. Which is pretty much the intel memory ordering (ignoring the write buffering). Speaking as someone who got whacked over the head with DEC Alpha when first presenting RCU to the Digital UNIX folks long ago, I do have some sympathy with this line of thought. But as you say, it is not the world we currently live in. Of course, in the final analysis, your kernel, your call. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 25, 2014 at 05:47:03PM -0800, Linus Torvalds wrote: On Mon, Feb 24, 2014 at 10:00 PM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: So let me see if I understand your reasoning. My best guess is that it goes something like this: 1. The Linux kernel contains code that passes pointers from rcu_dereference() through external functions. No, actually, it's not so much Linux-specific at all. I'm actually thinking about what I'd do as a compiler writer, and as a defender of the "C is a high-level assembler" concept. I love C. I'm a huge fan. I think it's a great language, and I think it's a great language not because of some theoretical issues, but because it is the only language around that actually maps fairly well to what machines really do. And it's a *simple* language. Sure, it's not quite as simple as it used to be, but look at how thin the K&R book is. Which pretty much describes it - still. That's the real strength of C, and why it's the only language serious people use for system programming. Ignore C++ for a while (Jesus Xavier Christ, I've had to do C++ programming for subsurface), and just think about what makes _C_ a good language. The last time I used C++ for a project was in 1990. It was a lot smaller then. I can look at C code, and I can understand what the code generation is, and what it will really *do*. And I think that's important. Abstractions that hide what the compiler will actually generate are bad abstractions. And ok, so this is obviously Linux-specific in that it's generally only Linux where I really care about the code generation, but I do think it's a bigger issue too. So I want C features to *map* to the hardware features they implement. The abstractions should match each other, not fight each other. OK... Actually, the fact that there are more potential optimizations than I can think of is a big reason for my insistence on the carries-a-dependency crap. 
My lack of optimization omniscience makes me very nervous about relying on there never ever being a reasonable way of computing a given result without preserving the ordering. But if I can give two clear examples that are basically identical from a syntactic standpoint, and one clearly can be trivially optimized to the point where the ordering guarantee goes away, and the other cannot, and you cannot describe the difference, then I think your description is seriously lacking. In my defense, my plan was to constrain the compiler to retain the ordering guarantee in either case. Yes, I did notice that you find that unacceptable. And I do *not* think the C language should be defined by how it can be described. Leave that to things like Haskell or LISP, where the goal is some kind of completeness of the language that is about the language, not about the machines it will run on. I am with you up to the point that the fancy optimizers start kicking in. I don't know how to describe what the optimizers are and are not permitted to do strictly in terms of the underlying hardware. So the code sequence I already mentioned is *not* ordered: Litmus test 1: p = atomic_read(pp, consume); if (p == &variable) return p->val; is *NOT* ordered, because the compiler can trivially turn this into return variable.val, and break the data dependency. Right, given your model, the compiler is free to produce code that doesn't order the load from pp against the load from p->val. Yes. Note also that that is what existing compilers would actually do. And they'd do it by mistake: they'd load the address of the variable into a register, and then compare the two registers, and then end up using _one_ of the registers as the base pointer for the p->val access, but I can almost *guarantee* that there are going to be sequences where some compiler will choose one register over the other based on some random detail. So my model isn't just a model, it also happens to describe reality. 
Sounds to me like your model -is- reality. I believe that it is useful to constrain reality from time to time, but understand that you vehemently disagree. Indeed, it won't work across different compilation units unless the compiler is told about it, which is of course the whole point of [[carries_dependency]]. Understood, though, the Linux kernel currently does not have anything that could reasonably automatically generate those [[carries_dependency]] attributes. (Or are there other reasons why you believe [[carries_dependency]] is problematic?) So I think carries_dependency is problematic because: - it's not actually in C11 afaik Indeed it is not, but I bet that gcc will implement it like it does the other attributes that are not part of C11. - it requires the programmer to solve the problem of the standard not matching the hardware. The programmer in this instance being the compiler writer? - I
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 25, 2014 at 10:06:53PM -0500, George Spelvin wrote: paul...@linux.vnet.ibm.com wrote: torva...@linux-foundation.org wrote: I have for the last several years been 100% convinced that the Intel memory ordering is the right thing, and that people who like weak memory ordering are wrong and should try to avoid reproducing if at all possible. Are ARM and Power really the bad boys here? Or are they instead playing the role of the canary in the coal mine? To paraphrase some older threads, I think Linus's argument is that weak memory ordering is like branch delay slots: a way to make a simple implementation simpler, but ends up being no help to a more aggressive implementation. Branch delay slots give a one-cycle bonus to in-order cores, but once you go superscalar and add branch prediction, they stop helping, and once you go full out of order, they're just an annoyance. Likewise, I can see the point that weak ordering can help make a simple cache interface simpler, but once you start doing speculative loads, you've already bought and paid for all the hardware you need to do stronger coherency. Another thing that requires all the strong-coherency machinery is a high-performance implementation of the various memory barrier and synchronization operations. Yes, a low-performance (drain the pipeline) implementation is tolerable if the instructions aren't used frequently, but once you're really trying, it doesn't save complexity. Once you're there, strong coherency doesn't actually cost you any time outside of critical synchronization code, and it both simplifies and speeds up the tricky synchronization software. So PPC and ARM's weak ordering are not the direction the future is going. Rather, weak ordering is something that's only useful in a limited technology window, which is rapidly passing. That does indeed appear to be Intel's story. Might well be correct. Time will tell. 
If you can find someone in IBM who's worked on the Z series cache coherency (extremely strong ordering), they probably have some useful insights. The big question is if strong ordering, once you've accepted the implementation complexity and area, actually costs anything in execution time. If there's an unavoidable cost which weak ordering saves, that's significant. There has been a lot of ink spilled on this argument. ;-) PPC has much larger CPU counts than does the mainframe. On the other hand, there are large x86 systems. Some claim that there are differences in latency due to the different approaches, and there could be a long argument about whether all this is inherent in the memory ordering or whether it is due to implementation issues. I don't claim to know the answer. I do know that ARM and PPC are here now, and that I need to deal with them. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 25, 2014 at 08:32:38PM -0700, Jeff Law wrote: On 02/25/14 17:15, Paul E. McKenney wrote: I have for the last several years been 100% convinced that the Intel memory ordering is the right thing, and that people who like weak memory ordering are wrong and should try to avoid reproducing if at all possible. But given that we have memory orderings like power and ARM, I don't actually see a sane way to get a good strong ordering. You can teach compilers about cases like the above when they actually see all the code and they could poison the value chain etc. But it would be fairly painful, and once you cross object files (or even just functions in the same compilation unit, for that matter), it goes from painful to just ridiculously not worth it. And I have indeed seen a post or two from you favoring stronger memory ordering over the past few years. ;-) I couldn't agree more. Are ARM and Power really the bad boys here? Or are they instead playing the role of the canary in the coal mine? That's a question I've been struggling with recently as well. I suspect they (arm, power) are going to be the outliers rather than the canary. While the weaker model may give them some advantages WRT scalability, I don't think it'll ultimately be enough to overcome the difficulty in writing correct low level code for them. Regardless, they're here and we have to deal with them. Agreed... Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 24, 2014 at 05:55:50PM +0100, Michael Matz wrote: Hi, On Mon, 24 Feb 2014, Linus Torvalds wrote: To me that reads like int i; int *q = &i; int **p = &q; atomic_XXX (&p, CONSUME); orders against accesses '*p', '**p', '*q' and 'i'. Thus it seems they want to say that it orders against aliased storage - but then go further and include indirectly through a chain of pointers?! Thus an atomic read of an int * orders against any 'int' memory operation but not against 'float' memory operations? No, it's not about type at all, and the chain of pointers can be much more complex than that, since the int * can point to within an object that contains other things than just that int (the int can be part of a structure that then has pointers to other structures etc). So, let me try to poke holes into your definition or increase my understanding :) . You said chain of pointers (dereferences, I assume), e.g. if p is result of consume load, then access to p->here->there->next->prev->stuff is supposed to be ordered with that load (or only when that last load/store itself is also an atomic load or store?). So, what happens if the pointer deref chain is partly hidden in some functions: A * adjustptr (B *ptr) { return ptr->here->there->next; } B * p = atomic_XXX (&somewhere, consume); adjustptr(p)->prev->stuff = bla; As far as I understood you, this whole ptrderef chain business would be only an optimization opportunity, right? So if the compiler can't be sure how p is actually used (as in my function-using case, assume adjustptr is defined in another unit), then the consume load would simply be transformed into an acquire (or whatever, with some barrier I mean)? Only _if_ the compiler sees all obvious uses of p (indirectly through pointer derefs) can it, yeah, do what with the consume load? Good point, I left that out of my list. Adding it: 13. By default, pointer chains do not propagate into or out of functions. 
In implementations having attributes, a [[carries_dependency]] may be used to mark a function argument or return as passing a pointer chain into or out of that function. If a function does not contain memory_order_consume loads and also does not contain [[carries_dependency]] attributes, then that function may be compiled using any desired dependency-breaking optimizations. The ordering effects are implementation defined when a given pointer chain passes into or out of a function through a parameter or return not marked with a [[carries_dependency]] attribute. Note that this last paragraph differs from the current standard, which would require ordering regardless. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 24, 2014 at 02:55:07PM +0100, Michael Matz wrote: Hi, On Fri, 21 Feb 2014, Paul E. McKenney wrote: And with conservative I mean everything is a source of a dependency, and hence can't be removed, reordered or otherwise fiddled with, and that includes code sequences where no atomic objects are anywhere in sight [1]. In the light of that the only realistic way (meaning to not have to disable optimization everywhere) to implement consume as currently specified is to map it to acquire. At which point it becomes pointless. No, only memory_order_consume loads and [[carries_dependency]] function arguments are sources of dependency chains. I don't see [[carries_dependency]] in the C11 final draft (yeah, should get a real copy, I know, but let's assume it's the same language as the standard). Therefore, yes, only consume loads are sources of dependencies. The problem with the definition of the carries a dependency relation is not the sources, but rather where it stops. It's transitively closed over value of evaluation A is used as operand in evaluation B, with very few exceptions as per 5.1.2.4#14. Evaluations can contain function calls, so if there's _any_ chance that an operand of an evaluation might even indirectly use something resulting from a consume load then that evaluation must be compiled in a way to not break dependency chains. I don't see a way to generally assume that e.g. the value of a function argument can impossibly result from a consume load, therefore the compiler must assume that all function arguments _can_ result from such loads, and so must disable all depchain breaking optimization (which are many). [1] Simple example of what type of transformations would be disallowed: int getzero (int i) { return i - i; } This needs to be as follows: [[carries_dependency]] int getzero(int i [[carries_dependency]]) { return i - i; } Otherwise dependencies won't get carried through it. 
So, with the above do you agree that in absence of any other magic (see below) the compiler is not allowed to transform my initial getzero() (without the carries_dependency markers) implementation into return 0; because of the C11 rules for carries-a-dependency? If so, do you then also agree that the specification of carries a dependency is somewhat, err, shall we say, overbroad? From what I can see, overbroad. The problem is that the C++11 standard defines how carries-dependency interacts with function calls and returns in 7.6.4, which describes the [[carries_dependency]] attribute. For example, 7.6.4p6 says: Function g's second parameter has a carries_dependency attribute, but its first parameter does not. Therefore, function h's first call to g carries a dependency into g, but its second call does not. The implementation might need to insert a fence prior to the second call to g. When C11 declined to take attributes, they also left out the part saying how carries-dependency interacts with functions. :-/ Might be fixed by now, checking up on it. One could argue that the bit about emitting fence instructions at function calls and returns is implied by the as-if rule even without this wording, but... depchains don't matter, could _then_ optimize it to zero. But that's insane, especially considering that it's hard to detect if a given context doesn't care for depchains, after all the depchain relation is constructed exactly so that it bleeds into nearly everywhere. So we would most of the time have to assume that the ultimate context will be depchain-aware and therefore disable many transformations. Any function that does not contain a memory_order_consume load and that doesn't have any arguments marked [[carries_dependency]] can be optimized just as before. And as such marker doesn't exist we must conservatively assume that it's on _all_ parameters, so I'll stand by my claim. 
Or that you have to emit a fence instruction when a dependency chain enters or leaves a function in cases where all callers/callees are not visible to the compiler. My preference is that the ordering properties of a carries-dependency chain is implementation defined at the point that it enters or leaves a function without the marker, but others strongly disagreed. ;-) Then inlining getzero would merely add another # j.dep = i.dep relation, so depchains are still there but the value optimization can happen before inlining. Having to do something like that I'd find disgusting, and rather rewrite consume into acquire :) Or make the depchain relation somehow realistically implementable. I was actually OK with arithmetic cancellation breaking the dependency chains. Others on the committee felt otherwise, and I figured that (1) I wouldn't be writing that kind of function anyway and (2) they knew
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 24, 2014 at 09:28:56AM -0800, Paul E. McKenney wrote: [ . . . full quote of the preceding message elided . . . ] Good point, I left that out of my list. Adding it: 13. By default, pointer chains do not propagate into or out of functions. 
In implementations having attributes, a [[carries_dependency]] attribute may be used to mark a function argument or return as passing a pointer chain into or out of that function. If a function does not contain memory_order_consume loads and also does not contain [[carries_dependency]] attributes, then that function may be compiled using any desired dependency-breaking optimizations. The ordering effects are implementation defined when a given pointer chain passes into or out of a function through a parameter or return not marked with a [[carries_dependency]] attribute. Note that this last paragraph differs from the current standard, which would require ordering regardless. And there is also kill_dependency(), which needs to be added to the list in #8 of operators that take a chained pointer and return something that is not a chained pointer. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 24, 2014 at 09:38:46AM -0800, Linus Torvalds wrote: On Mon, Feb 24, 2014 at 8:55 AM, Michael Matz m...@suse.de wrote: So, let me try to poke holes into your definition or increase my understanding :) . You said chain of pointers (dereferences, I assume), e.g. if p is the result of a consume load, then access to p->here->there->next->prev->stuff is supposed to be ordered with that load (or only when that last load/store itself is also an atomic load or store?). It's supposed to be ordered wrt the first load (the consuming one), yes. So, what happens if the pointer deref chain is partly hidden in some functions: No problem. The thing is, the ordering is actually handled by the CPU in all relevant cases. So the compiler doesn't actually need to *do* anything. All this legalistic stuff is just to describe the semantics and the guarantees. The problem is two cases: (a) alpha (which doesn't really order any accesses at all, not even dependent loads), but for a compiler alpha is actually trivial: just add a rmb instruction after the load, and you can't really do anything else (there's a few optimizations you can do wrt the rmb, but they are very specific and simple). So (a) is a problem, but the solution is actually really simple, and gives very *strong* guarantees: on alpha, a consume ends up being basically the same as a read barrier after the load, with only very minimal room for optimization. (b) ARM and powerpc and similar architectures, that guarantee the data dependency as long as it is an *actual* data dependency, and never becomes a control dependency. On ARM and powerpc, control dependencies do *not* order accesses (the reasons boil down to essentially: branch prediction breaks the dependency, and instructions that come after the branch can be happily executed before the branch). But it's almost impossible to describe that in the standard, since compilers can (and very much do) turn a control dependency into a data dependency and vice versa. 
So the current standard tries to describe that control vs data dependency, and tries to limit it to a data dependency. It fails. It fails for multiple reasons - it doesn't allow for trivial optimizations that just remove the data dependency, and it also doesn't allow for various trivial cases where the compiler *does* turn the data dependency into a control dependency. So I really really think that the current C standard language is broken. Unfixably broken. I'm trying to change the syntactic data dependency that the current standard uses into something that is clearer and correct. The chain of pointers thing is still obviously a data dependency, but by limiting it to pointers, it simplifies the language, clarifies the meaning, avoids all syntactic tricks (ie p-p is clearly a syntactic dependency on p, but does *not* involve in any way following the pointer) and makes it basically impossible for the compiler to break the dependency without doing value prediction, and since value prediction has to be disallowed anyway, that's a feature, not a bug. OK, good point, please ignore my added thirteenth item in the list. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 24, 2014 at 10:14:01AM -0800, Linus Torvalds wrote: On Mon, Feb 24, 2014 at 9:21 AM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: 4. Bitwise operators (&, |, ^, and I suppose also ~) applied to a chained pointer and an integer results in another chained pointer in that same pointer chain. No. You cannot define it this way. Taking the value of a pointer and doing a bitwise operation that throws away all the bits (or even *most* of the bits) results in the compiler easily being able to turn the chain into a non-chain. The obvious example being val & 0, but things like val & 1 are in practice also something that compilers easily turn into control dependencies instead of data dependencies. Indeed, most of the bits need to remain for this to work. So you can talk about things like aligning the pointer value to object boundaries etc, but it really cannot and *must* not be about the syntactic operations. The same goes for adding and subtracting an integer. The *syntax* doesn't matter. It's about remaining information. Doing p-(int)p or p+(-(int)p) doesn't leave any information despite being subtracting and adding an integer at a syntactic level. Syntax is meaningless. Really. Good points. How about the following replacements? 3. Adding or subtracting an integer to/from a chained pointer results in another chained pointer in that same pointer chain. The results of addition and subtraction operations that cancel the chained pointer's value (for example, p-(long)p where p is a pointer to char) are implementation defined. 4. Bitwise operators (&, |, ^, and I suppose also ~) applied to a chained pointer and an integer for the purposes of alignment and pointer translation results in another chained pointer in that same pointer chain. Other uses of bitwise operators on chained pointers (for example, p|~0) are implementation defined. 8. 
Applying any of the following operators to a chained pointer results in something that is not a chained pointer: (), sizeof, !, *, /, %, <<, >>, <, >, <=, >=, ==, !=, &&, and ||. Parentheses? I'm assuming that you mean calling through the chained pointer. Yes, good point. Of course, parentheses for grouping just pass the value through without affecting the chained-ness. Also, I think all of /, * and % are perfectly fine, and might be used for that aligning the pointer operation that is fine. Something like this? char *p; p = p - (unsigned long)p % 8; I was thinking of this as subtraction -- the p gets moduloed by 8, which loses the chained-pointer designation. But that is OK because that designation gets folded back in by the subtraction. Am I missing a use case? That leaves things like this one: p = (p / 8) * 8; I cannot think of any other legitimate use for / and *. Here is an updated #8 and a new 8a: 8. Applying any of the following operators to a chained pointer results in something that is not a chained pointer: function call (), sizeof, !, %, <<, >>, <, >, <=, >=, ==, !=, &&, ||, and kill_dependency(). 8a. Dividing a chained pointer by an integer and multiplying it by that same integer (for example, to align that pointer) results in a chained pointer in that same pointer chain. The ordering effects of other uses of infix * and / on chained pointers are implementation defined. Does that capture it? Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 24, 2014 at 03:35:04PM -0800, Linus Torvalds wrote: On Mon, Feb 24, 2014 at 2:37 PM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: What if the nothing modifies 'p' part looks like this: if (p != &myvariable) return; and now any sane compiler will happily optimize q = *p into q = myvariable, and we're all done - nothing invalid was ever Yes, the compiler could do that. But it would still be required to carry a dependency from the memory_order_consume read to the *p, But that's *BS*. You didn't actually listen to the main issue. Paul, why do you insist on this carries-a-dependency crap? Sigh. Read on... It's broken. If you don't believe me, then believe the compiler person who already piped up and told you so. The carries a dependency model is broken. Get over it. No sane compiler will ever distinguish two different registers that have the same value from each other. No sane compiler will ever say ok, register r1 has the exact same value as register r2, but r2 carries the dependency, so I need to make sure to pass r2 to that function or use it as a base pointer. And nobody sane should *expect* a compiler to distinguish two registers with the same value that way. So the whole model is broken. I gave an alternate model (the restrict), and you didn't seem to understand the really fundamental difference. It's not a language difference, it's a conceptual difference. In the broken carries a dependency model, you have to fight all those aliases that can have the same value, and it is not a fight you can win. We've had the p-p examples, we've had the p&0 examples, but the fact is, that p==&myvariable example IS EXACTLY THE SAME THING. All three of those things: p-p, p&0, and p==&myvariable mean that any compiler worth its salt now knows that p carries no information, and will optimize it away. So please stop arguing against that. Whenever you argue against that simple fact, you are arguing against sane compilers. So let me see if I understand your reasoning. 
My best guess is that it goes something like this: 1. The Linux kernel contains code that passes pointers from rcu_dereference() through external functions. 2. Code in the Linux kernel expects the normal RCU ordering guarantees to be in effect even when external functions are involved. 3. When compiling one of these external functions, the C compiler has no way of knowing about these RCU ordering guarantees. 4. The C compiler might therefore apply any and all optimizations to these external functions. 5. This in turn implies that the only way to prohibit any given optimization from being applied to the results obtained from rcu_dereference() is to prohibit that optimization globally. 6. We have to be very careful what optimizations are globally prohibited, because a poor choice could result in unacceptable performance degradation. 7. Therefore, the only operations that can be counted on to maintain the needed RCU orderings are those where the compiler really doesn't have any choice, in other words, where any reasonable way of computing the result will necessarily maintain the needed ordering. Did I get this right, or am I confused? So *accept* the fact that some operations (and I guarantee that there are more of those than you can think of, and you can create them with various tricks using pretty much *any* feature in the C language) essentially take the data information away. And just accept the fact that then the ordering goes away too. Actually, the fact that there are more potential optimizations than I can think of is a big reason for my insistence on the carries-a-dependency crap. My lack of optimization omniscience makes me very nervous about relying on there never ever being a reasonable way of computing a given result without preserving the ordering. So give up on carries a dependency. Because there will be cases where that dependency *isn't* carried. 
The language of the standard needs to get *away* from the broken model, because otherwise the standard is broken. I suggest we instead talk about litmus tests and why certain code sequences are ordered, and others are not. OK... So the code sequence I already mentioned is *not* ordered: Litmus test 1: p = atomic_read(pp, consume); if (p == &variable) return p->val; is *NOT* ordered, because the compiler can trivially turn this into return variable.val, and break the data dependency. Right, given your model, the compiler is free to produce code that doesn't order the load from pp against the load from p->val. This is true *regardless* of any carries a dependency language, because that language is insane, and doesn't work when the different pieces here may be in different compilation units. Indeed
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sat, Feb 22, 2014 at 07:50:35PM -0800, Linus Torvalds wrote: On Sat, Feb 22, 2014 at 4:39 PM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: Agreed, by far the most frequent use is -> to dereference and assignment to store into a local variable. The other operations where the kernel expects ordering to be maintained are: o Bitwise & to strip off low-order bits. The FIB tree does this, for example in fib_table_lookup() in net/ipv4/fib_trie.c. The low-order bit is used to distinguish internal nodes from leaves -- nodes and leaves are different types of structures. (There are a few others.) Note that this is very much outside the scope of the C standard, regardless of 'consume' or not. We'll always do things outside the standard, so I wouldn't worry. Fair enough. I am going to make examples to try to make sure that we aren't using the same words for two different meanings: struct foo restrict *p; p = atomic_load_explicit(&gp, memory_order_consume); p &= ~0x3; /* OK because p has restrict property. */ return p->a; /* Ordered WRT above load. */ Of course, the compiler would have to work pretty hard to break this ordering, so I am with you in not worrying too much about this one. o Uses ?: to substitute defaults in case of NULL pointers, but ordering must be maintained in the non-default case. Most, perhaps all, of these could be converted to if should ?: prove problematic. Note that this doesn't actually affect the restrict/ordering rule in theory: ?: isn't special according to those rules. The rules are fairly simple: we guarantee ordering only to the object that the pointer points to, and even that guarantee goes out the window if there is some *other* way to reach the object. ?: is not really relevant, except in the sense that *any* expression that ends up pointing to outside the object will lose the ordering guarantee. ?: can be one such expression, but so can p-p or anything like that. 
And in *practice*, the only thing that needs to be sure to generate special code is alpha, and there you'd just add the rmb after the load. That is sufficient to fulfill the guarantees. OK, seems reasonable. Reworking the earlier example: struct foo restrict *p; p = atomic_load_explicit(&gp, memory_order_consume); p = p ? p : default_foo; return p->a; /* Ordered WRT above load if p non-NULL. */ And the ordering makes sense only in the non-NULL case anyway, so this should be fine. On ARM and powerpc, the compiler obviously has to guarantee that it doesn't do value-speculation on the result, but again, that never really had anything to do with the whole carries a dependency, it is really all about the fact that in order to guarantee the ordering, the compiler mustn't generate that magical aliased pointer value. But if the aliased pointer value comes from the *source* code, all bets are off. Agreed, compiler-based value speculation is dangerous in any case. (Though the branch-predictor-based trick seems like it should be safe on TSO systems like x86, s390, etc.) Now, even on alpha, the compiler can obviously move that rmb around. For example, if there is a conditional after the atomic_read(mo_consume), and the compiler can tell that the pointer that got read by mo_consume is dead along one branch, then the compiler can move the rmb to only exist in the other branch. Why? Because we inherently guarantee only the order to any accesses to the object the pointer pointed to, and that the pointer that got loaded is the *only* way to get to that object (in this context), so if the value is dead, then so is the ordering. Yep! In fact, even if the value is *not* dead, but it is NULL, the compiler can validly say the NULL pointer cannot point to any object, so I don't have to guarantee any serialization. So code like this (writing alpha assembly, since in practice only alpha will ever care): ptr = atomic_read(pp, mo_consume); if (ptr) { ... do something with ptr .. 
} return ptr; can validly be translated to: ldq $1,0($2) beq $1,branch-over rmb .. the do-something code using register $1 .. because the compiler knows that a NULL pointer cannot be dereferenced, so it can decide to put the rmb in the non-NULL path - even though the pointer value is still *live* in the other branch (well, the liveness of a constant value is somewhat debatable, but you get the idea), and may be used by the caller (but since it is NULL, the use can not include accessing any object, only really testing) Agreed. So note how this is actually very different from the carries dependency rule. It's simpler, and it allows much more natural optimizations. I do have a couple more questions about it, but please see below. o Addition and subtraction to adjust both
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sun, Feb 23, 2014 at 11:31:25AM -0800, Linus Torvalds wrote: On Sat, Feb 22, 2014 at 10:34 PM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: Adding and subtracting integers to/from a RCU-protected pointer makes sense to me. Ack. And that's normal access to an object behavior anyway. And covers things like container_of() as well as array indexing and field selection, so good. Adding and subtracting integers to/from an RCU-protected integer makes sense in many practical cases, but I worry about the compiler figuring out that the RCU-protected integer cancelled with some other integer. I suspect that in practice, all the same normal rules apply - assuming that there aren't any aliasing integers, and that the compiler doesn't do any value prediction, the end result *should* be exactly the same as for the pointer arithmetic. So even if the standard were to talk about just pointers, I suspect that in practice, there really isn't anything that can go wrong. Yep, in practice in absence of the i-i code, it should work. As long as the integer returned from the memory_order_consume load is always kept in an integer tagged as restrict, I believe that it should be possible to make the standard guarantee it as well. I don't feel all that good about subtractions involving an RCU-protected pointer and another pointer, again due to the possibility of arithmetic optimizations cancelling everything. Actually, pointer subtraction is a pretty useful operation, even without the gotcha case of p-p just to force a fake dependency. Getting an index, or indeed just getting an offset within a structure, is valid and common, and people will and should do it. It doesn't really matter for my suggested language: it should be perfectly fine to do something like ptr = atomic_read(pp, mo_consume); index = ptr - array_base; .. pass off 'index' to some function, which then re-generates the ptr using it .. I believe that the devil is in the details of the regeneration of the pointer. 
Yet another example below. and the compiler will have no trouble generating code, and the suggested ordered wrt that object guarantee is that the eventual ordering is between the pointer load and the use in the function, regardless of how it got there (ie turning it into an index and back is perfectly fine). So both from a legalistic language wording standpoint, and from a compiler code generation standpoint, this is a non-issue. Now, if you actually lose information (ie some chain there drops enough data from the pointer that it cannot be recovered, partially or fully), and then regenerate the object value by faking it, and still end up accessing the right data, but without actually going through any of the pointer that you loaded, that falls under the restricted heading, and you must clearly at least partially have used other information. In which case the standard wording wouldn't guarantee anything at all. I agree in general, but believe that the pointer regeneration needs to be done with care. If you are adding the difference back into the original variable ptr read from the RCU-protected pointer, I have no problem with it. My concern is with more elaborate cases. For example: struct foo g1[N]; struct bar g2[N]; struct foo restrict *p1; struct bar *p2; int index; p1 = atomic_read_explicit(&gp1, memory_order_consume); index = (p1 - g1) / 2; /* Getting this into the BS syntax realm. */ p2 = g2 + index; The standard as currently worded would force ordering between the read into p1 and the read into p2. But it does so via dependency chains, so this is starting to feel to me like we are getting back into the business of tracing dependency chains. Of course, one can argue that parallel arrays are not all that useful in the Linux kernel, and one can argue that dividing indexes by two is beyond the pale. (Although heaps in dense arrays really do this sort of index arithmetic, I am having some trouble imagining why RCU protection would be useful in that case. 
Yes, I could make up something about replacing one heap locklessly with another, but...) I might well be in my usual overly paranoid mode, but this is the sort of thing that makes me a bit nervous. I would rather either have it bulletproof safe, or to say don't do that. Hmmm... I don't recall seeing a use case for subtraction involving an RCU-protected pointer and another pointer when I looked through all the rcu_dereference() invocations in the Linux kernel. Is there a use case that I am missing? I don't see that you would need to protect anything but the first read, so I think you need rcu_dereference() only on the initial pointer access. Let me try an example: struct bar { int a; unsigned b; }; struct foo { struct bar *next
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sun, Feb 23, 2014 at 05:35:28PM -0800, Linus Torvalds wrote: On Sun, Feb 23, 2014 at 5:16 PM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: (a) we've said 'q' is restricted, so there is no aliasing between q and the pointers b/c. So the compiler is free to move those accesses around the q = p->next access. Ah, if I understand you, very good! My example intentionally left q -not- restricted. No, I 100% agree with that. q is *not* restricted. But p is, since it came from that consuming load. But q = p->next is ordered by how something can alias p->next, not by 'q'! There is no need to restrict anything but 'p' for all of this to work. I cannot say I understand this last sentence right now from the viewpoint of the standard, but suspending disbelief for the moment... (And yes, given current compilers and CPUs, I agree that this should all work in practice. My concern is the legality, not the reality.) Btw, it's also worth pointing out that I do *not* in any way expect people to actually write the restrict keyword anywhere. So no need to change source code. Understood -- in this variant, you are taking the marking from the fact that there was an assignment from a memory_order_consume load rather than from a keyword on the assigned-to variable's declaration. What you have is a situation where the pointer coming out of the memory_order_consume is restricted. But if you assign it to a non-restricted pointer, that's *fine*. That's perfectly normal C behavior. The restrict concept is not something that the programmer needs to worry about or ever even notice, it's basically just a promise to the compiler that if somebody has another pointer lying around, accesses through that other pointer do not require ordering. So it sounds like you believe that the programmer would mark things restrict, and I did not mean that at all. Indeed I did believe that. 
I must confess that I was looking for an easy way to express in standardese -exactly- where the ordering guarantee did and did not propagate. The thing is that the vast majority of the Linux-kernel RCU code is more than happy with the guarantee only applying to fetches via the pointer returned from the memory_order_consume load. There are relatively few places where groups of structures are made visible to RCU readers via a single rcu_assign_pointer(). I guess I need to actually count them. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sat, Feb 22, 2014 at 07:30:37PM +0100, Torvald Riegel wrote: On Thu, 2014-02-20 at 10:18 -0800, Paul E. McKenney wrote: On Thu, Feb 20, 2014 at 06:26:08PM +0100, Torvald Riegel wrote: On Wed, 2014-02-19 at 20:01 -0800, Paul E. McKenney wrote: On Wed, Feb 19, 2014 at 04:53:49PM -0800, Linus Torvalds wrote: On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel trie...@redhat.com wrote: On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote: Can you point to it? Because I can find a draft standard, and it sure as hell does *not* contain any clarity of the model. It has a *lot* of verbiage, but it's pretty much impossible to actually understand, even for somebody who really understands memory ordering. http://www.cl.cam.ac.uk/~mjb220/n3132.pdf This has an explanation of the model up front, and then the detailed formulae in Section 6. This is from 2010, and there might have been smaller changes since then, but I'm not aware of any bigger ones. Ahh, this is different from what others pointed at. Same people, similar name, but not the same paper. I will read this version too, but from reading the other one and the standard in parallel and trying to make sense of it, it seems that I may have originally misunderstood part of the whole control dependency chain. The fact that the left side of ?:, &&, and || breaks data dependencies made me originally think that the standard tried very hard to break any control dependencies. Which I felt was insane, when then some of the examples literally were about the testing of the value of an atomic read. The data dependency matters quite a bit. The fact that the other Mathematical paper then very much talked about consume only in the sense of following a pointer made me think so even more. 
But reading it some more, I now think that the whole data dependency logic (which is where the special left-hand side rule of the ternary and logical operators come in) are basically an exception to the rule that sequence points end up being also meaningful for ordering (ok, so C11 seems to have renamed sequence points to sequenced before). So while an expression like atomic_read(p, consume) ? a : b; doesn't have a data dependency from the atomic read that forces serialization, writing if (atomic_read(p, consume)) a; else b; the standard *does* imply that the atomic read is happens-before wrt a, and I'm hoping that there is no question that the control dependency still acts as an ordering point. The control dependency should order subsequent stores, at least assuming that a and b don't start off with identical stores that the compiler could pull out of the if and merge. The same might also be true for ?: for all I know. (But see below) I don't think this is quite true. I agree that a conditional store will not be executed speculatively (note that if it would happen in both the then and the else branch, it's not conditional); so, the store in a; (assuming it would be a store) won't happen unless the thread can really observe a true value for p. However, this is *this thread's* view of the world, but not guaranteed to constrain how any other thread sees the state. mo_consume does not contribute to inter-thread-happens-before in the same way that mo_acquire does (which *does* put a constraint on i-t-h-b, and thus enforces a global constraint that all threads have to respect). Is it clear which distinction I'm trying to show here? If you are saying that the control dependencies are a result of a combination of the standard and the properties of the hardware that Linux runs on, I am with you. (As opposed to control dependencies being a result solely of the standard.) 
I'm not quite sure I understand what you mean :) Do you mean the control dependencies in the binary code, or the logical control dependencies in source programs? At present, the intersection of those two sets, but only including those control dependencies beginning with a memory_order_consume load or a [[carries_dependency]] function argument or return value. Or something like that. ;-) This was a deliberate decision in 2007 or so. At that time, the documentation on CPU memory orderings was pretty crude, and it was not clear that all relevant hardware respected control dependencies. Back then, if you wanted an authoritative answer even to a fairly simple memory
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sat, Feb 22, 2014 at 01:53:30PM -0800, Linus Torvalds wrote: On Sat, Feb 22, 2014 at 10:53 AM, Torvald Riegel trie...@redhat.com wrote: Stating that (1) the standard is wrong and (2) that you think that mo_consume semantics are not good is two different things. I do agree. They are two independent things. I think the standard is wrong, because it's overly complex, hard to understand, and nigh unimplementable. As shown by the bugzilla example, carries a dependency encompasses things that are *not* just synchronizing things just through a pointer, and as a result it's actually very complicated, since they could have been optimized away, or done in non-local code that wasn't even aware of the dependency carrying. That said, I'm reconsidering my suggested stricter semantics, because for RCU we actually do want to test the resulting pointer against NULL _without_ any implied serialization. So I still feel that the standard as written is fragile and confusing (and the bugzilla entry pretty much proves that it is also practically unimplementable as written), but strengthening the serialization may be the wrong thing. Within the kernel, the RCU use for this is literally purely about loading a pointer, and doing either: - testing its value against NULL (without any implied synchronization at all) - using it as a pointer to an object, and expecting that any accesses to that object are ordered wrt the consuming load. Agreed, by far the most frequent use is -> to dereference and assignment to store into a local variable. The other operations where the kernel expects ordering to be maintained are: o Bitwise & to strip off low-order bits. The FIB tree does this, for example in fib_table_lookup() in net/ipv4/fib_trie.c. The low-order bit is used to distinguish internal nodes from leaves -- nodes and leaves are different types of structures. (There are a few others.) o Uses ?: to substitute defaults in case of NULL pointers, but ordering must be maintained in the non-default case. 
Most, perhaps all, of these could be converted to if should ?: prove problematic. o Addition and subtraction to adjust both pointers to and indexes into RCU-protected arrays. There are not that many indexes, and they could be converted to pointers, but the addition and subtraction looks necessary in some cases. o Array indexing. The value from rcu_dereference() is used both before and inside the [], interestingly enough. o Casts along with unary & and *. That said, I did not see any code that depended on ordering through the function-call (), boolean complement !, comparison (only == and !=), logical operators (&& and ||), and the *, /, and % arithmetic operators. So I actually have a suggested *very* different model that people might find more acceptable. How about saying that the result of a atomic_read(a, mo_consume) is required to be a _restricted_ pointer type, and that the consume ordering guarantees the ordering between that atomic read and the accesses to the object that the pointer points to. No carries a dependency, no nothing. In the case of arrays, the object that the pointer points to is considered to be the full array, right? Now, there's two things to note in there: - the restricted pointer part means that the compiler does not need to worry about serialization to that object through other possible pointers - we have basically promised that the *only* pointer to that object comes from the mo_consume. So that part makes it clear that the consume ordering really only is valid wrt that particular pointer load. That could work, though there are some cases where a multi-linked structure is made visible using a single rcu_assign_pointer(), and rcu_dereference() is used only for the pointer leading to that multi-linked structure, not for the pointers among the elements making up that structure. One way to handle this would be to require rcu_dereference() to be used within the structure as well as upon first traversal to the structure. 
- the to the object that the pointer points to makes it clear that you can't use the pointer to generate arbitrary other values and claim to serialize that way. IOW, with those alternate semantics, that gcc bugzilla example is utterly bogus, and a compiler can ignore it, because while it tries to synchronize through the dependency chain created with that p-i+i expression, that is completely irrelevant when you use the above rules instead. In the bugzilla example, the object that *(p-i+i) accesses isn't actually the object pointed to by the pointer, so no serialization is implied. And if it actually *were* to be the same object, because p happens to have the same value as i, then the restrict part of the rule pops up and the compiler can again say that there is no ordering guarantee, since
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Feb 21, 2014 at 07:35:37PM +0100, Michael Matz wrote: Hi, On Thu, 20 Feb 2014, Linus Torvalds wrote: But I'm pretty sure that any compiler guy must *hate* that current odd dependency-generation part, and if I was a gcc person, seeing that bugzilla entry Torvald pointed at, I would personally want to dismember somebody with a rusty spoon.. Yes. Defining dependency chains in the way the standard currently seems to do must come from people not writing compilers. There's simply no sensible way to implement it without being really conservative, because the depchains can contain arbitrary constructs including stores, loads and function calls but must still be observed. And with conservative I mean everything is a source of a dependency, and hence can't be removed, reordered or otherwise fiddled with, and that includes code sequences where no atomic objects are anywhere in sight [1]. In the light of that the only realistic way (meaning to not have to disable optimization everywhere) to implement consume as currently specified is to map it to acquire. At which point it becomes pointless. No, only memory_order_consume loads and [[carries_dependency]] function arguments are sources of dependency chains. So I suspect there are a number of people who would be *more* than happy with a change to those odd dependency rules. I can't say much about your actual discussion related to semantics of atomics, not my turf. But the carries a dependency relation is not usefully implementable. Ciao, Michael. [1] Simple example of what type of transformations would be disallowed: int getzero (int i) { return i - i; } This needs to be as follows: [[carries_dependency]] int getzero(int i [[carries_dependency]]) { return i - i; } Otherwise dependencies won't get carried through it. Should be optimizable to return 0;, right? 
Not with "carries a dependency" in place:

int jeez (int idx) {
  int i = atomic_load(idx, memory_order_consume); // A
  int j = getzero (i);                            // B
  return array[j];                                // C
}

As I read "carries a dependency" there's a dependency from A to C. Now suppose we would optimize getzero in the obvious way, then inline, and boom, dependency gone. So we wouldn't be able to optimize any function when we don't control all its users, for fear that it _might_ be used in some dependency chain where it then matters that we possibly removed some chain elements due to the transformation. We would have to retain 'i-i' before inlining, and if the function then is inlined into a context where depchains don't matter, could _then_ optimize it to zero. But that's insane, especially considering that it's hard to detect if a given context doesn't care for depchains, after all the depchain relation is constructed exactly so that it bleeds into nearly everywhere. So we would most of the time have to assume that the ultimate context will be depchain-aware and therefore disable many transformations. Any function that does not contain a memory_order_consume load and that doesn't have any arguments marked [[carries_dependency]] can be optimized just as before. There'd be one solution to the above: we would have to invent some special operands and markers that explicitly model carries-a-dep, a la this:

int getzero (int i) {
  #RETURN.dep = i.dep
  return 0;
}

The above is already handled by the [[carries_dependency]] attribute, see above.

int jeez (int idx) {
  # i.dep = idx.dep
  int i = atomic_load(idx, memory_order_consume); // A
  # j.dep = i.dep
  int j = getzero (i);                            // B
  # RETURN.dep = j.dep + array.dep
  return array[j];                                // C
}

Then inlining getzero would merely add another # j.dep = i.dep relation, so depchains are still there but the value optimization can happen before inlining. 
Having to do something like that I'd find disgusting, and rather rewrite consume into acquire :) Or make the depchain relation somehow realistically implementable. I was actually OK with arithmetic cancellation breaking the dependency chains. Others on the committee felt otherwise, and I figured that (1) I wouldn't be writing that kind of function anyway and (2) they knew more about writing compilers than I. I would still be OK saying that things like i-i, i*0, i%1, i&0, i|~0 and so on just break the dependency chain. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Feb 21, 2014 at 06:28:05PM +, Peter Sewell wrote: On 20 February 2014 17:01, Linus Torvalds torva...@linux-foundation.org wrote: [ . . . ] So, if you make one of two changes to your example, then I will agree with you. No. We're not playing games here. I'm fed up with complex examples that make no sense. Nobody sane writes code that does that pointer comparison, and it is entirely immaterial what the compiler can do behind our backs. The C standard semantics need to make sense to the *user* (ie programmer), not to a CPU and not to a compiler. The CPU and compiler are tools. They don't matter. Their only job is to make the code *work*, dammit. So no idiotic made-up examples that involve code that nobody will ever write and that have subtle issues. So the starting point is that (same example as before, but with even clearer naming): Initialization state: initialized = 0; value = 0; Consumer: return atomic_read(initialized, consume) ? value : -1; Writer: value = 42; atomic_write(initialized, 1, release); and because the C memory ordering standard is written in such a way that this is subtly buggy (and can return 0, which is *not* logically a valid value), then I think the C memory ordering standard is broken. That consumer memory ordering is dangerous as hell, and it is dangerous FOR NO GOOD REASON. The trivial fix to the standard would be to get rid of all the carries a dependency crap, and just say that *anything* that depends on it is ordered wrt it. There are two difficulties with this, if I understand correctly what you're proposing. The first is knowing where to stop. If one includes all data and control dependencies, pretty much all the continuation of execution would depend on the consume read, so the compiler would eventually have to give up and insert a gratuitous barrier. 
One might imagine somehow annotating the return_expensive_system_value() you have below to say stop dependency tracking at the return (thereby perhaps enabling the compiler to optimise the barrier that you'd need in h/w to order the Linus-consume-read of initialised and the non-atomic read of calculated, replacing it by a compiler-introduced artificial dependency), and indeed that's roughly what the standard's kill_dependency does for consumes. One way to tell the compiler where to stop would be to place markers in the source code saying where dependencies stop. These markers could be provided by the definitions of the current rcu_read_unlock() tags in the Linux kernel (and elsewhere, for that matter). These would be overridden by [[carries_dependency]] tags on function arguments and return values, which is needed to handle the possibility of nested RCU read-side critical sections. The second is the proposal in later mails to use some notion of semantic dependency instead of this syntactic one. That's maybe attractive at first sight, but rather hard to define in a decent way in general. To define whether the consume load can really affect some subsequent value, you need to know about all the set of possible executions of the program - which is exactly what we have to define. For syntactic dependencies, in contrast, you can at least tell whether they exist by examining the source code you have in front of you. The fact that artificial dependencies like (x + y-y) are significant is (I guess) basically incidental at the C level - sometimes things like this are the best idiom to enforce ordering at the assembly level, but in C I imagine they won't normally arise. If they do, it might be nicer to have a more informative syntax, eg (x + dependency(y)). This was in fact one of the arguments put forward in favor of carrying dependencies through things like y-y back in the 2007-8 timeframe. 
Can't say that I am much of a fan of manually tagging all dependencies: Machines are much better at that sort of thing than are humans. But just out of curiosity, did you instead mean (x + dependency(y-y)) or some such? Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Feb 21, 2014 at 10:10:54PM +, Joseph S. Myers wrote: On Fri, 21 Feb 2014, Paul E. McKenney wrote: This needs to be as follows: [[carries_dependency]] int getzero(int i [[carries_dependency]]) { return i - i; } Otherwise dependencies won't get carried through it. C11 doesn't have attributes at all (and no specification regarding calls and dependencies that I can see). And the way I read the C++11 specification of carries_dependency is that specifying carries_dependency is purely about increasing optimization of the caller: that if it isn't specified, then the caller doesn't know what dependencies might be carried. Note: The carries_dependency attribute does not change the meaning of the program, but may result in generation of more efficient code. - end note. Good point -- I am so used to them being in gcc that I missed that. In which case, it seems to me that straight C11 is within its rights to emit a memory barrier just before control passes into a function that either it can't see or that it chose to apply dependency-breaking optimizations to. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, Feb 19, 2014 at 08:43:14PM -0800, Linus Torvalds wrote: On Wed, Feb 19, 2014 at 8:01 PM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: The control dependency should order subsequent stores, at least assuming that a and b don't start off with identical stores that the compiler could pull out of the if and merge. The same might also be true for ?: for all I know. (But see below) Stores I don't worry about so much because (a) you can't sanely move stores up in a compiler anyway (b) no sane CPU or compiler moves stores up, since they aren't on the critical path so a read-cmp-store is actually really hard to make anything sane re-order. I'm sure it can be done, and I'm sure it's stupid as hell. But that it's hard to screw up is *not* true for a load-cmp-load. So lets make this really simple: if you have a consume-cmp-read, is the ordering of the two reads guaranteed? Not as far as I know. Also, as far as I know, there is no difference between consume and relaxed in the consume-cmp-read case. As long as there is an unbroken chain of -data- dependencies from the consume to the later access in question, and as long as that chain doesn't go through the excluded operations, yes. So let's make it *really* specific, and make it real code doing a real operation, that is actually realistic and reasonable in a threaded environment, and may even be in some critical code. The issue is the read-side ordering guarantee for 'a' and 'b', for this case:

- Initial state:
  a = b = 0;
- Thread 1 (consumer):
  if (atomic_read(&a, consume))
    return b;
  /* not yet initialized */
  return -1;
- Thread 2 (initializer):
  b = some_value_lets_say_42;
  /* We are now ready to party */
  atomic_write(&a, 1, release);

and quite frankly, if there is no ordering guarantee between the read of a and the read of b in the consumer thread, then the C atomics standard is broken. Put another way: I claim that if thread 1 ever sees a return value other than -1 or 42, then the whole definition of atomics is broken. 
The above example can have a return value of 0 if translated straightforwardly into either ARM or Power, right? Both of these can speculate a read into a conditional, and both can translate a consume load into a plain load if data dependencies remain unbroken. So, if you make one of two changes to your example, then I will agree with you. The first change is to have a real data dependency between the read of a and the second read:

- Initial state:
  a = &c, b = 0; c = 0;
- Thread 1 (consumer):
  if (atomic_read(&a, consume))
    return *a;
  /* not yet initialized */
  return -1;
- Thread 2 (initializer):
  b = some_value_lets_say_42;
  /* We are now ready to party */
  atomic_write(&a, &b, release);

The second change is to make the consume be an acquire:

- Initial state:
  a = b = 0;
- Thread 1 (consumer):
  if (atomic_read(&a, acquire))
    return b;
  /* not yet initialized */
  return -1;
- Thread 2 (initializer):
  b = some_value_lets_say_42;
  /* We are now ready to party */
  atomic_write(&a, 1, release);

In theory, you could also change the return to a store, but the example gets a bit complicated and as far as I can tell you get into the state where the standard does not explicitly support it, even though I have a hard time imagining an actual implementation that fails to support it. Question 2: and what changes if the atomic_read() is turned into an acquire, and why? Does it start working? Yep, that is the second change above. Neither has a data-dependency guarantee, because there is no data dependency from the load to either a or b. After all, the value loaded got absorbed into the if condition. However, according to discussions earlier in this thread, the if variant would have a control-dependency ordering guarantee for any stores in a and b (but not loads!). So exactly what part of the standard allows the loads to be re-ordered, and why? 
Quite frankly, I'd think that any sane person will agree that the above code snippet is realistic, and that my requirement that thread 1 sees either -1 or 42 is valid. Unless I am really confused, weakly ordered systems would be just fine returning zero in your original example. In the case of ARM, I believe you need either a data dependency, an ISB after the branch, or a DMB instruction to force the ordering you want. In the case of PowerPC, I believe that you need either a data dependency, an isync after the branch, an lwsync, or a sync. I would not expect a C compiler to generate code with any of these precautions in place. And if the C standards body has said that control dependencies break the read ordering, then I really think that the C standards committee has screwed up. The control dependencies don't break the read ordering, but rather, they are insufficient to preserve the read
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 20, 2014 at 12:30:32AM -0800, Paul E. McKenney wrote: On Wed, Feb 19, 2014 at 08:43:14PM -0800, Linus Torvalds wrote: [ . . . ] So, if you make one of two changes to your example, then I will agree with you. The first change is to have a real data dependency between the read of a and the second read:

- Initial state:
  a = &c, b = 0; c = 0;
- Thread 1 (consumer):
  if (atomic_read(&a, consume))

And the above should be if (atomic_read(&a, consume) != &c). Sigh!!! Thanx, Paul

    return *a;
  /* not yet initialized */
  return -1;
- Thread 2 (initializer):
  b = some_value_lets_say_42;
  /* We are now ready to party */
  atomic_write(&a, &b, release);
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 20, 2014 at 09:01:06AM -0800, Linus Torvalds wrote: On Thu, Feb 20, 2014 at 12:30 AM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: So lets make this really simple: if you have a consume-cmp-read, is the ordering of the two reads guaranteed? Not as far as I know. Also, as far as I know, there is no difference between consume and relaxed in the consume-cmp-read case. Ok, quite frankly, I think that means that consume is misdesigned. The above example can have a return value of 0 if translated straightforwardly into either ARM or Power, right? Correct. And I think that is too subtle. It's dangerous, it makes code that *looks* correct work incorrectly, and it actually happens to work on x86 since x86 doesn't have crap-for-brains memory ordering semantics. So, if you make one of two changes to your example, then I will agree with you. No. We're not playing games here. I'm fed up with complex examples that make no sense. Hey, your original example didn't do what you wanted given the current standard. Those two modified examples are no more complex than your original, and are the closest approximations that I can come up with right off-hand that provide the result you wanted. Nobody sane writes code that does that pointer comparison, and it is entirely immaterial what the compiler can do behind our backs. The C standard semantics need to make sense to the *user* (ie programmer), not to a CPU and not to a compiler. The CPU and compiler are tools. They don't matter. Their only job is to make the code *work*, dammit. So no idiotic made-up examples that involve code that nobody will ever write and that have subtle issues. There are places in the Linux kernel that do both pointer comparisons and dereferences. Something like the following:

p = rcu_dereference(gp);
if (!p)
  p = &default_structure;
do_something_with(p->a, p->b);

In the typical case where default_structure was initialized early on, no ordering is needed in the !p case. 
So the starting point is that (same example as before, but with even clearer naming): Initialization state: initialized = 0; value = 0; Consumer: return atomic_read(initialized, consume) ? value : -1; Writer: value = 42; atomic_write(initialized, 1, release); and because the C memory ordering standard is written in such a way that this is subtly buggy (and can return 0, which is *not* logically a valid value), then I think the C memory ordering standard is broken. You really need that consume to be acquire. That consumer memory ordering is dangerous as hell, and it is dangerous FOR NO GOOD REASON. The trivial fix to the standard would be to get rid of all the carries a dependency crap, and just say that *anything* that depends on it is ordered wrt it. That just means that on alpha, consume implies an unconditional read barrier (well, unless the value is never used and is loaded just because it is also volatile), on x86, consume is the same as acquire which is just a plain load with ordering guarantees, and on ARM or power you can still avoid the extra synchronization *if* the value is used just for computation and for following pointers, but if the value is used for a comparison, there needs to be a synchronization barrier. Except in the default-value case, where no barrier is required. People really do comparisons on the pointers that they get back from rcu_dereference(), either to check for NULL (as above) or to check for a specific pointer. Ordering is not required in those cases. The cases that do require ordering from memory_order_consume values being used in an if statement should instead be memory_order_acquire. Notice? Getting rid of the stupid carries-dependency crap from the standard actually (a) simplifies the standard (b) means that the above obvious example *works* (c) does not in *any* way make for any less efficient code generation for the cases that consume works correctly for in the current mis-designed standard. 
(d) is actually a hell of a lot easier to explain to a compiler writer, and I can guarantee that it is simpler to implement too. Why do I claim (d) it is simpler to implement - because on ARM/power you can implement it *exactly* as a special acquire, with just a trivial peep-hole special case that follows the use chain of the acquire op to the consume, and then just drop the acquire bit if the only use is that compute-to-load chain. I don't believe that it is quite that simple. But yes, the compiler guys would be extremely happy to simply drop memory_order_consume from the standard, as it is the memory order that they most love to hate. Getting them to agree to any sort of peep-hole optimization semantics for memory_order_consume is likely problematic. In fact, realistically, the *only* thing you need to actually care about for the intended use case of consume
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 20, 2014 at 06:26:08PM +0100, Torvald Riegel wrote: On Wed, 2014-02-19 at 20:01 -0800, Paul E. McKenney wrote: On Wed, Feb 19, 2014 at 04:53:49PM -0800, Linus Torvalds wrote: On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel trie...@redhat.com wrote: On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote: Can you point to it? Because I can find a draft standard, and it sure as hell does *not* contain any clarity of the model. It has a *lot* of verbiage, but it's pretty much impossible to actually understand, even for somebody who really understands memory ordering. http://www.cl.cam.ac.uk/~mjb220/n3132.pdf This has an explanation of the model up front, and then the detailed formulae in Section 6. This is from 2010, and there might have been smaller changes since then, but I'm not aware of any bigger ones. Ahh, this is different from what others pointed at. Same people, similar name, but not the same paper. I will read this version too, but from reading the other one and the standard in parallel and trying to make sense of it, it seems that I may have originally misunderstood part of the whole control dependency chain. The fact that the left side of ? :, && and || breaks data dependencies made me originally think that the standard tried very hard to break any control dependencies. Which I felt was insane, when then some of the examples literally were about the testing of the value of an atomic read. The data dependency matters quite a bit. The fact that the other Mathematical paper then very much talked about consume only in the sense of following a pointer made me think so even more. 
But reading it some more, I now think that the whole data dependency logic (which is where the special left-hand side rule of the ternary and logical operators come in) are basically an exception to the rule that sequence points end up being also meaningful for ordering (ok, so C11 seems to have renamed sequence points to sequenced before). So while an expression like atomic_read(p, consume) ? a : b; doesn't have a data dependency from the atomic read that forces serialization, writing if (atomic_read(p, consume)) a; else b; the standard *does* imply that the atomic read is happens-before wrt a, and I'm hoping that there is no question that the control dependency still acts as an ordering point. The control dependency should order subsequent stores, at least assuming that a and b don't start off with identical stores that the compiler could pull out of the if and merge. The same might also be true for ?: for all I know. (But see below) I don't think this is quite true. I agree that a conditional store will not be executed speculatively (note that if it would happen in both the then and the else branch, it's not conditional); so, the store in a; (assuming it would be a store) won't happen unless the thread can really observe a true value for p. However, this is *this thread's* view of the world, but not guaranteed to constrain how any other thread sees the state. mo_consume does not contribute to inter-thread-happens-before in the same way that mo_acquire does (which *does* put a constraint on i-t-h-b, and thus enforces a global constraint that all threads have to respect). Is it clear which distinction I'm trying to show here? If you are saying that the control dependencies are a result of a combination of the standard and the properties of the hardware that Linux runs on, I am with you. (As opposed to control dependencies being a result solely of the standard.) This was a deliberate decision in 2007 or so. 
At that time, the documentation on CPU memory orderings were pretty crude, and it was not clear that all relevant hardware respected control dependencies. Back then, if you wanted an authoritative answer even to a fairly simple memory-ordering question, you had to find a hardware architect, and you probably waited weeks or even months for the answer. Thanks to lots of work from the Cambridge guys at about the time that the standard was finalized, we have a much better picture of what the hardware does. That said, in this case, you could substitute relaxed for consume and get the same effect. The return value from atomic_read() gets absorbed into the if condition, so there is no dependency-ordered-before relationship, so nothing for consume to do. One caution... The happens-before relationship requires you to trace a full path between the two operations of interest. This is illustrated by the following example, with both x and y initially zero: T1: atomic_store_explicit(x, 1, memory_order_relaxed); r1 = atomic_load_explicit(y
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 20, 2014 at 06:54:06PM +0100, Torvald Riegel wrote: On Thu, 2014-02-20 at 00:30 -0800, Paul E. McKenney wrote: Well, all the compilers currently convert consume to acquire, so you have your wish there. Of course, that also means that they generate actual unneeded memory-barrier instructions, which seems extremely sub-optimal to me. GCC doesn't currently, but it also doesn't seem to track the dependencies, but that's a bug: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448 Ah, cool! Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 20, 2014 at 09:34:57AM -0800, Linus Torvalds wrote: On Thu, Feb 20, 2014 at 9:14 AM, Torvald Riegel trie...@redhat.com wrote: So the clarification is basically to the statement that the if (consume(p)) a version *would* have an ordering guarantee between the read of p and a, but the consume(p) ? a : b would *not* have such an ordering guarantee. Yes? Not as I understand it. If my reply above wasn't clear, let me know and I'll try to rephrase it into something that is. Yeah, so you and Paul agree. And as I mentioned in the email that crossed with yours, I think that means that the standard is overly complex, hard to understand, fragile, and all *pointlessly* so. Btw, there are many ways that use a consume as an input to a conditional can happen. In particular, even if the result is actually *used* like a pointer as far as the programmer is concerned, tricks like pointer compression etc can well mean that the pointer is actually at least partly implemented using conditionals, so that some paths end up being only dependent through a comparison of the pointer value. So I very much did *not* want to complicate the litmus test code snippet when Paul tried to make it more complex, but I do think that there are cases where code that looks like pure pointer chasing actually is not for some cases, and then can become basically that litmus test for some path. Just to give you an example: in simple list handling it is not at all unusual to have a special node that is discovered by comparing the address, not by just loading from the pointer and following the list itself. Examples of that would be a HEAD node in a doubly linked list (Linux uses this concept quite widely, it's our default list implementation), or it could be a place-marker (cursor entry) in the list for safe traversal in the presence of concurrent deletes etc. Yep, good example. And also an example where the comparison does not need to create any particular ordering. 
And obviously there is the already much earlier mentioned compiler-induced compare, due to value speculation, that can basically create such sequences even where they did not originally exist in the source code itself. So even if you work with pointer dereferences, and thus match that particular consume pattern, I really don't see why anybody would think that "hey, we can ignore any control dependencies" is a good idea. It's a *TERRIBLE* idea. And as mentioned, it's a terrible idea with no upsides. It doesn't help compiler optimizations for the case it's *intended* to help, since those optimizations can still be done without the horribly broken semantics. It doesn't help compiler writers, it just confuses them. And it sure as hell doesn't help users. Yep. And in 2014 we know a lot more about the hardware, so could make a reasonable proposal. In contrast, back in 2007 and 2008 memory ordering was much more of a dark art, and proposals for control dependencies therefore didn't get very far. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 20, 2014 at 10:32:51AM -0800, Linus Torvalds wrote: On Thu, Feb 20, 2014 at 10:11 AM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: You really need that consume to be acquire. So I think we now all agree that that is what the standard is saying. And I'm saying that that is wrong, that the standard is badly written, and should be fixed. Because before the standard is fixed, I claim that consume is unusable. We cannot trust it. End of story. We get exactly those same issues with control dependencies. The example gcc breakage was something like this: i = atomic_load(idx, memory_order_consume); x = array[0 + i - i]; Then gcc optimized this to: i = atomic_load(idx, memory_order_consume); x = array[0]; This same issue would hit control dependencies. You are free to argue that this is the fault of ARM and PowerPC memory ordering, but the fact remains that your suggested change has -exactly- the same vulnerability as memory_order_consume currently has. The fact that apparently gcc is currently buggy because it got the dependency calculations *wrong* just reinforces my point. The gcc bug Torvald pointed at is exactly because the current C standard is illogical unreadable CRAP. I can guarantee that what happened is: - the compiler saw that the result of the read was used as the left hand expression of the ternary ? : operator - as a result, the compiler decided that there's no dependency - the compiler didn't think about the dependency that comes from the result of the load *also* being used as the middle part of the ternary expression, because it had optimized it away, despite the standard not talking about that at all. - so the compiler never saw the dependency that the standard talks about No, the dependency was in a cancelling arithmetic expression as shown above, so that gcc optimized the dependency away. Then the ordering was lost on AARCH64. http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448 BECAUSE THE STANDARD LANGUAGE IS PURE AND UTTER SHIT. 
My suggested language never had any of these problems, because *my* suggested semantics are clear, logical, and don't have these kinds of idiotic pit-falls. Solution: Fix the f*cking C standard. No excuses, no explanations. Just get it fixed. I agree that the standard needs help, but your suggested fix has the same problems as shown in the bugzilla. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 20, 2014 at 07:44:32PM +0100, Torvald Riegel wrote: On Thu, 2014-02-20 at 10:11 -0800, Paul E. McKenney wrote: But yes, the compiler guys would be extremely happy to simply drop memory_order_consume from the standard, as it is the memory order that they most love to hate. Getting them to agree to any sort of peep-hole optimization semantics for memory_order_consume is likely problematic. I wouldn't be so pessimistic about that. If the transformations can be shown to be always correct in terms of the semantics specified in the standard, and if the performance win is sufficiently large, why not? Of course, somebody has to volunteer to actually implement it :) I guess that there is only one way to find out. ;-) Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 20, 2014 at 11:45:29AM -0800, Linus Torvalds wrote: On Thu, Feb 20, 2014 at 10:56 AM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: The example gcc breakage was something like this: i = atomic_load(idx, memory_order_consume); x = array[0 + i - i]; Then gcc optimized this to: i = atomic_load(idx, memory_order_consume); x = array[0]; This same issue would hit control dependencies. You are free to argue that this is the fault of ARM and PowerPC memory ordering, but the fact remains that your suggested change has -exactly- the same vulnerability as memory_order_consume currently has. No it does not, for two reasons, first the legalistic (and bad) reason: As I actually described it, the consume becomes an acquire by default. If it's not used as an address to the dependent load, then it's an acquire. The use going away in no way makes the acquire go away in my simplistic model. So the compiler would actually translate that to a load-with-acquire, not be able to remove the acquire, and we have end of story. The actual code generation would be that ld + sync + ld on powerpc, or ld.acq on ARM. Now, the reason I claim that reason was legalistic and bad is that it's actually a cop-out, and if you had made the example be something like this: p = atomic_load(ptr, memory_order_consume); x = array[0 + p - p]; y = p->val; then yes, I actually think that the order of loads of 'x' and 'p' are not enforced by the consume. The only case that is clear is the order of 'y' and 'p', because that is the only one that really *USES* the value. The use of +p-p is syntactic bullshit. It's very obvious to even a slightly developmentally challenged hedgehog that +p-p doesn't have any actual *semantic* meaning, it's purely syntactic. And the syntactic meaning is meaningless and doesn't matter. Because I would just get rid of the whole dependency chain language ALTOGETHER. So in fact, in my world, I would consider your example to be a non-issue.
In my world, there _is_ no dependency chain at a syntactic level. In my SANE world, none of that insane crap language exists. That language is made-up and tied to syntax exactly because it *cannot* be tied to semantics. In my sane world, consume has a much simpler meaning, and has no legalistic syntactic meaning: only real use matters. If the value can be optimized away, then so can the barrier, and so can the whole load. The value isn't consumed, so it has no meaning. So if you write i = atomic_load(idx, memory_order_consume); x = array[0+i-i]; then in my world that +i-i is meaningless. It's semantic fluff, and while my naive explanation would have left it as an acquire (because it cannot be peep-holed away), I actually do believe that the compiler should be obviously allowed to optimize the load away entirely since it's meaningless, and if no use of 'i' remains, then it has no consumer, and so there is no dependency. Put another way: consume is not about getting a lock, it's about getting a *value*. Only the dependency on the *value* matters, and if the value is optimized away, there is no dependency. And the value itself does not have any semantics. There's nothing volatile about the use of the value that would mean that the compiler cannot re-order it or remove it entirely. There's no barrier carried around by the value per se. The barrier is between the load and use. That's the *point* of consume after all. The whole chain of dependency language is pointless. It's wrong. It's complicated, it is illogical, and it causes subtle problems exactly because it got tied to the language *syntax* rather than to any logical use. Don't try to re-introduce the whole issue. It was a mistake for the C standard to talk about dependencies in the first place, exactly because it results in these idiotic legalistic practices.
You do realize that that whole *(q+flag-flag) example in the bugzilla comes from the fact that the programmer tried to *fight* the fact that the C standard got the control dependency wrong? In other words, the *deepest* reason for that bugzilla is that the programmer tried to force the logical dependency by rewriting it as a (fake, and easily optimizable) data dependency. In *my* world, the stupid data-vs-control dependency thing goes away, the test of the value itself is a use of it, and *p ? *q :0 just does the right thing, there's no reason to do that q+flag-flag thing in the first place, and if you do, the compiler *should* just ignore your little games. Linus, given that you are calling me out for pushing legalistic and bad things, syntactic bullshit, and playing little games, I am forced to conclude that you have never attended any sort of standards-committee meeting. ;-) That said, I am fine with pushing control/data dependencies with this general approach
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, Feb 19, 2014 at 11:59:08AM +0100, Torvald Riegel wrote: On Tue, 2014-02-18 at 14:58 -0800, Paul E. McKenney wrote: On Tue, Feb 18, 2014 at 10:40:15PM +0100, Torvald Riegel wrote: On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote: On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote: On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel trie...@redhat.com wrote: On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: And exactly because I know enough, I would *really* like atomics to be well-defined, and have very clear - and *local* - rules about how they can be combined and optimized. Local? Yes. So I think that one of the big advantages of atomics over volatile is that they *can* be optimized, and as such I'm not at all against trying to generate much better code than for volatile accesses. But at the same time, that can go too far. For example, one of the things we'd want to use atomics for is page table accesses, where it is very important that we don't generate multiple accesses to the values, because parts of the values can be changed *by hardware* (ie accessed and dirty bits). So imagine that you have some clever global optimizer that sees that the program never ever actually sets the dirty bit at all in any thread, and then uses that kind of non-local knowledge to make optimization decisions. THAT WOULD BE BAD. Might as well list other reasons why value proofs via whole-program analysis are unreliable for the Linux kernel: 1. As Linus said, changes from hardware. This is what volatile is for, right? (Or the weak-volatile idea I mentioned). Compilers won't be able to prove something about the values of such variables, if marked (weak-)volatile. Yep. 2. Assembly code that is not visible to the compiler. Inline asms will -normally- let the compiler know what memory they change, but some just use the "memory" tag.
Worse yet, I suspect that most compilers don't look all that carefully at .S files. Any number of other programs contain assembly files. Are the annotations of changed memory really a problem? If the "memory" tag exists, isn't that supposed to mean all memory? To make a proof about a program for location X, the compiler has to analyze all uses of X. Thus, as soon as X escapes into an .S file, then the compiler will simply not be able to prove a thing (except maybe due to the data-race-free requirement for non-atomics). The attempt to prove something isn't unreliable, simply because a correct compiler won't claim to be able to prove something. I am indeed less worried about inline assembler than I am about files full of assembly. Or files full of other languages. One thing that could corrupt this is if the program addresses objects other than through the mechanisms defined in the language. For example, if one thread lays out a data structure at a constant fixed memory address, and another one then uses the fixed memory address to get access to the object with a cast (e.g., (void*)0x123). Or if the program uses gcc linker scripts to get the same effect. 3. Kernel modules that have not yet been written. Now, the compiler could refrain from trying to prove anything about an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there is currently no way to communicate this information to the compiler other than marking the variable volatile. Even if the variable is just externally accessible, then the compiler knows that it can't do whole-program analysis about it. It is true that whole-program analysis will not be applicable in this case, but it will not be unreliable. I think that's an important difference. Let me make sure that I understand what you are saying. If my program has "extern int foo;", the compiler will refrain from doing whole-program analysis involving foo? Yes.
If it can't be sure to actually have the whole program available, it can't do whole-program analysis, right? Things like the linker scripts you mention or other stuff outside of the language semantics complicates this somewhat, and maybe some compilers assume too much. There's also the point that data-race-freedom is required for non-atomics even if those are shared with non-C-code. But except those corner cases, a compiler sees whether something escapes and becomes visible/accessible to other entities. The traditional response to "except those corner cases" is of course "Murphy was an optimist." ;-) That said, point taken -- you
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, Feb 19, 2014 at 06:55:51PM +0100, Torvald Riegel wrote: On Wed, 2014-02-19 at 07:14 -0800, Paul E. McKenney wrote: On Wed, Feb 19, 2014 at 11:59:08AM +0100, Torvald Riegel wrote: [ . . . ] On both sides, the compiler will see that mmap() (or similar) is called, so that means the data escapes to something unknown, which could create threads and so on. So first, it can't do whole-program analysis for this state anymore, and has to assume that other C11 threads are accessing this memory. Next, lock-free atomics are specified to be address-free, meaning that they must work independent of where in memory the atomics are mapped (see C++ (e.g., N3690) 29.4p3; that's a "should" and non-normative, but essential IMO). Thus, this then boils down to just a simple case of synchronization. (Of course, the rest of the ABI has to match too for the data exchange to work.) The compiler will see mmap() on the user side, but not on the kernel side. On the kernel side, something special is required. Maybe -- you'll certainly know better :) But maybe it's not that hard: For example, if the memory is in current code made available to userspace via calling some function with an asm implementation that the compiler can't analyze, then this should be sufficient. The kernel code would need to explicitly tell the compiler what portions of the kernel address space were covered by this. I would not want the compiler to have to work it out based on observing interactions with the page tables. ;-) Agree that address-free would be nice as "shall" rather than "should". I echo Peter's question about how one tags functions like mmap(). I will also remember this for the next time someone on the committee discounts volatile. ;-) 5. JITed code produced based on BPF: https://lwn.net/Articles/437981/ This might be special, or not, depending on how the JITed code gets access to data. If this is via fixed addresses (e.g., (void*)0x123), then see above.
If this is through function calls that the compiler can't analyze, then this is like 4. It could well be via the kernel reading its own symbol table, sort of a poor-person's reflection facility. I guess that would be for all intents and purposes equivalent to your (void*)0x123. If it is replacing code generated by the compiler, then yes. If the JIT is just filling in functions that had been undefined yet declared before, then the compiler will have seen the data escape through the function interfaces, and should be aware that there is other stuff. So one other concern would then be things like ftrace, kprobes, ksplice, and so on. These rewrite the kernel binary at runtime, though in very limited ways. Yes. Nonetheless, I wouldn't see a problem if they, say, rewrite with C11-compatible code (and same ABI) on a function granularity (and when the function itself isn't executing concurrently) -- this seems to be similar to just having another compiler compile this particular function. Well, they aren't using C11-compatible code yet. They do patch within functions. And in some cases, they make staged sequences of changes to allow the patching to happen concurrently with other CPUs executing the code being patched. Not sure that any of the latter is actually in the kernel at the moment, but it has at least been prototyped and discussed. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, Feb 19, 2014 at 04:53:49PM -0800, Linus Torvalds wrote: On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel trie...@redhat.com wrote: On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote: Can you point to it? Because I can find a draft standard, and it sure as hell does *not* contain any clarity of the model. It has a *lot* of verbiage, but it's pretty much impossible to actually understand, even for somebody who really understands memory ordering. http://www.cl.cam.ac.uk/~mjb220/n3132.pdf This has an explanation of the model up front, and then the detailed formulae in Section 6. This is from 2010, and there might have been smaller changes since then, but I'm not aware of any bigger ones. Ahh, this is different from what others pointed at. Same people, similar name, but not the same paper. I will read this version too, but from reading the other one and the standard in parallel and trying to make sense of it, it seems that I may have originally misunderstood part of the whole control dependency chain. The fact that the left side of ? :, and || breaks data dependencies made me originally think that the standard tried very hard to break any control dependencies. Which I felt was insane, when then some of the examples literally were about the testing of the value of an atomic read. The data dependency matters quite a bit. The fact that the other Mathematical paper then very much talked about consume only in the sense of following a pointer made me think so even more. But reading it some more, I now think that the whole data dependency logic (which is where the special left-hand side rule of the ternary and logical operators come in) are basically an exception to the rule that sequence points end up being also meaningful for ordering (ok, so C11 seems to have renamed sequence points to sequenced before). So while an expression like atomic_read(p, consume) ? 
a : b; doesn't have a data dependency from the atomic read that forces serialization, writing if (atomic_read(p, consume)) a; else b; the standard *does* imply that the atomic read is happens-before wrt a, and I'm hoping that there is no question that the control dependency still acts as an ordering point. The control dependency should order subsequent stores, at least assuming that a and b don't start off with identical stores that the compiler could pull out of the if and merge. The same might also be true for ?: for all I know. (But see below) That said, in this case, you could substitute relaxed for consume and get the same effect. The return value from atomic_read() gets absorbed into the if condition, so there is no dependency-ordered-before relationship, so nothing for consume to do. One caution... The happens-before relationship requires you to trace a full path between the two operations of interest. This is illustrated by the following example, with both x and y initially zero: T1: atomic_store_explicit(&x, 1, memory_order_relaxed); r1 = atomic_load_explicit(&y, memory_order_relaxed); T2: atomic_store_explicit(&y, 1, memory_order_relaxed); r2 = atomic_load_explicit(&x, memory_order_relaxed); There is a happens-before relationship between T1's store and load, and another happens-before relationship between T2's store and load, but there is no happens-before relationship from T1 to T2, and none in the other direction, either. And you don't get to assume any ordering based on reasoning about these two disjoint happens-before relationships. So it is quite possible for r1 == 0 && r2 == 0 after both threads complete. Which should be no surprise: This misordering can happen even on x86, which would need a full smp_mb() to prevent it.
THAT was one of my big confusions, the discussion about control dependencies and the fact that the logical ops broke the data dependency made me believe that the standard tried to actively avoid the whole issue with "control dependencies can break ordering dependencies on some CPUs due to branch prediction and memory re-ordering by the CPU". But after all the reading, I'm starting to think that that was never actually the implication at all, and the "logical ops break the data dependency" rule is simply an exception to the sequence point rule. All other sequence points still do exist, and do imply an ordering that matters for consume. Am I now reading it right? As long as there is an unbroken chain of -data- dependencies from the consume to the later access in question, and as long as that chain doesn't go through the excluded operations, yes. So the clarification is basically to the statement that the "if (consume(p)) a" version *would* have an ordering guarantee between the read of p and a, but the "consume(p) ? a : b" version would *not* have such an ordering guarantee. Yes? Neither has a data-dependency guarantee, because there is no data dependency from the load to either a or b.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 12:12:06PM +, Peter Sewell wrote: Several of you have said that the standard and compiler should not permit speculative writes of atomics, or (effectively) that the compiler should preserve dependencies. In simple examples it's easy to see what that means, but in general it's not so clear what the language should guarantee, because dependencies may go via non-atomic code in other compilation units, and we have to consider the extent to which it's desirable to limit optimisation there. For example, suppose we have, in one compilation unit: void f(int ra, int*rb) { if (ra==42) *rb=42; else *rb=42; } Hello, Peter! Nice example! The relevant portion of Documentation/memory-barriers.txt in my -rcu tree says the following about the control dependency in the above construct: q = ACCESS_ONCE(a); if (q) { barrier(); ACCESS_ONCE(b) = p; do_something(); } else { barrier(); ACCESS_ONCE(b) = p; do_something_else(); } The initial ACCESS_ONCE() is required to prevent the compiler from proving the value of 'a', and the pair of barrier() invocations are required to prevent the compiler from pulling the two identical stores to 'b' out from the legs of the if statement. So yes, current compilers need significant help if it is necessary to maintain dependencies in that sort of code. Similar examples came up in the data-dependency discussions in the standards committee, which led to the [[carries_dependency]] attribute for C11 and C++11. Of course, current compilers don't have this attribute, and the current Linux kernel code doesn't have any other marking for data dependencies passing across function boundaries. (Maybe some time as an assist for detecting pointer leaks out of RCU read-side critical sections, but efforts along those lines are a bit stalled at the moment.) More on data dependencies below... 
and in another compilation unit the bodies of two threads: // Thread 0 r1 = x; f(r1,&r2); y = r2; // Thread 1 r3 = y; f(r3,&r4); x = r4; where accesses to x and y are annotated C11 atomic memory_order_relaxed or Linux ACCESS_ONCE(), accesses to r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0. (Of course, this is an artificial example, to make the point below as simply as possible - in real code the branches of the conditional might not be syntactically identical, just equivalent after macro expansion and other optimisation.) In the source program there's a dependency from the read of x to the write of y in Thread 0, and from the read of y to the write of x on Thread 1. Dependency-respecting compilation would preserve those and the ARM and POWER architectures both respect them, so the reads of x and y could not give 42. But a compiler might well optimise the (non-atomic) body of f() to just *rb=42, making the threads effectively // Thread 0 r1 = x; y = 42; // Thread 1 r3 = y; x = 42; (GCC does this at O1, O2, and O3) and the ARM and POWER architectures permit those two reads to see 42. That is moreover actually observable on current ARM hardware. I do agree that this could happen on current compilers and hardware. Agreed, but as Peter Zijlstra noted in this thread, this optimization is to a control dependency, not a data dependency. So as far as we can see, either: 1) if you can accept the latter behaviour (if the Linux codebase does not rely on its absence), the language definition should permit it, and current compiler optimisations can be used, or 2) otherwise, the language definition should prohibit it but the compiler would have to preserve dependencies even in compilation units that have no mention of atomics. It's unclear what the (runtime and compiler development) cost of that would be in practice - perhaps Torvald could comment?
For current compilers, we have to rely on coding conventions within the Linux kernel in combination with non-standard extensions to gcc and specified compiler flags to disable undesirable behavior. I have a start on specifying this in a document I am preparing for the standards committee, a very early draft of which may be found here: http://www2.rdrop.com/users/paulmck/scalability/paper/consume.2014.02.16c.pdf Section 3 shows the results of a manual scan through the Linux kernel's dependency chains, and Section 4.1 lists a probably incomplete (and no doubt erroneous) list of coding standards required to make dependency chains work on current compilers. Any comments and suggestions are more than welcome! For more context, this example is taken from a summary of
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 03:33:35PM +, Peter Sewell wrote: Hi Paul, On 18 February 2014 14:56, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Tue, Feb 18, 2014 at 12:12:06PM +, Peter Sewell wrote: Several of you have said that the standard and compiler should not permit speculative writes of atomics, or (effectively) that the compiler should preserve dependencies. In simple examples it's easy to see what that means, but in general it's not so clear what the language should guarantee, because dependencies may go via non-atomic code in other compilation units, and we have to consider the extent to which it's desirable to limit optimisation there. For example, suppose we have, in one compilation unit: void f(int ra, int*rb) { if (ra==42) *rb=42; else *rb=42; } Hello, Peter! Nice example! The relevant portion of Documentation/memory-barriers.txt in my -rcu tree says the following about the control dependency in the above construct: q = ACCESS_ONCE(a); if (q) { barrier(); ACCESS_ONCE(b) = p; do_something(); } else { barrier(); ACCESS_ONCE(b) = p; do_something_else(); } The initial ACCESS_ONCE() is required to prevent the compiler from proving the value of 'a', and the pair of barrier() invocations are required to prevent the compiler from pulling the two identical stores to 'b' out from the legs of the if statement. thanks So yes, current compilers need significant help if it is necessary to maintain dependencies in that sort of code. Similar examples came up in the data-dependency discussions in the standards committee, which led to the [[carries_dependency]] attribute for C11 and C++11. Of course, current compilers don't have this attribute, and the current Linux kernel code doesn't have any other marking for data dependencies passing across function boundaries. (Maybe some time as an assist for detecting pointer leaks out of RCU read-side critical sections, but efforts along those lines are a bit stalled at the moment.) More on data dependencies below... 
and in another compilation unit the bodies of two threads: // Thread 0 r1 = x; f(r1,&r2); y = r2; // Thread 1 r3 = y; f(r3,&r4); x = r4; where accesses to x and y are annotated C11 atomic memory_order_relaxed or Linux ACCESS_ONCE(), accesses to r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0. (Of course, this is an artificial example, to make the point below as simply as possible - in real code the branches of the conditional might not be syntactically identical, just equivalent after macro expansion and other optimisation.) In the source program there's a dependency from the read of x to the write of y in Thread 0, and from the read of y to the write of x on Thread 1. Dependency-respecting compilation would preserve those and the ARM and POWER architectures both respect them, so the reads of x and y could not give 42. But a compiler might well optimise the (non-atomic) body of f() to just *rb=42, making the threads effectively // Thread 0 r1 = x; y = 42; // Thread 1 r3 = y; x = 42; (GCC does this at O1, O2, and O3) and the ARM and POWER architectures permit those two reads to see 42. That is moreover actually observable on current ARM hardware. I do agree that this could happen on current compilers and hardware. Agreed, but as Peter Zijlstra noted in this thread, this optimization is to a control dependency, not a data dependency. Indeed. In principle (again as Hans has observed) a compiler might well convert between the two, e.g. if operating on single-bit values, or where value-range analysis has shown that a variable can only contain one of a small set of values. I don't know whether that happens in practice? Then there are also cases where a compiler is very likely to remove data/address dependencies, eg if some constant C is #define'd to be 0 then an array access indexed by x * C will have the dependency on x removed. The runtime and compiler development costs of preventing that are also unclear to me.
Given that, whether it's reasonable to treat control and data dependencies differently seems to be an open question. Here is another (admittedly fanciful and probably buggy) implementation of f() that relies on data dependencies (according to C11 and C++11), but which could not be relied on to preserve those data dependencies given current pre-C11 compilers: int arr[2] = { 42, 43 }; int *bigarr; int f
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 04:56:40PM +0100, Torvald Riegel wrote: On Mon, 2014-02-17 at 19:00 -0800, Paul E. McKenney wrote: On Mon, Feb 17, 2014 at 12:18:21PM -0800, Linus Torvalds wrote: On Mon, Feb 17, 2014 at 11:55 AM, Torvald Riegel trie...@redhat.com wrote: Which example do you have in mind here? Haven't we resolved all the debated examples, or did I miss any? Well, Paul seems to still think that the standard possibly allows speculative writes or possibly value speculation in ways that break the hardware-guaranteed orderings. It is not that I know of any specific problems, but rather that I know I haven't looked under all the rocks. Plus my impression from my few years on the committee is that the standard will be pushed to the limit when it comes time to add optimizations. One example that I learned about last week uses the branch-prediction hardware to validate value speculation. And no, I am not at all a fan of value speculation, in case you were curious. However, it is still an educational example. This is where you start: p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */ do_something(p->a, p->b, p->c); p->d = 1; I assume that's the source code. Yep! Then you leverage branch-prediction hardware as follows: p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */ if (p == GUESS) { do_something(GUESS->a, GUESS->b, GUESS->c); GUESS->d = 1; } else { do_something(p->a, p->b, p->c); p->d = 1; } I assume that this is a potential transformation by a compiler. Again, yep! The CPU's branch-prediction hardware squashes speculation in the case where the guess was wrong, and this prevents the speculative store to ->d from ever being visible. However, the then-clause breaks dependencies, which means that the loads -could- be speculated, so that do_something() gets passed pre-initialization values. Now, I hope and expect that the wording in the standard about dependency ordering prohibits this sort of thing. But I do not yet know for certain.
The transformation would be incorrect. p->a in the source code carries a dependency, and as you say, the transformed code wouldn't have that dependency any more. So the transformed code would lose ordering constraints that it has in the virtual machine, so in the absence of other proofs of correctness based on properties not shown in the example, the transformed code would not result in the same behavior as allowed by the abstract machine. Glad that you agree! ;-) If the transformation would actually be by a programmer, then this wouldn't do the same as the first example because mo_consume doesn't work through the if statement. Agreed. Are there other specified concerns that you have regarding this example? Nope. Just generalized paranoia. (But just because I am paranoid doesn't mean that there isn't a bug lurking somewhere in the standard, the compiler, the kernel, or my own head!) I will likely have more once I start mapping Linux kernel atomics to the C11 standard. One more paper past N3934 comes first, though. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 04:38:40PM +0100, Torvald Riegel wrote: On Mon, 2014-02-17 at 16:18 -0800, Linus Torvalds wrote: On Mon, Feb 17, 2014 at 3:41 PM, Torvald Riegel trie...@redhat.com wrote: There's an underlying problem here that's independent from the actual instance that you're worried about here: "no sense" is ultimately a matter of taste/objectives/priorities as long as the respective specification is logically consistent. Yes. But I don't think it's independent. Exactly *because* some people will read standards without applying "does the resulting code generation actually make sense for the programmer that wrote the code", the standard has to be pretty clear. The standard often *isn't* pretty clear. It wasn't clear enough when it came to volatile, and yet that was a *much* simpler concept than atomic accesses and memory ordering. And most of the time it's not a big deal. But because the C standard generally tries to be very portable, and cover different machines, there tends to be a mindset that anything inherently unportable is undefined or implementation defined, and then the compiler writer is basically given free rein to do anything they want (with implementation defined at least requiring that it is reliably the same thing). Yes, that's how it works in general. And this makes sense, because all optimizations rely on that. Second, you can't keep something consistent (eg, between compilers) if it isn't specified. So if we want stricter rules, those need to be specified somewhere. And when it comes to memory ordering, *everything* is basically non-portable, because different CPU's very much have different rules. Well, the current set of memory orders (and the memory model as a whole) is portable, even though it might not allow to exploit all hardware properties, and thus might perform sub-optimally in some cases.
I worry that that means that the standard then takes the stance that well, compiler re-ordering is no worse than CPU re-ordering, so we let the compiler do anything. And then we have to either add volatile to make sure the compiler doesn't do that, or use an overly strict memory model at the compiler level that makes it all pointless. Using volatile is not a good option, I think, because synchronization between threads should be orthogonal to observable output of the abstract machine. Are you thinking of volatile -instead- of atomics? My belief is that given the current standard there will be times that we need to use volatile -in- -addition- to atomics. The current memory model might not allow to exploit all hardware properties, I agree. But then why don't we work on how to extend it to do so? We need to specify the behavior we want anyway, and this can't be independent of the language semantics, so it has to be conceptually integrated with the standard anyway. So I really really hope that the standard doesn't give compiler writers free hands to do anything that they can prove is equivalent in the virtual C machine model. It does, but it also doesn't mean this can't be extended. So let's focus on whether we can find an extension. That's not how you get reliable results. In this general form, that's obviously a false claim. These two sentences starkly illustrate the difference in perspective between you two. You are talking past each other. Not sure how to fix this at the moment, but what else is new? ;-) Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote: On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel trie...@redhat.com wrote: On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: And exactly because I know enough, I would *really* like atomics to be well-defined, and have very clear - and *local* - rules about how they can be combined and optimized. Local? Yes. So I think that one of the big advantages of atomics over volatile is that they *can* be optimized, and as such I'm not at all against trying to generate much better code than for volatile accesses. But at the same time, that can go too far. For example, one of the things we'd want to use atomics for is page table accesses, where it is very important that we don't generate multiple accesses to the values, because parts of the values can be changed *by* *hardware* (i.e., accessed and dirty bits). So imagine that you have some clever global optimizer that sees that the program never ever actually sets the dirty bit at all in any thread, and then uses that kind of non-local knowledge to make optimization decisions. THAT WOULD BE BAD. Might as well list other reasons why value proofs via whole-program analysis are unreliable for the Linux kernel: 1. As Linus said, changes from hardware. 2. Assembly code that is not visible to the compiler. Inline asms will -normally- let the compiler know what memory they change, but some just use the "memory" tag. Worse yet, I suspect that most compilers don't look all that carefully at .S files. Any number of other programs contain assembly files. 3. Kernel modules that have not yet been written. Now, the compiler could refrain from trying to prove anything about an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there is currently no way to communicate this information to the compiler other than marking the variable volatile. Other programs have similar issues, e.g., via dlopen(). 4. Some drivers allow user-mode code to mmap() some of their state. 
Any changes undertaken by the user-mode code would be invisible to the compiler. 5. JITed code produced based on BPF: https://lwn.net/Articles/437981/ And probably other stuff as well. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 03:16:33PM +0000, Mark Batty wrote: Hi Paul, Thanks for the document. I'm looking forward to reading the bits about dependency chains in Linux. And I am looking forward to your thoughts on those bits! One point of confusion for me... Example 4 says "language must allow". Shouldn't that be "language is permitted to allow"? When we say "allow", we mean that the optimised execution should be allowed by the specification, and, implicitly, the unoptimised execution should remain allowed too. We want to be concrete about what the language specification allows, and that's why we say "must". It is not to disallow the unoptimised execution. OK, got it! Thanx, Paul Seems like an implementation is always within its rights to avoid an optimization if its implementation prevents it from safely detecting the opportunity for that optimization. That's right. - Mark Or am I missing something here? Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 09:44:48AM -0800, Linus Torvalds wrote: On Tue, Feb 18, 2014 at 8:17 AM, Torvald Riegel trie...@redhat.com wrote: [ . . . ] The standard is clear on what's required. I strongly suggest reading the formalization of the memory model by Batty et al. Can you point to it? Because I can find a draft standard, and it sure as hell does *not* contain any clarity of the model. It has a *lot* of verbiage, but it's pretty much impossible to actually understand, even for somebody who really understands memory ordering. I suspect he is thinking of the following: Mathematizing C++ Concurrency. Mark Batty, Scott Owens, Susmit Sarkar, Peter Sewell, and Tjark Weber. https://www.cl.cam.ac.uk/~pes20/cpp/popl085ap-sewell.pdf Even if you don't like the math, it contains some very good examples. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 06:23:47PM +0000, Peter Sewell wrote: On 18 February 2014 17:16, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote: [ . . . ] Might as well list other reasons why value proofs via whole-program analysis are unreliable for the Linux kernel: 1. As Linus said, changes from hardware. 2. Assembly code that is not visible to the compiler. Inline asms will -normally- let the compiler know what memory they change, but some just use the "memory" tag. Worse yet, I suspect that most compilers don't look all that carefully at .S files. Any number of other programs contain assembly files. 3. Kernel modules that have not yet been written. Now, the compiler could refrain from trying to prove anything about an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there is currently no way to communicate this information to the compiler other than marking the variable volatile. 
Other programs have similar issues, e.g., via dlopen(). 4. Some drivers allow user-mode code to mmap() some of their state. Any changes undertaken by the user-mode code would be invisible to the compiler. 5. JITed code produced based on BPF: https://lwn.net/Articles/437981/ And probably other stuff as well. interesting list. So are you saying that value-range-analysis and such-like (I say glibly, without really knowing what such-like refers to here) are fundamentally incompatible with the kernel code, or can you think of some way to tell the compiler a bound on the footprint of the unseen changes in each of those cases? Other than the volatile keyword, no. Well, I suppose you could also create a function that changed the variables in question, then arrange to never call it, but in such a way that the compiler could not prove that it was never called. But ouch! Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 10:49:27AM -0800, Linus Torvalds wrote: On Tue, Feb 18, 2014 at 10:21 AM, Peter Sewell peter.sew...@cl.cam.ac.uk wrote: This is a bit more subtle, because (on ARM and POWER) removing the dependency and conditional branch is actually in general *not* equivalent in the hardware, in a concurrent context. So I agree, but I think that's a generic issue with non-local memory ordering, and is not at all specific to the optimization wrt that x?42:42 expression. If you have a value that you loaded with a non-relaxed load, and you pass that value off to a non-local function that you don't know what it does, in my opinion that implies that the compiler had better add the necessary serialization to say "whatever that other function does, we guarantee the semantics of the load". So on ppc, if you do a load with consume or acquire and then call another function without having had something in the caller that serializes the load, you'd better add the lwsync or whatever before the call. Exactly because the function call itself otherwise basically breaks the visibility into ordering. You've basically turned a load-with-ordering-guarantees into just an integer that you passed off to something that doesn't know about the ordering guarantees - and you need that lwsync in order to still guarantee the ordering. Tough titties. That's what a CPU with weak memory ordering semantics gets in order to have sufficient memory ordering. And that is in fact what C11 compilers are supposed to do if the function doesn't have the [[carries_dependency]] attribute on the corresponding argument or return of the non-local function. If the function is marked with [[carries_dependency]], then the compiler has the information needed in both compilations to make things work correctly. Thanx, Paul And I don't think it's actually a problem in practice. 
If you are doing loads with ordered semantics, you're not going to pass the result off willy-nilly to random functions (or you really *do* require the ordering, because the load that did the acquire was actually for a lock!). So I really think that the local optimization is correct regardless. Linus
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 09:43:31PM +0100, Torvald Riegel wrote: On Tue, 2014-02-18 at 12:12 +0000, Peter Sewell wrote: Several of you have said that the standard and compiler should not permit speculative writes of atomics, or (effectively) that the compiler should preserve dependencies. In simple examples it's easy to see what that means, but in general it's not so clear what the language should guarantee, because dependencies may go via non-atomic code in other compilation units, and we have to consider the extent to which it's desirable to limit optimisation there. [...] 2) otherwise, the language definition should prohibit it but the compiler would have to preserve dependencies even in compilation units that have no mention of atomics. It's unclear what the (runtime and compiler development) cost of that would be in practice - perhaps Torvald could comment? If I'm reading the standard correctly, it requires that data dependencies are preserved through loads and stores, including nonatomic ones. That sounds convenient because it allows programmers to use temporary storage. However, what happens if a dependency arrives at a store for which the alias set isn't completely known? Then we either have to add a barrier to enforce the ordering at this point, or we have to assume that all other potentially aliasing memory locations would also have to start carrying dependencies (which might be in other functions in other compilation units). Neither option is good. The first might introduce barriers in places in which they might not be required (or the programmer has to use kill_dependency() quite often to avoid all these). The second is bad because points-to analysis is hard, so in practice the points-to set will not be precisely known for a lot of pointers. 
So this might not just creep into other functions via calls of [[carries_dependency]] functions, but also through normal loads and stores, likely prohibiting many optimizations. I cannot immediately think of a situation where a store carrying a dependency into a non-trivially aliased object wouldn't be a usage error, so perhaps emitting a barrier and a diagnostic at that point is best. Furthermore, the dependency tracking can currently only be disabled/enabled on a function granularity (via [[carries_dependency]]). Thus, if we have big functions, then dependency tracking may slow down a lot of code in the big function. If we have small functions, there are a lot of attributes to be added. If a function may only carry a dependency but doesn't necessarily (e.g., depending on input parameters), then the programmer has to make a trade-off whether he/she wants to benefit from mo_consume but slow down other calls due to additional barriers (i.e., when this function is called from non-[[carries_dependency]] functions), or vice versa. (IOW, because of the function granularity, other code's performance is affected.) If a compiler wants to implement dependency tracking just for a few constructs (e.g., operators - + ...) and use barriers otherwise, then this decision must be compatible with how all this is handled in other compilation units. Thus, compiler optimizations effectively become part of the ABI, which doesn't seem right. I hope these examples illustrate my concerns about the implementability in practice of this. It's also why I've suggested to move from an opt-out approach as in the current standard (i.e., with kill_dependency()) to an opt-in approach for conservative dependency tracking (e.g., with a preserve_dependencies(exp) call, where exp will not be optimized in a way that removes any dependencies). This wouldn't help with many optimizations being prevented, but it should at least help programmers contain the problem to smaller regions of code. 
I'm not aware of any implementation that tries to track dependencies, so I can't give any real performance numbers. This could perhaps be simulated, but I'm not sure whether a realistic case would be made without at least supporting [[carries_dependency]] properly in the compiler, which would be some work. Another approach would be to use start-tracking/stop-tracking directives that could be buried into rcu_read_lock() and rcu_read_unlock(). There are issues with nesting and conditional use of rcu_read_lock() and rcu_read_unlock(), but it does give you nicer granularity properties. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 10:40:15PM +0100, Torvald Riegel wrote: On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote: On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote: On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel trie...@redhat.com wrote: On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: And exactly because I know enough, I would *really* like atomics to be well-defined, and have very clear - and *local* - rules about how they can be combined and optimized. Local? Yes. So I think that one of the big advantages of atomics over volatile is that they *can* be optimized, and as such I'm not at all against trying to generate much better code than for volatile accesses. But at the same time, that can go too far. For example, one of the things we'd want to use atomics for is page table accesses, where it is very important that we don't generate multiple accesses to the values, because parts of the values can be changed *by* *hardware* (i.e., accessed and dirty bits). So imagine that you have some clever global optimizer that sees that the program never ever actually sets the dirty bit at all in any thread, and then uses that kind of non-local knowledge to make optimization decisions. THAT WOULD BE BAD. Might as well list other reasons why value proofs via whole-program analysis are unreliable for the Linux kernel: 1. As Linus said, changes from hardware. This is what volatile is for, right? (Or the weak-volatile idea I mentioned). Compilers won't be able to prove something about the values of such variables, if marked (weak-)volatile. Yep. 2. Assembly code that is not visible to the compiler. Inline asms will -normally- let the compiler know what memory they change, but some just use the "memory" tag. Worse yet, I suspect that most compilers don't look all that carefully at .S files. Any number of other programs contain assembly files. 
Are the annotations of changed memory really a problem? If the "memory" tag exists, isn't that supposed to mean "all memory"? To make a proof about a program for location X, the compiler has to analyze all uses of X. Thus, as soon as X escapes into an .S file, then the compiler will simply not be able to prove a thing (except maybe due to the data-race-free requirement for non-atomics). The attempt to prove something isn't unreliable, simply because a correct compiler won't claim to be able to prove something. I am indeed less worried about inline assembler than I am about files full of assembly. Or files full of other languages. One reason that could corrupt this is if a program addresses objects other than through the mechanisms defined in the language. For example, if one thread lays out a data structure at a constant fixed memory address, and another one then uses the fixed memory address to get access to the object with a cast (e.g., (void*)0x123). Or if the program uses gcc linker scripts to get the same effect. 3. Kernel modules that have not yet been written. Now, the compiler could refrain from trying to prove anything about an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there is currently no way to communicate this information to the compiler other than marking the variable volatile. Even if the variable is just externally accessible, then the compiler knows that it can't do whole-program analysis about it. It is true that whole-program analysis will not be applicable in this case, but it will not be unreliable. I think that's an important difference. Let me make sure that I understand what you are saying. If my program has "extern int foo;", the compiler will refrain from doing whole-program analysis involving foo? Or to ask it another way, when you say whole-program analysis, are you restricting that analysis to the current translation unit? 
If so, I was probably not the only person thinking that you instead meant analysis across all translation units linked into the program. ;-) Other programs have similar issues, e.g., via dlopen(). 4. Some drivers allow user-mode code to mmap() some of their state. Any changes undertaken by the user-mode code would be invisible to the compiler. A good point, but a compiler that doesn't try to (incorrectly) assume something about the semantics of mmap will simply see that the mmap'ed data will escape to stuff it can't analyze, so it will not be able to make a proof. This is different from, for example, malloc(), which is guaranteed to return fresh nonaliasing memory. As Peter noted, this is the other end of mmap(). The -user- code sees that there is an mmap(), but the kernel code invokes functions that poke values into hardware registers.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, Feb 12, 2014 at 07:12:05PM +0100, Peter Zijlstra wrote: On Wed, Feb 12, 2014 at 09:42:09AM -0800, Paul E. McKenney wrote: You need volatile semantics to force the compiler to ignore any proofs it might otherwise attempt to construct. Hence all the ACCESS_ONCE() calls in my email to Torvald. (Hopefully I translated your example reasonably.) My brain gave out for today; but it did appear to have the right structure. I can relate. ;-) I would prefer it if C11 did not require the volatile casts. It should simply _never_ speculate with atomic writes, volatile or not. I agree with not needing volatiles to prevent speculated writes. However, they will sometimes be needed to prevent excessive load/store combining. The compiler doesn't have the runtime feedback mechanisms that the hardware has, and thus will need help from the developer from time to time. Or maybe the Linux kernel simply waits to transition to C11 relaxed atomics until the compiler has learned to be sufficiently conservative in its load-store combining decisions. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 17, 2014 at 08:55:47PM +0100, Torvald Riegel wrote: On Sat, 2014-02-15 at 10:49 -0800, Linus Torvalds wrote: On Sat, Feb 15, 2014 at 9:45 AM, Torvald Riegel trie...@redhat.com wrote: I think a major benefit of C11's memory model is that it gives a *precise* specification for how a compiler is allowed to optimize. Clearly it does *not*. This whole discussion is proof of that. It's not at all clear, It might not be an easy-to-understand specification, but as far as I'm aware it is precise. The Cambridge group's formalization certainly is precise. From that, one can derive (together with the usual rules for as-if etc.) what a compiler is allowed to do (assuming that the standard is indeed precise). My replies in this discussion have been based on reasoning about the standard, and not secret knowledge (with the exception of no-out-of-thin-air, which is required in the standard's prose but not yet formalized). I agree that I'm using the formalization as a kind of placeholder for the standard's prose (which isn't all that easy to follow for me either), but I guess there's no way around an ISO standard using prose. If you see a case in which the standard isn't precise, please bring it up or open a C++ CWG issue for it. I suggest that I go through the Linux kernel's requirements for atomics and memory barriers and see how they map to C11 atomics. With that done, we would have very specific examples to go over. Without that done, the discussion won't converge very well. Seem reasonable? Thanx, Paul and the standard apparently is at least debatably allowing things that shouldn't be allowed. Which example do you have in mind here? Haven't we resolved all the debated examples, or did I miss any? It's also a whole lot more complicated than volatile, so the likelihood of a compiler writer actually getting it right - even if the standard does - is lower. It's not easy, that's for sure, but none of the high-performance alternatives are easy either. 
There are testing tools out there based on the formalization of the model, and we've found bugs with them. And the alternative of using something not specified by the standard is even worse, I think, because then you have to guess what a compiler might do, without having any constraints; IOW, one is resorting to "no sane compiler would do that", and that doesn't seem very robust either.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 17, 2014 at 09:39:54PM +0100, Richard Biener wrote: On February 17, 2014 7:18:15 PM GMT+01:00, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: [ . . . ] I agree with not needing volatiles to prevent speculated writes. However, they will sometimes be needed to prevent excessive load/store combining. The compiler doesn't have the runtime feedback mechanisms that the hardware has, and thus will need help from the developer from time to time. Or maybe the Linux kernel simply waits to transition to C11 relaxed atomics until the compiler has learned to be sufficiently conservative in its load-store combining decisions. Sounds backwards. Currently the compiler does nothing to the atomics. I'm sure we'll eventually add something. But if testing coverage is zero outside, then surely things get worse, not better, with time. Perhaps we solve this chicken-and-egg problem by creating a test suite? Thanx, Paul