Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
GCC 4.3.2. Maybe i missed something obvious? The typical use case of restrict is to tell it that multiple given arrays are independent and then give the loop optimizer more freedom to handle expressions in the loop that accesses these arrays. Since there are no loops in the list functions nothing changed. Ok presumably there are some other optimizations which rely on that alias information too, but again the list_* stuff is probably too simple to trigger any of them. -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Wed, Jan 21, 2009 at 09:54:02AM +0100, Andi Kleen wrote: GCC 4.3.2. Maybe i missed something obvious? The typical use case of restrict is to tell it that multiple given arrays are independent and then give the loop optimizer more freedom to handle expressions in the loop that accesses these arrays. Since there are no loops in the list functions nothing changed. Ok presumably there are some other optimizations which rely on that alias information too, but again the list_* stuff is probably too simple to trigger any of them. Any function that does several interleaved loads and stores through different pointers could have much more freedom to move loads early and stores late. Big OOOE CPUs won't care so much, but embedded and things (including in-order x86) are very important users of the kernel. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Wed, Jan 21, 2009 at 10:20:49AM +0100, Andi Kleen wrote: On Wed, Jan 21, 2009 at 09:52:08AM +0100, Nick Piggin wrote: On Wed, Jan 21, 2009 at 09:54:02AM +0100, Andi Kleen wrote: GCC 4.3.2. Maybe i missed something obvious? The typical use case of restrict is to tell it that multiple given arrays are independent and then give the loop optimizer more freedom to handle expressions in the loop that accesses these arrays. Since there are no loops in the list functions nothing changed. Ok presumably there are some other optimizations which rely on that alias information too, but again the list_* stuff is probably too simple to trigger any of them. Any function that does several interleaved loads and stores through different pointers could have much more freedom to move loads early and stores late. For once that would require more live registers. It's not a clear and obvious win. Especially not if you have only very little registers, like on 32bit x86. Then it would typically increase code size. The point is that the compiler is then free to do it. If things slow down after the compiler gets *more* information, then that is a problem with the compiler heuristics rather than the information we give it. Then x86s tend to have very very fast L1 caches and if something is not in L1 on reads then the cost of fetching something for a read dwarfs the few cycles you can typically get out of this. Well most architectures have L1 caches of several cycles. And L2 miss typically means going to L2 which in some cases the compiler is expected to attempt to cover as much as possible (eg in-order architectures). If the caches are missed completely, then especially with an in-order architecture, you want to issue as many parallel loads as possible during the stall. If the compiler can't resolve aliases, then it simply won't be able to bring some of those loads forward. And lastly even on a in order system stores can be typically queued without stalling, so it doesn't hurt to do them early. Store queues are, what? On the order of tens of entries for big power hungry x86? I'd guess much smaller for low power in-order x86 and ARM etc. These can definitely fill up and stall, so you still want to get loads out early if possible. Even a lot of OOOE CPUs I think won't have the best alias anaysis, so all else being equal, it wouldn't hurt them to move loads earlier. Also at least x86 gcc normally doesn't do scheduling beyond basic blocks, so any if () shuts it up. I don't think any of this is a reason not to use restrict, though. But... there are so many places we could add it to the kernel, and probably so few where it makes much difference. Maybe it should be able to help some critical core code, though. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
The point is that the compiler is then free to do it. If things slow down after the compiler gets *more* information, then that is a problem with the compiler heuristics rather than the information we give it. The point was the -Os disables it typically then. (not always, compiler heuristics are far from perfect) Then x86s tend to have very very fast L1 caches and if something is not in L1 on reads then the cost of fetching something for a read dwarfs the few cycles you can typically get out of this. Well most architectures have L1 caches of several cycles. And L2 miss typically means going to L2 which in some cases the compiler is expected to attempt to cover as much as possible (eg in-order architectures). L2 cache is so much slower that scheduling a few instructions more doesn't help much. stall, so you still want to get loads out early if possible. Even a lot of OOOE CPUs I think won't have the best alias anaysis, so all else being equal, it wouldn't hurt them to move loads earlier. Hmm, but if the load is nearby it won't matter if a store is in the middle, because the CPU will just execute over it. The real big win is if you do some computation inbetween, but at least for typical list manipulation there isn't really any. Also at least x86 gcc normally doesn't do scheduling beyond basic blocks, so any if () shuts it up. I don't think any of this is a reason not to use restrict, though. But... there are so many places we could add it to the kernel, and probably so few where it makes much difference. Maybe it should be able to help some critical core code, though. Frankly I think it would be another unlikely(). -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Tue, 2009-01-20 at 13:38 +0100, Ingo Molnar wrote: * Nick Piggin npig...@suse.de wrote: it seems like a nice opt-in thing that can be used where the aliases are verified and the code is particularly performance critical... Yes. I think we could use it in the kernel, although I'm not sure how many cases we would ever find where we really care. Yeah, we don't tend to do a lot of intensive data processing, so it is normally the cache misses that hurt most as you noted earlier. Some places it might be appropriate, though. It might be nice if it can bring code size down too... I checked, its size effects were miniscule [0.17%] on the x86 defconfig kernel and it seems to be a clear loss in total cost as there would be an ongoing maintenance cost They were talking about 'restrict', not strict-aliasing. Where it can be used, it's going to give you optimisations that strict-aliasing can't. -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Ingo Molnar wrote: Hm, GCC uses __restrict__, right? I'm wondering whether there's any internal tie-up between alias analysis and the __restrict__ keyword - so if we turn off aliasing optimizations the __restrict__ keyword's optimizations are turned off as well. Actually I suspect that restrict makes little difference for inlines or even statics, since gcc generally can do alias analysis fine there. However, in the presence of an intermodule function call, all alias analysis is off. This is presumably why type-based analysis is used at all ... to at least be able to a modicum of, say, loop invariant removal in the presence of a library call. This is also where restrict comes into play. -hpa -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Tue, Jan 20, 2009 at 08:01:52AM +1100, Linus Torvalds wrote: On Mon, 19 Jan 2009, Nick Piggin wrote: I want to know what is the problem with the restrict keyword? I'm sure I've read Linus ranting about how bad it is in the past... No, I don't think I've ranted about 'restrict'. I think it's a reasonable solution for performance-critical code, and unlike the whole type-aliasing model, it actually works for the _sane_ cases (ie doing some operation over two arrays of the same type, and letting the compiler know that it can access the arrays without fearing that writing to one would affect reading from the other). The problem with 'restrict' is that almost nobody uses it, and it does obviously require programmer input rather than the compiler doing it automatically. But it should work well as a way to get Fortran-like performance from HPC workloads written in C - which is where most of the people are who really want the alias analysis. OK, that makes sense. I just had a vague feeling that you disliked it. it seems like a nice opt-in thing that can be used where the aliases are verified and the code is particularly performance critical... Yes. I think we could use it in the kernel, although I'm not sure how many cases we would ever find where we really care. Yeah, we don't tend to do a lot of intensive data processing, so it is normally the cache misses that hurt most as you noted earlier. Some places it might be appropriate, though. It might be nice if it can bring code size down too... -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
The problem with 'restrict' is that almost nobody uses it, and it does Also gcc traditionally didn't do a very good job using it (this might be better in the very latest versions). At least some of the 3.x often discarded this information. automatically. But it should work well as a way to get Fortran-like performance from HPC workloads written in C - which is where most of the people are who really want the alias analysis. It's more than just HPC -- a lot of code has critical loops. it seems like a nice opt-in thing that can be used where the aliases are verified and the code is particularly performance critical... Yes. I think we could use it in the kernel, although I'm not sure how many cases we would ever find where we really care. Very little I suspect. Also the optimizations that gcc does with this often increase the code size. While that can be a win, with people judging gcc's output apparently *ONLY* on the code size as seen in this thread[1] it would obviously not compete well. -Andi [1] although there are compilers around that generate smaller code than gcc at its best. -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
* Jamie Lokier ja...@shareable.org wrote: Ingo Molnar wrote: If it's used once in a single .c file it should be inlined even if it's large. As Linus has pointed out, because of GCC not sharing stack among different inlined functions, the above is surprisingly not true. Yes, but note that this has no relevance to the specific case of CONFIG_OPTIMIZE_INLINING: GCC can at most decide to inline _less_, not more. I.e. under CONFIG_OPTIMIZE_INLINING we can only end up having less stack sharing trouble. Ingo -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Andi Kleen wrote: On Sun, Jan 11, 2009 at 04:21:03PM -0800, Linus Torvalds wrote: On Mon, 12 Jan 2009, Andi Kleen wrote: so at least least for this case it works. Your case also doesn't work for me. So it looks like gcc didn't like something you did in your test program. I very intentionally used _different_ types. If you use the same type, gcc will apparenrly happily say hey, I can combine two variables of the same type with different liveness into the same variable. Confirmed. But that's not the interesting case. Weird. I wonder where this strange restriction comes from. Something at the back of my mind said aliasing. $ gcc linus.c -O2 -S ; grep subl linus.s subl$1624, %esp $ gcc linus.c -O2 -S -fno-strict-aliasing; grep subl linus.s subl$824, %esp That's with 4.3.2. Bernd -- This footer brought to you by insane German lawmakers. Analog Devices GmbH Wilhelm-Wagenfeld-Str. 6 80807 Muenchen Sitz der Gesellschaft Muenchen, Registergericht Muenchen HRB 40368 Geschaeftsfuehrer Thomas Wessel, William A. Martin, Margaret Seif -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Mon, Jan 12, 2009 at 11:02:17AM -0800, Linus Torvalds wrote: Something at the back of my mind said aliasing. $ gcc linus.c -O2 -S ; grep subl linus.s subl$1624, %esp $ gcc linus.c -O2 -S -fno-strict-aliasing; grep subl linus.s subl$824, %esp That's with 4.3.2. Interesting. Nonsensical, but interesting. What I find nonsensical is that -fno-strict-aliasing generates better code here. Normally one would expect the compiler seeing more aliases with that option and then be more conservative regarding any sharing. But it seems to be the other way round here. -Andi -- a...@linux.intel.com -- Speaking for myself only. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Andi Kleen wrote: On Mon, Jan 12, 2009 at 11:02:17AM -0800, Linus Torvalds wrote: Something at the back of my mind said aliasing. $ gcc linus.c -O2 -S ; grep subl linus.s subl$1624, %esp $ gcc linus.c -O2 -S -fno-strict-aliasing; grep subl linus.s subl$824, %esp That's with 4.3.2. Interesting. Nonsensical, but interesting. What I find nonsensical is that -fno-strict-aliasing generates better code here. Normally one would expect the compiler seeing more aliases with that option and then be more conservative regarding any sharing. But it seems to be the other way round here. For this to be convolved with aliasing *AT ALL* indicates this is done incorrectly. This is about storage allocation, not aliases. Storage allocation only depends on lifetime. -hpa -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Mon, 12 Jan 2009, Andi Kleen wrote: What I find nonsensical is that -fno-strict-aliasing generates better code here. Normally one would expect the compiler seeing more aliases with that option and then be more conservative regarding any sharing. But it seems to be the other way round here. No, that's not the surprising part. And in fact, now that you mention it, I can even tell you why gcc does what it does. But you'll need some background to it: Type-based aliasing is _stupid_. It's so incredibly stupid that it's not even funny. It's broken. And gcc took the broken notion, and made it more so by making it a by-the-letter-of-the-law thing that makes no sense. What happens (well, maybe it's fixed, but this was _literally_ what gcc used to do) is that the type-based aliasing overrode everything else, so if two accesses were to different types (and not in a union, and none of the types were char), then gcc knew that they clearly could not alias, and could thus wildly re-order accesses. That's INSANE. It's so incredibly insane that people who do that should just be put out of their misery before they can reproduce. But real gcc developers really thought that it makes sense, because the standard allows it, and it gives the compiler the maximal freedom - because it can now do things that are CLEARLY NONSENSICAL. And to compiler people, being able to do things that are clearly nonsensical seems to often be seen as a really good thing, because it means that they no longer have to worry about whether the end result works or not - they just got permission to do stupid things in the name of optimization. So gcc did. I know for a _fact_ that gcc would re-order write accesses that were clearly to (statically) the same address. Gcc would suddenly think that unsigned long a; a = 5; *(unsigned short *)a = 4; could be re-ordered to set it to 4 first (because clearly they don't alias - by reading the standard), and then because now the assignment of 'a=5' was later, the assignment of 4 could be elided entirely! And if somebody complains that the compiler is insane, the compiler people would say nyaah, nyaah, the standards people said we can do this, with absolutely no introspection to ask whether it made any SENSE. Anyway, once you start doing stupid things like that, and once you start thinking that the standard makes more sense than a human being using his brain for 5 seconds, suddenly you end up in a situation where you can move stores around wildly, and it's all 'correct'. Now, take my stupid example, and make fn1() do a.a = 1 and make fn2() do b.b = 2, and think about what a compiler that thinks it can re-order the two writes willy-nilly will do? Right. It will say ok, a.a and b.b can not alias EVEN IF THEY HAVE STATICALLY THE SAME ADDFRESS ON THE STACK, because they are in two different structres. So we can then re-order the accesses, and move the stores around. Guess what happens if you have that kind of insane mentality, and you then try to make sure that they really don't alias, so you allocate extra stack space. The fact is, Linux uses -fno-strict-aliasing for a damn good reason: because the gcc notion of strict aliasing is one huge stinking pile of sh*t. Linux doesn't use that flag because Linux is playing fast and loose, it uses that flag because _not_ using that flag is insane. Type-based aliasing is unacceptably stupid to begin with, and gcc took that stupidity to totally new heights by making it actually more important than even statically visible aliasing. Linus -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Mon, 12 Jan 2009, H. Peter Anvin wrote: This is about storage allocation, not aliases. Storage allocation only depends on lifetime. Well, the thing is, code motion does extend life-times, and if you think you can move stores across each other (even when you can see that they alias statically) due to type-based alias decisions, that does essentially end up making what _used_ to be disjoint lifetimes now be potentially overlapping. Linus -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Linus Torvalds wrote: On Mon, 12 Jan 2009, Bernd Schmidt wrote: Something at the back of my mind said aliasing. $ gcc linus.c -O2 -S ; grep subl linus.s subl$1624, %esp $ gcc linus.c -O2 -S -fno-strict-aliasing; grep subl linus.s subl$824, %esp That's with 4.3.2. Interesting. Nonsensical, but interesting. Since they have no overlap in lifetime, confusing this with aliasing is really really broken (if the functions _hadn't_ been inlined, you'd have gotten the same address for the two variables anyway! So anybody who thinks that they need different addresses because they are different types is really really fundmantally confused!). I've never really looked at the stack slot sharing code. But I think it's not hard to see what's going on: no overlap in lifetime may be a temporary state. Let's say you have { variable_of_some_type a; writes to a; other stuff; reads from a; } { variable_of_some_other_type b; writes to b; other stuff; reads from b; } At the point where the compiler generates RTL, stack space has to be allocated for variables A and B. At this point the lifetimes are non-overlapping. However, if the compiler chooses to put them into the same stack location, the RTL-based alias analysis will happily conclude (based on the differing types) that the reads from A and the writes to B can't possibly conflict, and some passes may end up reordering them. End result: overlapping lifetimes and overlapping stack slots. Oops. Bernd -- This footer brought to you by insane German lawmakers. Analog Devices GmbH Wilhelm-Wagenfeld-Str. 6 80807 Muenchen Sitz der Gesellschaft Muenchen, Registergericht Muenchen HRB 40368 Geschaeftsfuehrer Thomas Wessel, William A. Martin, Margaret Seif -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Mon, 12 Jan 2009, Bernd Schmidt wrote: However, if the compiler chooses to put them into the same stack location, the RTL-based alias analysis will happily conclude (based on the differing types) that the reads from A and the writes to B can't possibly conflict, and some passes may end up reordering them. End result: overlapping lifetimes and overlapping stack slots. Oops. Yes, I came to the same conclusion. Of course, I knew a-priori that the real bug was using type-based alais analysis to make (statically visible) aliasing decisions, but I realize that there are people who never understood things like that. Sadly, some of them worked on gcc. Linus -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Linus Torvalds wrote: This is about storage allocation, not aliases. Storage allocation only depends on lifetime. Well, the thing is, code motion does extend life-times, and if you think you can move stores across each other (even when you can see that they alias statically) due to type-based alias decisions, that does essentially end up making what _used_ to be disjoint lifetimes now be potentially overlapping. Sometimes code motion makes code faster and/or smaller but use more stack space. If you want to keep the stack use down, it blocks some other optimisations. Register allocation is similar: code motion optimisations may use more registers due to overlapping lifetimes, which causes more register spills and changes the code. The two interact; it's not trivial to optimise fully. -- Jamie -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Linus Torvalds wrote: But you'll need some background to it: You paint a somewhat one-sided picture bordering on FUD. Type-based aliasing is _stupid_. Type-based aliasing is simply an application of the language definition, and depending on the compiled application and/or target architecture, it can be essential for performance. It's _hard_ to tell whether two memory accesses can possibly conflict, and the ability to decide based on type makes a vast difference. This is not, as you suggest in another post, simply a mild inconvenience for the compiler that restricts scheduling a bit and forces the hardware sort it out at run-time. Too lazy to construct one myself, I googled for examples, and here's a trivial one that shows how it affects the ability of the compiler to eliminate memory references: typedef struct { short a, b, c; } Sample; void test(int* values, Sample *uniform, int count) { int i; for (i=0;icount;i++) { values[i] += uniform-b; } } Type-based aliasing is what allows you to eliminate a load from the loop. Most users probably expect this kind of optimization from their compiler, and it'll make a difference not just on Itanium. I'll grant you that if you're writing a kernel or maybe a malloc library, you have reason to be unhappy about it. But that's what compiler switches are for: -fno-strict-aliasing allows you to write code in a superset of C. So gcc did. I know for a _fact_ that gcc would re-order write accesses that were clearly to (statically) the same address. Gcc would suddenly think that unsigned long a; a = 5; *(unsigned short *)a = 4; could be re-ordered to set it to 4 first (because clearly they don't alias - by reading the standard), To be precise, what the standard says is that your example is not C, and therefore has no meaning. While this kind of thing does occur in the wild, it is infrequent, and the programs that used this kind of code have been fixed over the years. gcc even warns about code such as the above with -Wall, which makes this even more of a non-issue. linus2.c: In function 'foo': linus2.c:6: warning: dereferencing type-punned pointer will break strict-aliasing rules and then because now the assignment of 'a=5' was later, the assignment of 4 could be elided entirely! And if somebody complains that the compiler is insane, the compiler people would say nyaah, nyaah, the standards people said we can do this, with absolutely no introspection to ask whether it made any SENSE. The thing is, yours is a trivial example, but try to think further: in the general case the compiler can't tell whether two accesses can go to the same address at runtime. If it could, we wouldn't be having this discussion; I'm pretty sure this question reduces to the halting problem. That's why the compiler must have a set of conservative rules that allow it to decide that two accesses definitely _can't_ conflict. For all standards conforming programs, type based aliasing is such a rule. You could add code to weaken it by also checking against the address, but since that cannot be a reliable test that catches all problematic cases, what would be the point? So, in effect, if you're arguing that the compiler should detect the above case and override the type-based aliasing based on the known address, you're arguing that only subtle bugs in the application should be exposed, not the obvious ones. If you're arguing we should do away with type-based aliasing altogether, you're ignoring the fact that there are (a majority of) other users of gcc than the Linux kernel, they write standards-conforming C, and they tend to worry about performance of compiled code. The fact is, Linux uses -fno-strict-aliasing for a damn good reason: because the gcc notion of strict aliasing is one huge stinking pile of sh*t. Linux doesn't use that flag because Linux is playing fast and loose, it uses that flag because _not_ using that flag is insane. Not using this flag works for pretty much all user space applications these days. Type-based aliasing is unacceptably stupid to begin with, and gcc took that stupidity to totally new heights by making it actually more important than even statically visible aliasing. gcc makes use of statically visible aliasing if it can use it to prove that two accesses can't conflict even if they have the same type, but it's vastly less powerful than type based analysis. Since it's impossible in general to decide that two accesses must conflict, trying to avoid transformations based on such an attempt is completely senseless. Trying to do so would have no effect for conforming C programs, and avoid only a subset of the problematic cases for other programs, so it's a waste of time. So, to summarize: strict aliasing works for nearly every application these days, there's a compiler switch for the rest to turn it off, it can be a serious performance improvement, and the compiler
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Mon, 12 Jan 2009, Bernd Schmidt wrote: Too lazy to construct one myself, I googled for examples, and here's a trivial one that shows how it affects the ability of the compiler to eliminate memory references: Do you really think this is realistic or even relevant? The fact is (a) most people use similar types, so your example of short vs int is actually not very common. Type-based alias analysis is wonderful for finding specific examples of something you can optimize, but it's not actually all that wonderful in general. It _particularly_ isn't wonderful once you start looking at the downsides. When you're adding arrays of integers, you're usually adding integers. Not shorts. The shorts may be a great example of a special case, but it's a special case! (b) instructions with memory accesses aren't the problem - instructions that take cache misses are. Your example is an excellent example of that - eliding the simple load out of the loop makes just about absolutely _zero_ difference in any somewhat more realistic scenario, because that one isn't the one that is going to make any real difference anyway. The thing is, the way to optimize for modern CPU's isn't to worry over-much about instruction scheduling. Yes, it matters for the broken ones, but it matters in the embedded world where you still find in-order CPU's, and there the size of code etc matters even more. I'll grant you that if you're writing a kernel or maybe a malloc library, you have reason to be unhappy about it. But that's what compiler switches are for: -fno-strict-aliasing allows you to write code in a superset of C. Oh, I'd use that flag regardless yes. But what you didn't seem to react to was that gcc - for no valid reason what-so-ever - actually trusts (or at least trusted: I haven't looked at that code for years) provably true static alias information _less_ than the idiotic weaker type-based one. You make all this noise about how type-based alias analysis improves code, but then you can't seem to just look at the example I gave you. Type-based alias analysis didn't improve code. It just made things worse, for no actual gain. Moving those accesses to the stack around just causes worse behavior, and a bigger stack frame, which causes more cache misses. [ Again, I do admit that kernel code is different: we tend to have a cold stack, in ways that many other code sequences do not have. System code tends to get a lot more I$ and D$ misses. Deep call-chains _will_ take cache misses on the stack, simply because the user will do things between system calls or page faults that almost guarantees that things are not in L1, and often not in L2 either. Also, sadly, microbenchmarks often hide this, since they are often exactly the unrealistic kinds of back-to-back system calls that almost no real program ever has, since real programs actually _do_ something with the data. ] My point is, you're making all these arguments and avoiding looking at the downsides of what you are arguing for. So we use -Os - because it generally generates better (and simpler) code. We use -fno-strict-alias for the same reason. Linus -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sat, 2009-01-10 at 04:02 +0100, Andi Kleen wrote: Long term that problem will hopefully disappear, as gcc learns to do cross source file inlining (like a lot of other compilers already do) We've already been able to get GCC doing this for the kernel, in fact (the --combine -fwhole-program stuff I was working on a while back). It gives an interesting size reduction, especially in file systems and other places where we tend to have functions with a single call site... but in a different file. Linus argues that we don't want that kind of inlining because it harms debuggability, but that isn't _always_ true. Sometimes you weren't going to get a backtrace if something goes wrong _anyway_. And even if the size reduction doesn't necessarily give a corresponding performance improvement, we might not care. In the embedded world, size does matter too. And the numbers are such that you can happily keep debuginfo for the shipped kernel builds and postprocess any backtraces you get. Just as we can for distros. In general, I would much prefer being able to trust the compiler, rather than disabling its heuristics completely. We might not be able to trust it right now, but we should be working towards that state. Not just declaring that we know best, even though _sometimes_ we do. I think we should: - Unconditionally have 'inline' meaning 'always_inline'. If we say it, we should mean it. - Resist the temptation to use -fno-inline-functions. Allow GCC to inline other things if it wants to. - Reduce the number of unnecessary 'inline' markers, and have a policy that the use of 'inline' should be accompanied by either a GCC PR# or an explanation of why we couldn't reasonably have expected GCC to get this particular case right. - Have a similar policy of PR# or explanation for 'uninline' too. I don't think we should just give up on GCC ever getting it right. That way lies madness. As we've often found in the past. -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sun, Jan 11, 2009 at 11:25:32AM -0800, Linus Torvalds wrote: On Sun, 11 Jan 2009, Andi Kleen wrote: The proposal was to use -fno-inline-functions-called-once (but the resulting numbers were not promising) Well, the _optimal_ situation would be to not need it, because gcc does a good job without it. That implies trying to find a better balance between worth it and causes problems. Rigth now, it does sound like gcc simply doesn't try to balance AT ALL, or balances only when we add some very version-specific random options (ie the stack-usage one). The gcc 4.3 inliner takes stack growth into account by default (without any special options). I experimented a bit with it when that was introduced and found the default thresholds are too large for the kernel and don't change the checkstack.pl picture much. I asked back then and was told --param large-stack-frame is expected to be a reasonable stable --param (as much as these can be) and I did a patch to lower it, but I couldn't get myself to actually submit it [if you really want it I can send it]. But of course that only helps for gcc 4.3+, older gccs would need a different workaround. On the other hand (my personal opinion, not shared by everyone) is that the ioctl switch stack issue is mostly only a problem with 4K stacks and in the rare cases when I still run 32bit kernels I never set that option because I consider it russian roulette (because there undoutedly dangerous dynamic stack growth cases that checkstack.pl doesn't flag) And even those options don't actually make much sense - yes, they balance things, but they don't do it in a sensible manner. For example: stack usage is undeniably a problem (we've hit it over and over again), but it's not about stack must not be larger than X bytes. If the call is done unconditionally, then inlining _one_ function will grow the static stack usage of the function we inline into, but it will _not_ grow the dynamic stack usage one whit - so deciding to not inline because of stack usage is pointless. Don't think the current inliner takes that into account from a quick look at the sources, although it probably could. Maybe Honza can comment. But even if it did it could only do that for a single file, but if the function is in a different file gcc doesn't have the information (unless you run with David's --combine hack). This means the kernel developers have to do it anyways. On the other hand I'm not sure it's that big a problem. Just someone should run make checkstack occasionally and add noinlines to large offenders. -Andi [keep quote for Honza's benefit] See? So stop inlining when you hit a stack limit IS THE WRONG THING TO DO TOO! Because it just means that the compiler continues to do bad inlining decisions until it hits some magical limit - but since the problem isn't the static stack size of any _single_ function, but the combined stack size of a dynamic chain of them, that's totally idiotic. You still grew the dynamic stack, and you have no way of knowing by how much - the limit on the static one simply has zero bearing what-so-ever on the dynamic one. So no, limit static stack usage is not a good option, because it stops inlining when it doesn't matter (single unconditional call), and doesn't stop inlining when it might (lots of sequential calls, in a deep chain). The other alternative is to let gcc do what it does, but (a) remove lots of unnecessary 'inline's. And we should likely do this regardless of any -fno-inline-functions-called-once issues. (b) add lots of 'noinline's to avoid all the cases where gcc screws up so badly that it's either a debugging disaster or an actual correctness issue. The problem with (b) is that it's a lot of hard thinking, and debugging disasters always happen in code that you didn't realize would be a problem (because if you had, it simply wouldn't be the debugging issue it is). Linus -- a...@linux.intel.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sun, 2009-01-11 at 21:14 +0100, Andi Kleen wrote: On the other hand (my personal opinion, not shared by everyone) is that the ioctl switch stack issue is mostly only a problem with 4K stacks and in the rare cases when I still run 32bit kernels I never set that option because I consider it russian roulette (because there undoutedly dangerous dynamic stack growth cases that checkstack.pl doesn't flag) Isn't the ioctl switch stack issue a separate GCC bug? It was/is assigning assigning separate space for local variables which are mutually exclusive. So instead of the stack footprint of the function with the switch() being equal to the largest individual stack size of all the subfunctions, it's equal to the _sum_ of the stack sizes of the subfunctions. Even though it'll never use them all at the same time. Without that bug, it would have been harmless to inline them all. -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sun, 11 Jan 2009, Andi Kleen wrote: Was -- i think that got fixed in gcc. But again only in newer versions. I doubt it. People have said that about a million times, it has never gotten fixed, and I've never seen any actual proof. I think that what got fixed was that gcc now at least re-uses stack slots for temporary spills. But only for things that fit in registers - not if you actually had variables that are big enough to be of type MEM. And the latter is what tends to eat stack-space (ie structures etc on stack). But hey, maybe it really did get fixed. But the last big stack user wasn't that long ago, and I saw it and I have a pretty recent gcc (gcc-4.3.2 right now, it could obviously have been slightly older back a few months ago). Linus -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Linus Torvalds wrote: Actually, the real spin locks are now fair. We use ticket locks on x86. Well, at least we do unless you enable that broken paravirt support. I'm not at all clear on why CONFIG_PARAVIRT wants to use inferior locks, but I don't much care. No, it will continue to use ticket locks, but there's the option to switch to byte locks or something else. Ticket locks are awesomely bad when the VCPU scheduler fights with the run-order required by the ticket order. J -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 2009-01-08 at 11:54 -0800, Linus Torvalds wrote: I was pretty sure that adding the unlocked loop should provably not change the mutex lock semantics. Why? Because it's just basically equivalent to just doing the mutex_trylock() without really changing anything really fundamental in the mutex logic. And that argument is sadly totally bogus. It fails for the RT case, yes. It should still be true for regular tasks - if the owner tracking was accurate. The thing is, we used to have this guarantee that any contention would always go into the slowpath, and then in the slow-path we serialize using the spinlock. So I think the bug is still there, we just hid it better by breaking out of the loop with that if (need_resched()) always eventually triggering. And it would be ok if it really is guaranteed to _eventually_ trigger, and I guess with timeslices it eventually always will, but I suspect we could have some serious latency spikes. Yes, the owner getting preempted after acquiring the lock, but before setting the owner can give some nasties :-( I initially did that preempt_disable/enable around the fast path, but I agree that slowing down the fast path is unwelcome. Alternatively we could go back to block on !owner, with the added complexity of not breaking out of the spin on lock-owner != owner when !lock-owner, so that the premature owner clearing of the unlock fast path will not force a schedule right before we get a chance to acquire the lock. Let me do that.. The problem? Setting lock-count to 0. That will mean that the next mutex_unlock() will not necessarily enter the slowpath at all, and won't necessarily wake things up like it should. That's exactly what __mutex_fastpath_trylock() does (or can do, depending on the implementation), so if regular mutexes are correct in the face of a trylock stealing the lock in front of a woken up waiter, then we're still good. That said, I'm not seeing how mutexes aren't broken already. say A locks it: counter 1-0 then B contends: counter 0--1, added to wait list then C contentds: counter -1, added to wait list then A releases: counter -1-1, wake someone up, say B then D trylocks: counter 1-0 so B is back to wait list then D releases, 0-1, no wakeup Aaah, B going back to sleep sets it to -1 Therefore, I think we're good. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 2009-01-09 at 04:35 +0100, Andi Kleen wrote: On Thu, Jan 08, 2009 at 05:44:25PM -0800, H. Peter Anvin wrote: Harvey Harrison wrote: We might still try the second or third options, as i think we shouldnt go back into the business of managing the inline attributes of ~100,000 kernel functions. Or just make it clear that inline shouldn't (unless for a very good reason) _ever_ be used in a .c file. The question is if that would produce acceptable quality code. In theory it should, but I'm more than wondering if it really will. I actually often use noinline when developing code simply because it makes it easier to read oopses when gcc doesn't inline ever static (which it normally does if it only has a single caller). You know roughly where it crashed without having to decode the line number. I believe others do that too, I notice it's all over btrfs for example. For btrfs it was mostly about stack size at first. I'd use checkstack.pl and then run through the big funcs and figure out how they got so huge. It was almost always because gcc was inlining something it shouldn't, so I started using it on most funcs. -chris -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
* H. Peter Anvin h...@zytor.com wrote: Andi Kleen wrote: I'll try to annotate the inline asms (there's not _that_ many of them), and measure what the size impact is. You can just use the patch I submitted and that you rejected for most of them :) I just ran a sample build for x86-64 with gcc 4.3.0, these all allyesconfig builds (modulo the inlining option): : voreg 64 ; size o.*/vmlinux textdata bss dec hex filename 59421552 24912223 15560504 99894279 5f44407 o.noopt/vmlinux 57700527 24950719 15560504 98211750 5da97a6 o.opty/vmlinux 57590217 24940519 15560504 98091240 5d8c0e8 o.andi/vmlinux A 3% code size difference even on allyesconfig (1.8 MB!) is nothing to sneeze at. As shown by the delta from Andi's patch, these small assembly stubs really do need to be annotated, since gcc simply has no way to do anything sane with them -- it just doesn't know. I've done a finegrained size analysis today (see my other mail in this thread), and it turns out that on gcc 4.3.x the main (and pretty much only) inlining annotation that matters in arch/x86/include/asm/*.h is the onliner patch attached below, annotating constant_test_bit(). That change is included in Andi's patch too AFAICS - i.e. just that single hunk from Andi's patch would have given you 90% of the size win - an additional 0.17% size win to the 3.00% that CONFIG_OPTIMIZE_INLINING=y already brings. The second patch below had some (much smaller, 0.01% ) impact too. All the other annotations i did to hundreds of inlined asm()s had no measurable effect on GCC 4.3.2. (i.e. gcc appears to inline single-statement asms correctly) [ On older GCC it might matter more, but there we can/should turn off CONFIG_OPTIMIZE_INLINING. ] Personally, I'd like to see __asm_inline as opposed to __always_inline for these, though, as a documentation issue: __always_inline implies to me that this function needs to be inlined for correctness, and this could be highly relevant if someone, for example, recodes the routine in C or decides to bloat it out (e.g. paravirt_ops). Yeah. I've implemented __asm_inline today. It indeed documents the reason for the annotation in a cleaner way than slapping __always_inline around and diluting the quality of __always_inline annotations. It's not a perfect solution even then, because gcc may choose to not inline a higher level of inline functions for the same bogus reason. There isn't much we can do about that, though, unless gcc either integrates the assembler, or gives us some way of injecting the actual weight of the asm statement... Yeah. Ingo --- arch/x86/include/asm/bitops.h |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) Index: linux/arch/x86/include/asm/bitops.h === --- linux.orig/arch/x86/include/asm/bitops.h +++ linux/arch/x86/include/asm/bitops.h @@ -300,7 +300,8 @@ static inline int test_and_change_bit(in return oldbit; } -static inline int constant_test_bit(int nr, const volatile unsigned long *addr) +static __asm_inline int +constant_test_bit(int nr, const volatile unsigned long *addr) { return ((1UL (nr % BITS_PER_LONG)) (((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0; arch/x86/include/asm/bitops.h | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) Index: linux/arch/x86/include/asm/bitops.h === --- linux.orig/arch/x86/include/asm/bitops.h +++ linux/arch/x86/include/asm/bitops.h @@ -53,7 +53,7 @@ * Note that @nr may be almost arbitrarily large; this function is not * restricted to acting on a single-word quantity. */ -static inline void set_bit(unsigned int nr, volatile unsigned long *addr) +static __asm_inline void set_bit(unsigned int nr, volatile unsigned long *addr) { if (IS_IMMEDIATE(nr)) { asm volatile(LOCK_PREFIX orb %1,%0 @@ -75,7 +75,7 @@ static inline void set_bit(unsigned int * If it's called on the same region of memory simultaneously, the effect * may be that only one operation succeeds. */ -static inline void __set_bit(int nr, volatile unsigned long *addr) +static __asm_inline void __set_bit(int nr, volatile unsigned long *addr) { asm volatile(bts %1,%0 : ADDR : Ir (nr) : memory); } @@ -90,7 +90,7 @@ static inline void __set_bit(int nr, vol * you should call smp_mb__before_clear_bit() and/or smp_mb__after_clear_bit() * in order to ensure changes are visible on other processors. */ -static inline void clear_bit(int nr, volatile unsigned long *addr) +static __asm_inline void clear_bit(int nr, volatile unsigned long *addr) { if (IS_IMMEDIATE(nr)) { asm volatile(LOCK_PREFIX andb %1,%0 @@ -117,7 +117,7 @@ static inline void clear_bit_unlock(unsi clear_bit(nr, addr); } -static inline void __clear_bit(int nr, volatile unsigned
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 9 Jan 2009, Peter Zijlstra wrote: On Fri, 2009-01-09 at 11:47 +0100, Peter Zijlstra wrote: So I think the bug is still there, we just hid it better by breaking out of the loop with that if (need_resched()) always eventually triggering. And it would be ok if it really is guaranteed to _eventually_ trigger, and I guess with timeslices it eventually always will, but I suspect we could have some serious latency spikes. Yes, the owner getting preempted after acquiring the lock, but before setting the owner can give some nasties :-( I initially did that preempt_disable/enable around the fast path, but I agree that slowing down the fast path is unwelcome. Alternatively we could go back to block on !owner, with the added complexity of not breaking out of the spin on lock-owner != owner when !lock-owner, so that the premature owner clearing of the unlock fast path will not force a schedule right before we get a chance to acquire the lock. Let me do that.. Ok a few observations.. Adding that need_resched() in the outer loop utterly destroys the performance gain for PREEMPT=y. Voluntary preemption is mostly good, but somewhat unstable results. I was going to say a while ago... In PREEMPT=y the need_resched() is not needed at all. If you have preemption enabled, you will get preempted in that loop. No need for the need_resched() in the outer loop. Although I'm not sure how it would even hit the need_resched. If it was set, then it is most likely going to be cleared when coming back from being preempted. Adding that blocking on !owner utterly destroys everything. I was going to warn you about that ;-) Without the check for !owner, you are almost guaranteed to go to sleep every time. Here's why: You are spinning and thus have a hot cache on that CPU. The owner goes to unlock but will be in a cold cache. It sets lock-owner to NULL, but is still in cold cache so it is a bit slower. Once the spinner sees the NULL, it shoots out of the spin but sees the lock is still not available then goes to sleep. All before the owner could release it. This could probably happen at every contention. Thus, you lose the benefit of spinning. You probably make things worse because you add a spin before every sleep. -- Steve -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 2009-01-09 at 10:59 -0500, Steven Rostedt wrote: Adding that blocking on !owner utterly destroys everything. I was going to warn you about that ;-) Without the check for !owner, you are almost guaranteed to go to sleep every time. Here's why: You are spinning and thus have a hot cache on that CPU. The owner goes to unlock but will be in a cold cache. It sets lock-owner to NULL, but is still in cold cache so it is a bit slower. Once the spinner sees the NULL, it shoots out of the spin but sees the lock is still not available then goes to sleep. All before the owner could release it. This could probably happen at every contention. Thus, you lose the benefit of spinning. You probably make things worse because you add a spin before every sleep. Which is why I changed the inner loop to: l_owner = ACCESS_ONCE(lock-owner) if (l_owner l_owner != owner) break So that that would continue spinning. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 9 Jan 2009, Ingo Molnar wrote: -static inline int constant_test_bit(int nr, const volatile unsigned long *addr) +static __asm_inline int +constant_test_bit(int nr, const volatile unsigned long *addr) { return ((1UL (nr % BITS_PER_LONG)) (((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0; Thios makes absolutely no sense. It's called __always_inline, not __asm_inline. Why add a new nonsensical annotations like that? Also, the very fact that gcc gets that function wrong WHEN 'nr' IS CONSTANT (which is when it is called) just shows what kind of crap gcc is! Ingo, the fact is, I care about size, but I care about debuggability and sanity more. I don't care one _whit_ about 3% size differences, if they are insane and cause idiotic per-compiler differences. And you haven't done any interesting analysis per-file etc. It shoul be almost _trivial_ to do CONFIG_OPTIMIZE_INLINING on/off tests for the whole tree, and then comparing sizes of individual object files, and see if we find some obvious _bug_ where we just inline too much. In fact, we shouldn't even do that - we should try to find a mode where gcc simply refuses to inline at all, and compare that to one where it _only_ inlines the things we ask it to. Because that's the more relevant test. The problem with gcc inlining is actually two-fold: - gcc doesn't inline things we ask for Here the sub-problem is that we ask for this too much, but see above on how to figure -that- out! - gcc _does_ inline things that we haven't marked at all, causing too much stack-space to be used, and causing debugging problems. And here the problem is that gcc should damn well not do that, at least not as aggressively as it does! IT DOES NOT MATTER if something is called in just one place and inlining makes things smaller! If it's not a clear performance win (and it almost never is, unless the function is really small), the inlining of especially functions that aren't even hot in the cache is ONLY a negative thing. Linus -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Linus Torvalds wrote: On Fri, 9 Jan 2009, Ingo Molnar wrote: -static inline int constant_test_bit(int nr, const volatile unsigned long *addr) +static __asm_inline int +constant_test_bit(int nr, const volatile unsigned long *addr) { return ((1UL (nr % BITS_PER_LONG)) (((unsigned long *)addr)[nr / BITS_PER_LONG])) != 0; Thios makes absolutely no sense. It's called __always_inline, not __asm_inline. Why add a new nonsensical annotations like that? __asm_inline was my suggestion, to distinguish inline this unconditionally because gcc screws up in the presence of asm() versus inline this unconditionally because the world ends if it isn't -- to tell the human reader, not gcc. I guess the above is a good indicator that the __asm_inline might have been a bad name. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 9 Jan 2009, Steven Rostedt wrote: I was going to say a while ago... In PREEMPT=y the need_resched() is not needed at all. If you have preemption enabled, you will get preempted in that loop. No need for the need_resched() in the outer loop. Although I'm not sure how it would even hit the need_resched. If it was set, then it is most likely going to be cleared when coming back from being preempted. No, no, you miss the point entirely. It's not about correctness. Remember: the whole (and only) point of spinning is about performance. And the thing is, we should only spin if it makes sense. So the if (need_resched()) break; is not there because of any ok, I need to sleep now, it's there because of something TOTALLY DIFFERENT, namely ok, it makes no sense to spin now, since I should be sleeping. See? WE DO NOT WANT TO BE PREEMPTED in this region, because that totally destroys the whole point of the spinning. If we go through the scheduler, then we should go through the scheduler AND GO TO SLEEP, so that we don't go through the scheduler any more than absolutely necessary. So this code - by design - is always only going to get worse if you have involuntary preemption. The preemption is going to do _two_ bad things: - it's going to call the scheduler at the wrong point, meaning that we now scheduler _more_ (or at least not less) than if we didn't have that spin-loop in the first place. - .. and to make things worse, since it scheduled for us, it is going to clear that need_resched() flag, so we'll _stay_ in the bad spinning loop too long! So quite frankly, if you have CONFIG_PREEMPT, then the spinning really is the wrong thing to do, or the whole mutex slow-path thing should be done with preemption disabled so that we only schedule where we _should_ be scheduling. Linus -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 9 Jan 2009, H. Peter Anvin wrote: __asm_inline was my suggestion, to distinguish inline this unconditionally because gcc screws up in the presence of asm() THERE IS NO ASM IN THERE! Guys, look at the code. No asm. The whole notion that gcc gets confused by inline asms IS BOGUS. It's simply not TRUE. Gcc gets confused because gcc is confused, and it has NOTHING to do with inline asms. So please don't confuse things further. Linus -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 8 Jan 2009, Peter Zijlstra wrote: Well, at least we do unless you enable that broken paravirt support. I'm not at all clear on why CONFIG_PARAVIRT wants to use inferior locks, but I don't much care. Because the virtual cpu that has the ticket might not get scheduled for a while, even though another vcpu with a spinner is scheduled. The whole (para)virt is a nightmare in that respect. Hmm, are we in fact really using byte locks in CONFIG_PARAVIRT situation? Where are we actually setting pv_lock_ops.spin_lock pointer to point to __byte_spin_lock? Such initialization seems to happen only in paravirt_use_bytelocks() function, but my blind eyes prevent me from finding a callsite from which this function would eventually get called. -- Jiri Kosina SUSE Labs -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 2009-01-09 at 11:44 -0500, Steven Rostedt wrote: When we get to the schedule() it then needs to be a: preempt_enable_no_resched(); schedule(); On that note: Index: linux-2.6/kernel/mutex.c === --- linux-2.6.orig/kernel/mutex.c +++ linux-2.6/kernel/mutex.c @@ -220,7 +220,9 @@ __mutex_lock_common(struct mutex *lock, __set_task_state(task, state); /* didnt get the lock, go to sleep: */ + preempt_disable(); spin_unlock_mutex(lock-wait_lock, flags); + preempt_enable_no_resched(); schedule(); spin_lock_mutex(lock-wait_lock, flags); } actually improves mutex performance on PREEMPT=y -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 9 Jan 2009, Ingo Molnar wrote: So, should we not remove CONFIG_OPTIMIZE_INLINING, then the correct one would be to mark it __always_inline [__asm_inline is senseless there], or the second patch below that changes the bit parameter to unsigned int. Well, I certainly don't want to _remove_ the inline like your patch did. Other gcc versions will care. But I committed the pure change to unsigned part. But we should fix the cmpxchg (and perhaps plain xchg too), shouldn't we? That your gcc version gets it right doesn't change the fact that Chris' gcc version didn't, and out-of-lined it all. So we'll need some __always_inlines there too.. And no, I don't think it makes any sense to call them __asm_inline. Even when there are asms hidden in between the C statements, what's the difference between always and asm? None, really. Linus -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
I've done a finegrained size analysis today (see my other mail in this thread), and it turns out that on gcc 4.3.x the main (and pretty much only) inlining annotation that matters in arch/x86/include/asm/*.h is the onliner patch attached below, annotating constant_test_bit(). That's pretty cool. Should definitely file a gcc bug report for that though so that they can fix gcc. Did you already do that or should I? -Andi -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
* Linus Torvalds torva...@linux-foundation.org wrote: On Fri, 9 Jan 2009, Ingo Molnar wrote: So, should we not remove CONFIG_OPTIMIZE_INLINING, then the correct one would be to mark it __always_inline [__asm_inline is senseless there], or the second patch below that changes the bit parameter to unsigned int. Well, I certainly don't want to _remove_ the inline like your patch did. Other gcc versions will care. But I committed the pure change to unsigned part. But we should fix the cmpxchg (and perhaps plain xchg too), shouldn't we? That your gcc version gets it right doesn't change the fact that Chris' gcc version didn't, and out-of-lined it all. So we'll need some __always_inlines there too.. Yeah. I'll dig out an older version of gcc (latest distros are all 4.3.x based) and run the checks to see which inlines make a difference. And no, I don't think it makes any sense to call them __asm_inline. Even when there are asms hidden in between the C statements, what's the difference between always and asm? None, really. Well, the difference is small, nitpicky and insignificant: the thing is there are two logically separate categories of __always_inline: 1) the places where __always_inline means that in this universe no sane compiler ever can end up thinking to move that function out of line. 2) inlining for_correctness_ reasons: things like vreads or certain paravirt items. Stuff where the kernel actually crashes if we dont inline. Here if we do not inline we've got a materially crashy kernel. The original intention of __always_inline was to only cover the second category above - and thus self-document all the 'correctness inlines'. This notion has become bitrotten somewhat: we do use __always_inline in a few other places like the ticket spinlock inlines for non-correctness reasons. That bitrot happened because we simply have no separate symbol for the first category. So hpa suggested __asm_inline (yesterday, well before all the analysis was conducted) under the assumption that there would be many such annotations needed and that they would be all about cases where GCC's inliner gets confused by inline assembly. This theory turned out to be a red herring today - asm()s do not seem to confuse latest GCC. (although they certain confuse earlier versions, so it's still a practical issue, so i agree that we do need to annotate a few more places.) In any case, the __asm_inline name - even if it made some marginal sense originally - is totally moot now, no argument about that. The naming problem remains though: - Perhaps we could introduce a name for the first category: __must_inline? __should_inline? Not because it wouldnt mean 'always', but because it is 'always inline' for another reason than the correctless __always_inline. - Another possible approach wuld be to rename the second category to __force_inline. That would signal it rather forcefully that the inlining there is an absolute correctness issue. - Or we could go with the status quo and just conflate those two categories (as it is happening currently) and document the correctness inlines via in-source comments? But these are really nuances that pale in comparison to the fundamental questions that were asked in this thread, about the pure existence of this feature. If the optimize-inlining feature looks worthwile and maintainable to remain upstream then i'd simply like to see the information of these two categories preserved in a structured way (in 5 years i'm not sure i'd remember all the paravirt inlining details), and i dont feel too strongly about the style how we preserve that information. Ingo -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 2009-01-09 at 22:34 +0100, Ingo Molnar wrote: The naming problem remains though: - Perhaps we could introduce a name for the first category: __must_inline? __should_inline? Not because it wouldnt mean 'always', but because it is 'always inline' for another reason than the correctless __always_inline. - Another possible approach wuld be to rename the second category to __force_inline. That would signal it rather forcefully that the inlining there is an absolute correctness issue. __needs_inline? That would imply that it's for correctness reasons. Then __always_inline is left to mean that it doesn't _need_ to be inline but we _want_ it inline regardless of what gcc thinks? $0.02 Harvey -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 9 Jan 2009, Harvey Harrison wrote: __needs_inline? That would imply that it's for correctness reasons. .. but the point is, we have _thousands_ of inlines, and do you know which is which? We've historically forced them to be inlined, and every time somebody does that OPTIMIZE_INLINE=y, something simply _breaks_. So instead of just continually hitting our head against this wall because some people seem to be convinced that gcc can do a good job, just do it the other way around. Make the new one be inline_hint (no underscores needed, btw), and there is ansolutely ZERO confusion about what it means. At that point, everybody knows why it's there, and it's clearly not a correctness issue or anything else. Of course, at that point you might as well argue that the thing should not exist at all, and that such a flag should just be removed entirely. Which I certainly agree with - I think the only flag we need is inline, and I think it should mean what it damn well says. Linus -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
* Linus Torvalds torva...@linux-foundation.org wrote: On Fri, 9 Jan 2009, Ingo Molnar wrote: So, should we not remove CONFIG_OPTIMIZE_INLINING, then the correct one would be to mark it __always_inline [__asm_inline is senseless there], or the second patch below that changes the bit parameter to unsigned int. Well, I certainly don't want to _remove_ the inline like your patch did. hm, that was a bug that i noticed and fixed in the second, fuller version of the patch i sent - which converts all the 'int nr' instances in bitops.h to 'unsigned int nr'. This is the only instance where the integer type of 'nr' matters in practice though, due to the modulo arithmetics. But for cleanliness reasons we want to do the full patch, to have a standard type signature for these bitop methods. Other gcc versions will care. But I committed the pure change to unsigned part. thanks! I'll clean up the rest - the second patch will now conflict (trivially). I also wanted to check the whole file more fully, there might be other details. [ So many files, so few nights ;-) ] We also might need more __always_inline's here and in other places, to solve the nonsensical inlining problems that Chris's case showed for example, with earlier GCCs. Another option would be to not trust earlier GCCs at all with this feature - to only define inline to not-__always_inline only on latest 4.3.x GCC - the only one that seems to at least not mess up royally. Thus CONFIG_OPTIMIZE_INLINING=y would have no effect on older GCCs. That would quarantine the problem (and the impact) sufficiently i think. And if future GCCs start messing up in this area we could zap the whole feature in a heartbeat. ( Although now that we have this feature this gives an incentive to compiler folks to tune their inliner on to the Linux kernel - for a decade we never allowed them to do that. The kernel clearly has one of the trickiest (and ugliest) inlining smarts in headers - and we never really exposed compilers to those things, so i'm not surprised at all that they mess up in cases. Unfortunately the version lifecycle of most compiler projects is measured in years, not in months like that of the kernel. There's many reasons for that - and not all of those reasons are strictly their fault. ) Ingo -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 2009-01-09 at 13:50 -0800, Linus Torvalds wrote: On Fri, 9 Jan 2009, Harvey Harrison wrote: __needs_inline? That would imply that it's for correctness reasons. .. but the point is, we have _thousands_ of inlines, and do you know which is which? We've historically forced them to be inlined, and every time somebody does that OPTIMIZE_INLINE=y, something simply _breaks_. My suggestion was just an alternative to __force_inline as a naming...I agree that inline should mean __always_inline.always. So instead of just continually hitting our head against this wall because some people seem to be convinced that gcc can do a good job, just do it the other way around. Make the new one be inline_hint (no underscores needed, btw), and there is ansolutely ZERO confusion about what it means. agreed. At that point, everybody knows why it's there, and it's clearly not a correctness issue or anything else. Of course, at that point you might as well argue that the thing should not exist at all, and that such a flag should just be removed entirely. Which I certainly agree with - I think the only flag we need is inline, and I think it should mean what it damn well says. Also agreed, but there needs to start being some education about _not_ using inline so much in the kernel. Harvey -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 9 Jan 2009, Harvey Harrison wrote: Of course, at that point you might as well argue that the thing should not exist at all, and that such a flag should just be removed entirely. Which I certainly agree with - I think the only flag we need is inline, and I think it should mean what it damn well says. Also agreed, but there needs to start being some education about _not_ using inline so much in the kernel. Actually, the nice part about inline_hint would be that then we could have some nice config option like #ifdef CONFIG_FULL_CALL_TRACE #define inline_hint noinline #elif defined(CONFIG_TRUST_COMPILER) #define inline_hint /* */ #else #define inline_hint __inline #endif and now the _only_ thing we need to do is to remove the #define __inline__force_inline thing, and just agree that __inline is the native compiler meaning. We have a few users of __inline, but not very many. We can leave them alone, or just convert them to __inline__ or inline. Linus -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 2009-01-09 at 14:09 -0800, Linus Torvalds wrote: We have a few users of __inline, but not very many. We can leave them alone, or just convert them to __inline__ or inline. Actually I sent out a series of patches which mostly went in 2.6.27-28 timeframe, that's why there's a lot fewer __inline/__inline__ Other than one more block in scsi which has been hanging out in -mm for awhile, eliminating them should be pretty easy now. Harvey -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
* Linus Torvalds torva...@linux-foundation.org wrote: On Fri, 9 Jan 2009, Harvey Harrison wrote: __needs_inline? That would imply that it's for correctness reasons. .. but the point is, we have _thousands_ of inlines, and do you know which is which? We've historically forced them to be inlined, and every time somebody does that OPTIMIZE_INLINE=y, something simply _breaks_. Having watched all the inline and anti-inline activites and patches of the past decade (and having participated in many of them) my strong impression is that any non-automated way is a fundamentally inhuman Don Quijote fight. The inlining numbers me and others posted seem to support that impression. Today we have in excess of thirty thousand 'inline' keyword uses in the kernel, and in excess of one hundred thousand kernel functions. We had a decade of hundreds of inline-tuning patches that flipped inline attributes on and off, with the goal of doing that job better than the compiler. Still a sucky compiler who was never faced with this level of inlining complexity before (up to a few short months ago when we released the first kernel with non-CONFIG_BROKEN-marked CONFIG_OPTIMIZE_INLINING feature in it) manages to do a better job at judging inlining than a decade of human optimizations managed to do. (If you accept that 1% - 3% - 7.5% code size reduction in important areas of the kernel is an improvement.) That improvement is systematic: it happens regardless whether it's core kernel developers who wrote the code, with years of kernel experience - or driver developers who came from Windows and might be inexperienced about it all and might slap 'inline' on every second random function. And it's not like the compiler was not allowed to inline important functions before: all static functions in .c it can (and do) inline if it sees fit. Tens of thousands of them. If we change 'inline' back to mean 'must inline' again, we have not changed the human dynamics of inlines at all and are back on square one. 'should_inline' or 'may_inline' will be an opt-in hint that will be subject to the same kind of misjudgements that resulted in the inlining situation to begin with. In .c files it's already possible to do that: by not placing an 'inline' keyword at all, just leaving the function 'static'. may_inline/inline_hint is a longer, less known and uglier keyword. So all the cards are stacked up against this new 'may inline' mechanism, and by all likelyhood it will fizzle and never reach any sort of critical mass to truly matter. Nor should it - why should humans do this if a silly tool can achieve something rather acceptable? So such a change will in essence amount to the effective removal of CONFIG_OPTIMIZE_INLINING. If we want to do that then we should do that honestly - and remove it altogether and not pretend to care. Fedora has CONFIG_OPTIMIZE_INLINING=y enabled today - distros are always on the lookout for kernel image reductor features. As of today i'm not aware of a single Fedora bugzilla that was caused by that. The upstream kernel did have bugs due to it - we had the UML breakage for example, and an older 3.x gcc threw an internal error on one of the (stale) isdn telephony drivers. Was Chris's crash actually caused by gcc's inlining decisions? I dont think it was. Historically we had far more compiler problems with CONFIG_CC_OPTIMIZE_SIZE=y - optimizing for size is a subtly complex and non-trivial compiler pass. Ingo -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sat, 10 Jan 2009, Ingo Molnar wrote: may_inline/inline_hint is a longer, less known and uglier keyword. Hey, your choice, should you decide to accept it, is to just get rid of them entirely. You claim that we're back to square one, but that's simply the way things are. Either inline means something, or it doesn't. You argue for it meaning nothing. I argue for it meaning something. If you want to argue for it meaning nothing, then REMOVE it, instead of breaking it. It really is that simple. Remove the inlines you think are wrong. Instead of trying to change the meaning of them. Linus -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Harvey Harrison wrote: Oh yeah, and figure out what actually breaks on alpha such that they added the following (arch/alpha/include/asm/compiler.h) #ifdef __KERNEL__ /* Some idiots over in linux/compiler.h thought inline should imply always_inline. This breaks stuff. We'll include this file whenever we run into such problems. */ Does always_inline complain if the function isn't inlinable, while inline allows it? That would explain the alpha comment. -- Jamie -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sat, Jan 10, 2009 at 12:53:42AM +, Jamie Lokier wrote: Harvey Harrison wrote: Oh yeah, and figure out what actually breaks on alpha such that they added the following (arch/alpha/include/asm/compiler.h) #ifdef __KERNEL__ /* Some idiots over in linux/compiler.h thought inline should imply always_inline. This breaks stuff. We'll include this file whenever we run into such problems. */ Does always_inline complain if the function isn't inlinable, while Yes it does. inline allows it? (unless you set -Winline which the kernel doesn't) -Andi -- a...@linux.intel.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sat, 10 Jan 2009, Jamie Lokier wrote: Does always_inline complain if the function isn't inlinable, while inline allows it? That would explain the alpha comment. I suspect it dates back to gcc-3.1 days. It's from 2004. And the author of that comment is a part-time gcc hacker who was probably offended by the fact that we thought (correctly) that a lot of gcc inlining was totally broken. Since he was the main alpha maintainer, he got to do things his way there.. Linus -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sat, 10 Jan 2009, Ingo Molnar wrote: Well, it's not totally meaningless. To begin with, defining 'inline' to mean 'always inline' is a Linux kernel definition. So we already changed the behavior - in the hope of getting it right most of the time and in the hope of thus improving the kernel. Umm. No we didn't. We've never changed it. It was always inline back in the old days, and then we had to keep it always inline, which is why we override the default gcc meaning with the preprocessor. Now, OPTIMIZE_INLINING _tries_ to change the semantics, and people are complaining.. Linus -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
- Headers could probably go back to 'extern inline' again. At not small expense - we just finished moving to 'static inline'. We'd need to guarantee a library instantiation for every header include file - this is an additional mechanism with additional introduction complexities and an ongoing maintenance cost. I thought the static inline in headers should be more of a always inline. As Andrew Morton keeps yelling at me to use static inline instead of macros ;-) I do not see the point in the functions in the headers needing to have their inlines removed. - 'static inline' functions in .c files that are not used cause no build warnings - while if we change them to 'static', we get a 'defined but not used' warning. Hundreds of new warnings in the allyesconfig builds. Perhaps that's a good thing to see what functions are unused in the source. I know that because i just have removed all variants of 'inline' from all .c files of the kernel, it's a 3.5MB patch: 3263 files changed, 12409 insertions(+), 12409 deletions(-) x86 defconfig comparisons: textfilename 6875817vmlinux.always-inline ( 0.000% ) 6838290vmlinux.always-inline+remove-c-inlines ( -0.548% ) 6794474vmlinux.optimize-inlining ( -1.197% ) So the kernel's size improved by half a percent. Should i submit it? Are there cases that are must inline in that patch? Also, what is the difference if you do vmlinux.optimize-remove-c-inlines? Is there a difference there? -- Steve -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 9 Jan 2009, Harvey Harrison wrote: On Sat, 2009-01-10 at 02:01 +0100, Ingo Molnar wrote: - Headers could probably go back to 'extern inline' again. At not small expense - we just finished moving to 'static inline'. We'd need to guarantee a library instantiation for every header include file - this is an additional mechanism with additional introduction complexities and an ongoing maintenance cost. Puzzled? What benefit is there to going back to extern inline in headers? There's none. In fact, it's wrong, unless you _also_ have an extern definition (according to the new gcc rules as of back in the days). Of course, as long as inline really means _always_ inline, it won't matter. So in that sense Ingo is right - we _could_. Which has no bearing on whether we _should_, of course. In fact, the whole mess with extern inline is a perfect example of why a inlining hit should be called may_inline or inline_hint or something like that. Because then it actually makes sense to have extern may_inline with one definition, and another definition for the non-inline version. And it's very clear what the deal is about, and why we literally have two versions of the same function. But again, that's very much not a let's use 'extern' instead of 'static'. It's a totally different issue. Linus [ A third reason to use extern inline is actually a really evil one: we could do it for our unrelated issue with system call definitions on architectures that require the caller to sign-extend the arguments. Since we don't control the callers of system calls, we can't do that, and architectures like s390 actually have potential security holes due to callers that don't follow the rules. So there are different needs for trusted - in-kernel - system call users that we know do the sign extension correctly, and untrusted - user-mode callers that just call through the system call function table. What we _could_ do is for the wrappers to use extern inline int sys_open(const char *pathname, int flags, mode_t mode) { return SYS_open(pathname, mode); } which gives the C callers the right interface without any unnecessary wrapping, and then long WRAP_open(const char *pathname, long flags, long mode) { return SYS_open(pathname, flags, mode); } asm (\t.globl sys_alias\n\t.set WRAP_open); which is the one that gets linked from any asm code. So now asm code and C code gets two different functions, even though they use the same system call name - one with inline expansion, one with linker games. Whee. The games we can play (and the odd reasons we must play them). ] -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Sat, 10 Jan 2009 02:01:25 +0100 Ingo Molnar mi...@elte.hu wrote: * Linus Torvalds torva...@linux-foundation.org wrote: On Sat, 10 Jan 2009, Ingo Molnar wrote: may_inline/inline_hint is a longer, less known and uglier keyword. Hey, your choice, should you decide to accept it, is to just get rid of them entirely. You claim that we're back to square one, but that's simply the way things are. Either inline means something, or it doesn't. You argue for it meaning nothing. I argue for it meaning something. If you want to argue for it meaning nothing, then REMOVE it, instead of breaking it. It really is that simple. Remove the inlines you think are wrong. Instead of trying to change the meaning of them. Well, it's not totally meaningless. To begin with, defining 'inline' to mean 'always inline' is a Linux kernel definition. So we already changed the behavior - in the hope of getting it right most of the time and in the hope of thus improving the kernel. And now it appears that in our quest of improving the kernel we can further tweak that (already non-standard) meaning to a weak inline if the compiler agrees too hint. That gives us an even more compact kernel. It also moves the meaning of 'inline' closer to what the typical programmer expects it to be - for better or worse. We could remove them completely, but there are a couple of practical problems with that: - In this cycle alone, in the past ~2 weeks we added another 1300 inlines to the kernel. Who reviewed all that? Do we really want periodic postings of: [PATCH 0/135] inline removal cleanups ... in the next 10 years? We have about 20% of all functions in the kernel marked with 'inline'. It is a _very_ strong habit. Is it worth fighting against it? A side-effect of the inline fetish is that a lot of it goes into header files, thus requiring that those header files #include lots of other headers, thus leading to, well, the current mess. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 9 Jan 2009, H. Peter Anvin wrote: I was thinking about experimenting with this, to see what level of upside it might add. Ingo showed me numbers which indicate that a fairly significant fraction of the cases where removing inline helps is in .h files, which would require code movement to fix. Hence to see if it can be automated. We _definitely_ have too many inline functions in headers. They usually start out small, and then they grow. And even after they've grown big, it's usually not at all clear exactly where else they should go, so even when you realize that that shouldn't be inlined, moving them and making them uninlined is not obvious. And quite often, some of them go away - or at least shrink a lot - when some config option or other isn't set. So sometimes it's an inline because a certain class of people really want it inlined, simply because for _them_ it makes sense, but when you enable debugging or something, it absolutely explodes. Linus -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Linus Torvalds wrote: And quite often, some of them go away - or at least shrink a lot - when some config option or other isn't set. So sometimes it's an inline because a certain class of people really want it inlined, simply because for _them_ it makes sense, but when you enable debugging or something, it absolutely explodes. And this is really why getting static inline annotations right is really hard if not impossible in the general case (especially when considering the sheer number of architectures we compile on.) So making it possible for the compiler to do the right thing for at least this class of functions really does seem like a good idea. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 2009-01-08 at 10:09 -0500, Steven Rostedt wrote: On Thu, 8 Jan 2009, Peter Zijlstra wrote: Index: linux-2.6/kernel/sched.c === --- linux-2.6.orig/kernel/sched.c +++ linux-2.6/kernel/sched.c @@ -4672,6 +4672,72 @@ need_resched_nonpreemptible: } EXPORT_SYMBOL(schedule); +#ifdef CONFIG_SMP +/* + * Look out! owner is an entirely speculative pointer + * access and not reliable. + */ +int spin_on_owner(struct mutex *lock, struct thread_info *owner) +{ + unsigned int cpu; + struct rq *rq; + int ret = 1; + + if (unlikely(!sched_feat(OWNER_SPIN))) I would remove the unlikely, if someone turns OWNER_SPIN off, then you have the wrong decision being made. Choices by users should never be in a likely or unlikely annotation. It's discrimination ;-) in the unlikely case we schedule(), that seems expensive enough to want to make the spin case ever so slightly faster. + return 0; + + preempt_disable(); +#ifdef CONFIG_DEBUG_PAGEALLOC + /* +* Need to access the cpu field knowing that +* DEBUG_PAGEALLOC could have unmapped it if +* the mutex owner just released it and exited. +*/ + if (probe_kernel_address(owner-cpu, cpu)) + goto out; +#else + cpu = owner-cpu; +#endif + + /* +* Even if the access succeeded (likely case), +* the cpu field may no longer be valid. +*/ + if (cpu = nr_cpumask_bits) + goto out; + + /* +* We need to validate that we can do a +* get_cpu() and that we have the percpu area. +*/ + if (!cpu_online(cpu)) + goto out; Should we need to do a get_cpu or something? Couldn't the CPU disappear between these two calls. Or does it do a stop-machine and the preempt disable will protect us? Did you miss the preempt_disable() a bit up? + + rq = cpu_rq(cpu); + + for (;;) { + if (lock-owner != owner) + break; + + /* +* Is that owner really running on that cpu? +*/ + if (task_thread_info(rq-curr) != owner) + break; + + if (need_resched()) { + ret = 0; + break; + } + + cpu_relax(); + } +out: + preempt_enable_no_resched(); + return ret; +} +#endif -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 2009-01-08 at 10:28 -0500, Steven Rostedt wrote: On Thu, 8 Jan 2009, Peter Zijlstra wrote: in the unlikely case we schedule(), that seems expensive enough to want to make the spin case ever so slightly faster. OK, that makes sense, but I would comment that. Otherwise, it just looks like another misuse of the unlikely annotation. OK, sensible enough. Should we need to do a get_cpu or something? Couldn't the CPU disappear between these two calls. Or does it do a stop-machine and the preempt disable will protect us? Did you miss the preempt_disable() a bit up? No, let me rephrase it better. Does the preempt_disable protect against another CPU from going off line? Does taking a CPU off line do a stop_machine? Yes and yes. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 8 Jan 2009, Steven Rostedt wrote: + /* +* We need to validate that we can do a +* get_cpu() and that we have the percpu area. +*/ + if (!cpu_online(cpu)) + goto out; Should we need to do a get_cpu or something? Couldn't the CPU disappear between these two calls. Or does it do a stop-machine and the preempt disable will protect us? Did you miss the preempt_disable() a bit up? No, let me rephrase it better. Does the preempt_disable protect against another CPU from going off line? Does taking a CPU off line do a stop_machine? I just looked at the cpu hotplug code, and it does call stop machine. All is in order ;-) -- Steve -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 8 Jan 2009, Peter Zijlstra wrote: This was done because the interaction between trylock_slowpath and that -1000 state hurt my head. Yeah, it was a stupid hacky thing to avoid the list_empty(), but doing it explicitly is fine (we don't hold the lock, so the list isn't necessarily stable, but doing list_empty() is fine because we don't ever dereference the pointers, we just compare the pointers themselves). I shouldn't have done that hacky thing, it wasn't worth it. Linus -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 2009-01-08 at 08:58 -0800, Linus Torvalds wrote: Ok, I've gone through -v7, and I'm sure you're all shocked to hear it, but I have no complaints. *cheer*, except I guess we need to figure out what goes bad for Chris. Except that you dropped all the good commit commentary you had earlier ;) Yeah, I've yet to add that back, will do. The patch looks pretty good (except for the big #if 0 block in mutex-debug.c that I hope gets fixed, but I can't even really claim that I can be bothered), the locking looks fine (ie no locking at all), and the numbers seem pretty convinving. Oh, and I think the open-coded atomic_cmpxchg(count, 1, 0) == 1 could possibly just be replaced with a simple __mutex_fastpath_trylock(). I dunno. __mutex_fastpath_trylock() isn't always that neat -- see include/asm-generic/mutex-xchg.h -- and its a NOP on DEBUG_MUTEXES. Note how I used old_val for the list_empty() thing as well, we could possibly drop that extra condition though. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 8 Jan 2009, Steven Rostedt wrote: We keep spinning if the owner changes. I think we want to - if you have multiple CPU's and a heavily contended lock that acts as a spinlock, we still _do_ want to keep spinning even if another CPU gets the lock. And I don't even believe that is the bug. I suspect the bug is simpler. I think the need_resched() needs to go in the outer loop, or at least happen in the !owner case. Because at least with preemption, what can happen otherwise is - process A gets the lock, but gets preempted before it sets lock-owner. End result: count = 0, owner = NULL. - processes B/C goes into the spin loop, filling up all CPU's (assuming dual-core here), and will now both loop forever if they hold the kernel lock (or have some other preemption disabling thing over their down()). And all the while, process A would _happily_ set -owner, and eventually release the mutex, but it never gets to run to do either of them so. In fact, you might not even need a process C: all you need is for B to be on the same runqueue as A, and having enough load on the other CPU's that A never gets migrated away. So C might be in user space. I dunno. There are probably variations on the above. Linus -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Unrelated: On Thu, 8 Jan 2009, Chris Mason wrote: RIP: 0010:[8024f4de] [8024f4de] __cmpxchg+0x36/0x3f Ouch. HOW THE HELL DID THAT NOT GET INLINED? cmpxchg() is a _single_ instruction if it's inlined, but it's a horrible mess of dynamic conditionals on the (constant - per call-site) size argument if it isn't. It looks like you probably enabled the let gcc mess up inlining config option. Ingo - I think we need to remove that crap again. Because gcc gets the inlining horribly horribly wrong. As usual. Linus -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 8 Jan 2009, Linus Torvalds wrote: And I don't even believe that is the bug. I suspect the bug is simpler. I think the need_resched() needs to go in the outer loop, or at least happen in the !owner case. Because at least with preemption, what can happen otherwise is - process A gets the lock, but gets preempted before it sets lock-owner. End result: count = 0, owner = NULL. - processes B/C goes into the spin loop, filling up all CPU's (assuming dual-core here), and will now both loop forever if they hold the kernel lock (or have some other preemption disabling thing over their down()). And all the while, process A would _happily_ set -owner, and eventually release the mutex, but it never gets to run to do either of them so. In fact, you might not even need a process C: all you need is for B to be on the same runqueue as A, and having enough load on the other CPU's that A never gets migrated away. So C might be in user space. I dunno. There are probably variations on the above. Ouch! I think you are on to something: for (;;) { struct thread_info *owner; old_val = atomic_cmpxchg(lock-count, 1, 0); if (old_val == 1) { lock_acquired(lock-dep_map, ip); mutex_set_owner(lock); return 0; } if (old_val 0 !list_empty(lock-wait_list)) break; /* See who owns it, and spin on him if anybody */ owner = ACCESS_ONCE(lock-owner); The owner was preempted before assigning lock-owner (as you stated). if (owner !spin_on_owner(lock, owner)) break; We just spin :-( I think adding the: + if (need_resched()) + break; would solve the problem. Thanks, -- Steve cpu_relax(); } -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 8 Jan 2009, Steven Rostedt wrote: In fact, you might not even need a process C: all you need is for B to be on the same runqueue as A, and having enough load on the other CPU's that A never gets migrated away. So C might be in user space. You're right about not needing process C. I dunno. There are probably variations on the above. Ouch! I think you are on to something: for (;;) { struct thread_info *owner; old_val = atomic_cmpxchg(lock-count, 1, 0); if (old_val == 1) { lock_acquired(lock-dep_map, ip); mutex_set_owner(lock); return 0; } if (old_val 0 !list_empty(lock-wait_list)) break; /* See who owns it, and spin on him if anybody */ owner = ACCESS_ONCE(lock-owner); The owner was preempted before assigning lock-owner (as you stated). If it was the current process that preempted the owner and these are RT tasks pinned to the same CPU and the owner is of lower priority than the spinner, we have a deadlock! Hmm, I do not think the need_sched here will even fix that :-/ -- Steve -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 8 Jan 2009, Steven Rostedt wrote: Ouch! I think you are on to something: Yeah, there's somethign there, but looking at Chris' backtrace, there's nothing there to disable preemption. So if it was this simple case, it should still have preempted him to let the other process run and finish up. So I don't think Chris' softlockup is at least _exactly_ that case. There's something else going on too. That said, I do think it's a mistake for us to care about the value of spin_on_owner(). I suspect v8 should - always have if (need_resched()) break in the outer loop. - drop the return value from spin_on_owner(), and just break out if anything changes (including the need_resched() flag). - I'd also drop the old_value 0 test, and just test the list_empty() unconditionally. Aim for really simple. As to what to do about the !owner case - we do want to spin on it, but the interaction with preemption is kind of nasty. I'd hesitate to make the mutex_[un]lock() use preempt_disable() to avoid scheduling in between getting the lock and settign the owner, though - because that would slow down the normal fast-path case. Maybe we should just limit the spin on !owner to some maximal count. Linus -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 2009-01-08 at 10:16 -0800, Linus Torvalds wrote: On Thu, 8 Jan 2009, Steven Rostedt wrote: Ouch! I think you are on to something: Yeah, there's somethign there, but looking at Chris' backtrace, there's nothing there to disable preemption. So if it was this simple case, it should still have preempted him to let the other process run and finish up. My .config has no lockdep or schedule debugging and voluntary preempt. I do have CONFIG_INLINE_OPTIMIZE on, its a good name for trusting gcc I guess. So I don't think Chris' softlockup is at least _exactly_ that case. There's something else going on too. That said, I do think it's a mistake for us to care about the value of spin_on_owner(). I suspect v8 should - always have if (need_resched()) break in the outer loop. - drop the return value from spin_on_owner(), and just break out if anything changes (including the need_resched() flag). - I'd also drop the old_value 0 test, and just test the list_empty() unconditionally. I'll give the above a shot, and we can address the preempt + !owner in another rev -chris -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
* Linus Torvalds torva...@linux-foundation.org wrote: Unrelated: On Thu, 8 Jan 2009, Chris Mason wrote: RIP: 0010:[8024f4de] [8024f4de] __cmpxchg+0x36/0x3f Ouch. HOW THE HELL DID THAT NOT GET INLINED? cmpxchg() is a _single_ instruction if it's inlined, but it's a horrible mess of dynamic conditionals on the (constant - per call-site) size argument if it isn't. It looks like you probably enabled the let gcc mess up inlining config option. Ingo - I think we need to remove that crap again. Because gcc gets the inlining horribly horribly wrong. As usual. Apparently it messes up with asm()s: it doesnt know the contents of the asm() and hence it over-estimates the size [based on string heuristics] ... Which is bad - asm()s tend to be the most important entities to inline - all over our fastpaths . Despite that messup it's still a 1% net size win: textdata bss dec hex filename 7109652 1464684 802888 9377224 8f15c8 vmlinux.always-inline 7046115 1465324 802888 9314327 8e2017 vmlinux.optimized-inlining That win is mixed in slowpath and fastpath as well. I see three options: - Disable CONFIG_OPTIMIZE_INLINING=y altogether (it's already default-off) - Change the asm() inline markers to something new like asm_inline, which defaults to __always_inline. - Just mark all asm() inline markers as __always_inline - realizing that these should never ever be out of line. We might still try the second or third options, as i think we shouldnt go back into the business of managing the inline attributes of ~100,000 kernel functions. I'll try to annotate the inline asms (there's not _that_ many of them), and measure what the size impact is. Ingo -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Ingo Molnar wrote: Apparently it messes up with asm()s: it doesnt know the contents of the asm() and hence it over-estimates the size [based on string heuristics] ... Right. gcc simply doesn't have any way to know how heavyweight an asm() statement is, and it WILL do the wrong thing in many cases -- especially the ones which involve an out-of-line recovery stub. This is due to a fundamental design decision in gcc to not integrate the compiler and assembler (which some compilers do.) Which is bad - asm()s tend to be the most important entities to inline - all over our fastpaths . Despite that messup it's still a 1% net size win: textdata bss dec hex filename 7109652 1464684 802888 9377224 8f15c8 vmlinux.always-inline 7046115 1465324 802888 9314327 8e2017 vmlinux.optimized-inlining That win is mixed in slowpath and fastpath as well. The good part here is that the assembly ones really don't have much subtlety -- a function call is at least five bytes, usually more once you count in the register spill penalties -- so __always_inline-ing them should still end up with numbers looking very much like the above. I see three options: - Disable CONFIG_OPTIMIZE_INLINING=y altogether (it's already default-off) - Change the asm() inline markers to something new like asm_inline, which defaults to __always_inline. - Just mark all asm() inline markers as __always_inline - realizing that these should never ever be out of line. We might still try the second or third options, as i think we shouldnt go back into the business of managing the inline attributes of ~100,000 kernel functions. I'll try to annotate the inline asms (there's not _that_ many of them), and measure what the size impact is. The main reason to do #2 over #3 would be for programmer documentation. There simply should be no reason to ever out-of-lining these. However, documenting the reason to the programmer is a valuable thing in itself. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 2009-01-08 at 13:27 -0500, Chris Mason wrote: On Thu, 2009-01-08 at 10:16 -0800, Linus Torvalds wrote: On Thu, 8 Jan 2009, Steven Rostedt wrote: Ouch! I think you are on to something: Yeah, there's somethign there, but looking at Chris' backtrace, there's nothing there to disable preemption. So if it was this simple case, it should still have preempted him to let the other process run and finish up. My .config has no lockdep or schedule debugging and voluntary preempt. I do have CONFIG_INLINE_OPTIMIZE on, its a good name for trusting gcc I guess. The patch below isn't quite what Linus suggested, but it is working here at least. In every test I've tried so far, this is faster than the ugly btrfs spin. dbench v7.1:789mb/s dbench simple spin: 566MB/s 50 proc parallel creates v7.1:162 files/s avg sys: 1.6 50 proc parallel creates simple spin: 152 files/s avg sys: 2 50 proc parallel stat v7.1:2.3s total 50 proc parallel stat simple spin: 3.8s total It is less fair though, the 50 proc parallel creates had a much bigger span between the first and last proc's exit time. This isn't a huge shock, I think it shows the hot path is closer to a real spin lock. Here's the incremental I was using. It looks to me like most of the things that could change inside spin_on_owner mean we still want to spin. The only exception is the need_resched() flag. -chris diff --git a/kernel/mutex.c b/kernel/mutex.c index bd6342a..8936410 100644 --- a/kernel/mutex.c +++ b/kernel/mutex.c @@ -161,11 +161,13 @@ __mutex_lock_common(struct mutex *lock, long state, unsigned int subclass, return 0; } - if (old_val 0 !list_empty(lock-wait_list)) + if (!list_empty(lock-wait_list)) break; /* See who owns it, and spin on him if anybody */ owner = ACCESS_ONCE(lock-owner); + if (need_resched()) + break; if (owner !spin_on_owner(lock, owner)) break; -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 2009-01-08 at 13:14 -0500, Steven Rostedt wrote: On Thu, 8 Jan 2009, Steven Rostedt wrote: In fact, you might not even need a process C: all you need is for B to be on the same runqueue as A, and having enough load on the other CPU's that A never gets migrated away. So C might be in user space. You're right about not needing process C. I dunno. There are probably variations on the above. Ouch! I think you are on to something: for (;;) { struct thread_info *owner; old_val = atomic_cmpxchg(lock-count, 1, 0); if (old_val == 1) { lock_acquired(lock-dep_map, ip); mutex_set_owner(lock); return 0; } if (old_val 0 !list_empty(lock-wait_list)) break; /* See who owns it, and spin on him if anybody */ owner = ACCESS_ONCE(lock-owner); The owner was preempted before assigning lock-owner (as you stated). If it was the current process that preempted the owner and these are RT tasks pinned to the same CPU and the owner is of lower priority than the spinner, we have a deadlock! Hmm, I do not think the need_sched here will even fix that :-/ RT tasks could go directly to sleeping. The spinner would see them on the list and break out. -chris -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 2009-01-08 at 11:13 -0800, Linus Torvalds wrote: Well, at least we do unless you enable that broken paravirt support. I'm not at all clear on why CONFIG_PARAVIRT wants to use inferior locks, but I don't much care. Because the virtual cpu that has the ticket might not get scheduled for a while, even though another vcpu with a spinner is scheduled. The whole (para)virt is a nightmare in that respect. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 8 Jan 2009, Chris Mason wrote: On Thu, 2009-01-08 at 13:14 -0500, Steven Rostedt wrote: If it was the current process that preempted the owner and these are RT tasks pinned to the same CPU and the owner is of lower priority than the spinner, we have a deadlock! Hmm, I do not think the need_sched here will even fix that :-/ RT tasks could go directly to sleeping. The spinner would see them on the list and break out. True, we could do: if (owner) { if (!spin_on_owner(lock, owner)) break; } else if (rt_task(current)) break; That would at least solve the issue in the short term. -- Steve -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 2009-01-08 at 19:33 +0100, Ingo Molnar wrote: * Linus Torvalds torva...@linux-foundation.org wrote: snip Ingo - I think we need to remove that crap again. Because gcc gets the inlining horribly horribly wrong. As usual. Apparently it messes up with asm()s: it doesnt know the contents of the asm() and hence it over-estimates the size [based on string heuristics] ... snip That win is mixed in slowpath and fastpath as well. I see three options: - Disable CONFIG_OPTIMIZE_INLINING=y altogether (it's already default-off) I'd like to see this, leave all the heuristics out of it, if I say inline, I don't mean _maybe_, I mean you'd better damn well inline it. On the other hand, alpha seems to be hand-disabling the inline means __always_inline in their arch headers, so if this option is kept, it should be enabled on alpha as that is the current state of play there and get rid of that arch-private bit. - Change the asm() inline markers to something new like asm_inline, which defaults to __always_inline. I'd suggest just making inline always mean __always_inline. And get to work removing inline from functions in C files. This is probably also the better choice to keep older gccs producing decent code. - Just mark all asm() inline markers as __always_inline - realizing that these should never ever be out of line. We might still try the second or third options, as i think we shouldnt go back into the business of managing the inline attributes of ~100,000 kernel functions. Or just make it clear that inline shouldn't (unless for a very good reason) _ever_ be used in a .c file. Cheers, Harvey -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
The problem? Setting lock-count to 0. That will mean that the next mutex_unlock() will not necessarily enter the slowpath at all, and won't necessarily wake things up like it should. Normally we set lock-count to 0 after getting the lock, and only _inside_ the spinlock, and then we check the waiters after that. The comment says it all: /* set it to 0 if there are no waiters left: */ if (likely(list_empty(lock-wait_list))) atomic_set(lock-count, 0); and the spinning case violates that rule. The difference is that here we have it set to -1 (in the non patched code), and we have to decide if we should change that to 0. To change from -1 to 0 needs the protection of the spin locks. In the loop, we only change from 1 to 0 which is the same as the fast path, and should not cause any problems. Now, the spinning case only sets it to 0 if we saw it set to 1, so I think the argument can go something like: Yep. - if it was 1, and we _have_ seen contention, then we know that at least _one_ person that set it to 1 must have gone through the unlock slowpath (ie it wasn't just a normal locked increment. Correct. - So even if _we_ (in the spinning part of stealing that lock) didn't wake the waiter up, the slowpath wakeup case (that did _not_ wake us up, since we were spinning and hadn't added ourselves to the wait list) must have done so. Agreed. So maybe it's all really really safe, and we're still guaranteed to have as many wakeups as we had go-to-sleeps. But I have to admit that my brain hurts a bit from worrying about this. I do not think that the issue with the previous bug that Chris showed had anything to do with the actual sleepers. The slow path never changes the lock to '1', so it should not affect the spinners. We can think of the spinners as not having true contention with the lock, and are just like a: while (cond) { if (mutex_trylock(lock)) goto got_the_lock; } Sleeping mutexes are not ever simple. Now you see why in -rt we did all this in the slow path ;-) -- Steve -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Harvey Harrison wrote: We might still try the second or third options, as i think we shouldnt go back into the business of managing the inline attributes of ~100,000 kernel functions. Or just make it clear that inline shouldn't (unless for a very good reason) _ever_ be used in a .c file. The question is if that would produce acceptable quality code. In theory it should, but I'm more than wondering if it really will. It would be ideal, of course, as it would mean less typing. I guess we could try it out by disabling any inline in the current code that isn't __always_inline... -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 2009-01-08 at 17:44 -0800, H. Peter Anvin wrote: Harvey Harrison wrote: We might still try the second or third options, as i think we shouldnt go back into the business of managing the inline attributes of ~100,000 kernel functions. Or just make it clear that inline shouldn't (unless for a very good reason) _ever_ be used in a .c file. The question is if that would produce acceptable quality code. In theory it should, but I'm more than wondering if it really will. It would be ideal, of course, as it would mean less typing. I guess we could try it out by disabling any inline in the current code that isn't __always_inline... A lot of code was written assuming inline means __always_inline, I'd suggest keeping that assumption and working on removing inlines that aren't strictly necessary as there's no way to know what inlines meant 'try to inline' and what ones really should have been __always_inline. Not that I feel _that_ strongly about it. Cheers, Harvey -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, Jan 08, 2009 at 05:44:25PM -0800, H. Peter Anvin wrote: Harvey Harrison wrote: We might still try the second or third options, as i think we shouldnt go back into the business of managing the inline attributes of ~100,000 kernel functions. Or just make it clear that inline shouldn't (unless for a very good reason) _ever_ be used in a .c file. The question is if that would produce acceptable quality code. In theory it should, but I'm more than wondering if it really will. I actually often use noinline when developing code simply because it makes it easier to read oopses when gcc doesn't inline ever static (which it normally does if it only has a single caller). You know roughly where it crashed without having to decode the line number. I believe others do that too, I notice it's all over btrfs for example. -Andi -- a...@linux.intel.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, 8 Jan 2009, H. Peter Anvin wrote: Right. gcc simply doesn't have any way to know how heavyweight an asm() statement is I don't think that's relevant. First off, gcc _does_ have a perfectly fine notion of how heavy-weight an asm statement is: just count it as a single instruction (and count the argument setup cost that gcc _can_ estimate). That would be perfectly fine. If people use inline asms, they tend to use it for a reason. However, I doubt that it's the inline asm that was the biggest reason why gcc decided not to inline - it was probably the constant switch() statement. The inline function actually looks pretty large, if it wasn't for the fact that we have a constant argument, and that one makes the switch statement go away. I suspect gcc has some pre-inlining heuristics that don't take constant folding and simplifiation into account - if you look at just the raw tree of the function without taking the optimization into account, it will look big. Linus -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
From: Linus Torvalds torva...@linux-foundation.org Date: Thu, 8 Jan 2009 19:46:30 -0800 (PST) First off, gcc _does_ have a perfectly fine notion of how heavy-weight an asm statement is: just count it as a single instruction (and count the argument setup cost that gcc _can_ estimate). Actually, at least at one point, it counted the number of newline characters in the assembly string to estimate how many instructions are contained inside. It actually needs to know exaclty how many instructions are in there, to emit proper far branches and stuff like that, for some cpus. Since they never added an (optional) way to actually tell the compiler this critical piece of information, so I guess the newline hack is the best they could come up with. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Linus Torvalds wrote: First off, gcc _does_ have a perfectly fine notion of how heavy-weight an asm statement is: just count it as a single instruction (and count the argument setup cost that gcc _can_ estimate). True. It's not what it's doing, though. It looks for '\n' and ';' characters, and counts the maximum instruction size for each possible instruction. The reason why is that gcc's size estimation is partially designed to select what kind of branches it needs to use on architectures which have more than one type of branches. As a result, it tends to drastically overestimate, on purpose. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Harvey Harrison wrote: A lot of code was written assuming inline means __always_inline, I'd suggest keeping that assumption and working on removing inlines that aren't strictly necessary as there's no way to know what inlines meant 'try to inline' and what ones really should have been __always_inline. Not that I feel _that_ strongly about it. Actually, we have that reasonably well down by now. There seems to be a couple of minor tweaking still necessary, but I think we're 90-95% there already. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Fri, 9 Jan 2009 04:35:31 +0100 Andi Kleen a...@firstfloor.org wrote: On Thu, Jan 08, 2009 at 05:44:25PM -0800, H. Peter Anvin wrote: Harvey Harrison wrote: We might still try the second or third options, as i think we shouldnt go back into the business of managing the inline attributes of ~100,000 kernel functions. Or just make it clear that inline shouldn't (unless for a very good reason) _ever_ be used in a .c file. The question is if that would produce acceptable quality code. In theory it should, but I'm more than wondering if it really will. I actually often use noinline when developing code simply because it makes it easier to read oopses when gcc doesn't inline ever static (which it normally does if it only has a single caller). You know roughly where it crashed without having to decode the line number. I believe others do that too, I notice it's all over btrfs for example. Plus there is the problem where foo() { char a[1000]; } bar() { char a[1000]; } zot() { foo(); bar(); } uses 2000 bytes of stack. Fortunately scripts/checkstack.pl can find these. If someone runs it. With the right kconfig settings. And with the right compiler version. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
On Thu, Jan 08, 2009 at 07:42:48PM -0800, Linus Torvalds wrote: I actually often use noinline when developing code simply because it makes it easier to read oopses when gcc doesn't inline ever static (which it normally does if it only has a single caller) Yes. Gcc inlining is a total piece of sh*t. The static inlining by default (unfortunately) saves a lot of text size. For testing I built an x86-64 allyesconfig kernel with -fno-inline-functions-called-once (which disables the default static inlining), and it increased text size by ~4.1% (over 2MB for a allyesconfig kernel). So I think we have to keep that, dropping it would cost too much :/ -Andi -- a...@linux.intel.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html