Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
* Mike Travis [EMAIL PROTECTED] wrote: could you please send whatever .c changes you have already, so that we can have a look at how the end result will look like? Doesnt have to build, i'm just curious about how it looks like in practice, semantically. I will, and the full allyesconfig does compile. And it's basically a benign change in that the functionality is still the same. I'm currently reordering it a bit to clean it up. btw., are the resulting instructions also expected to be the same? If yes then you might want to verify it all by making sure the md5's of the .o's do not change. (If that's not possible (gcc decides to compile it a bit differently) then no big deal, just wanted to mention the possibility.) Ingo -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Ingo Molnar wrote: * Mike Travis [EMAIL PROTECTED] wrote: could you please send whatever .c changes you have already, so that we can have a look at how the end result will look like? Doesnt have to build, i'm just curious about how it looks like in practice, semantically. I will, and the full allyesconfig does compile. And it's basically a benign change in that the functionality is still the same. I'm currently reordering it a bit to clean it up. btw., are the resulting instructions also expected to be the same? If yes then you might want to verify it all by making sure the md5's of the .o's do not change. (If that's not possible (gcc decides to compile it a bit differently) then no big deal, just wanted to mention the possibility.) Ingo Well, not exactly... ;-) It does institute the new API change that specifies only pointers to cpumask's can be passed to functions and returned from functions. I really wanted the default cpumask_t to be a constant so those instances where the passed in cpumask is used as a read/write temp variable would be caught. But it started getting messy. One pain is: typedef struct __cpumask_s *cpumask_t; const cpumask_t xxx; is not the same as: typedef const struct __cpumask_s *const_cpumask_t; const_cpumask_t xxx; and I'm not exactly sure why. It came up when I tried to declare functions that returned a constant cpumask_t pointer (node_to_cpumask, cpumask_of_cpu, etc.) The other major change I'm contemplating is to remove cpumask_t completely (maybe cpumask_ptr_t?). This would force every instance of cpumask_t to be examined. (I found quite a few I had missed in my original edits when I added the task struct temp cpumask's.) Oh yeah, one question ... is current always valid? Thanks, Mike -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, 30 Sep 2008, Mike Travis wrote: One pain is: typedef struct __cpumask_s *cpumask_t; const cpumask_t xxx; is not the same as: typedef const struct __cpumask_s *const_cpumask_t; const_cpumask_t xxx; and I'm not exactly sure why. Umm. The const has different One is typedef const struct __cpumask_s *const_cpumask_t; which becomes (const struct __cpumask_s) * while the other is const cpumask_t xxx which is const (struct __cpumask_s *) and if you look a bit more closely, you'll see that they are _obviously_ not the same thing at all. Quite frankly, I personally do hate typedefs that end up being pointers, and used as pointers, without showing that in the source code. When you do type_t a; fn(a); I expect the code to essentially do a pass-by-value. But when the type_t is a pointer, that doesn't really work. Your issue with 'const' is just another version of the same. You don't want the _pointer_ to be const, you want what it points _to_ to be const. But because you hid the pointerness inside the typedef, you simply cannot do that. The problem with cpumask's, of course, is that for the small mask case, we really don't want it to be a pointer. So now it's sometimes a pointer and sometimes not. The typedef hides that, and I understand why it's a good idea, but I'm surprised you didn't understand what the implications were for 'const', and I'm now a bit more leery about this whole thing just because the typedef ends up hiding so much - it doesn't just hide the basic type, it hides a very basic *code* issue. Linus -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Linus Torvalds wrote: On Tue, 30 Sep 2008, Mike Travis wrote: One pain is: typedef struct __cpumask_s *cpumask_t; const cpumask_t xxx; is not the same as: typedef const struct __cpumask_s *const_cpumask_t; const_cpumask_t xxx; and I'm not exactly sure why. Umm. The const has different One is typedef const struct __cpumask_s *const_cpumask_t; which becomes (const struct __cpumask_s) * while the other is const cpumask_t xxx which is const (struct __cpumask_s *) and if you look a bit more closely, you'll see that they are _obviously_ not the same thing at all. Thanks, yes that explains what I should have figured out. (Nice way to explain it btw... ;-) Quite frankly, I personally do hate typedefs that end up being pointers, and used as pointers, without showing that in the source code. When you do type_t a; fn(a); I expect the code to essentially do a pass-by-value. But when the type_t is a pointer, that doesn't really work. I agree, and as I mentioned, Rusty was working towards an alternative method of declaring cpumask's which does not hide this. My goal was to create an (apparent) opaque handle for cpumask_t and modify the code so all changes to the contents of the cpumask are via functions. Your issue with 'const' is just another version of the same. You don't want the _pointer_ to be const, you want what it points _to_ to be const. But because you hid the pointerness inside the typedef, you simply cannot do that. The problem with cpumask's, of course, is that for the small mask case, we really don't want it to be a pointer. So now it's sometimes a pointer and sometimes not. The typedef hides that, and I understand why it's a good idea, but I'm surprised you didn't understand what the implications were for 'const', and I'm now a bit more leery about this whole thing just because the typedef ends up hiding so much - it doesn't just hide the basic type, it hides a very basic *code* issue. Perhaps then defining the cpumask as a list of unsigned longs (remove the outer struct) would play more naturally. Lists by definition are always referenced by pointers. I have another version of the patchset that shows this and I'll post just the cpumask.h and doc files. Linus Thanks! Mike -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Wednesday 01 October 2008 02:46:59 Linus Torvalds wrote: Quite frankly, I personally do hate typedefs that end up being pointers, and used as pointers, without showing that in the source code. ... I'm now a bit more leery about this whole thing just because the typedef ends up hiding so much - it doesn't just hide the basic type, it hides a very basic *code* issue. Yes, this is why my version of the rework moved away from typedefs, except for the special case of cpumask_var_t for stack vars where this trick is really desired. Everywhere else, the code becomes nice and clear: struct cpumask *. Cheers, Rusty. -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Ingo Molnar wrote: * Mike Travis [EMAIL PROTECTED] wrote: Hi Rusty, I've gotten some good traction on the changes in the following patch. About 30% of the kernel is compiling right now and I'm picking up errors and warnings as I'm going along. I think it's doing most of what we need. Attempting to hide the cpumask struct definition caused all kinds of problems with the inline functions and statically declaring cpumask's. (The following patch is a combination of all the changes to cpumask.h with the header from the first patch. I'll send you a complete copy in separate email.) could you please send whatever .c changes you have already, so that we can have a look at how the end result will look like? Doesnt have to build, i'm just curious about how it looks like in practice, semantically. Ingo I will, and the full allyesconfig does compile. And it's basically a benign change in that the functionality is still the same. I'm currently reordering it a bit to clean it up. Thanks, Mike -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
* Mike Travis [EMAIL PROTECTED] wrote: Hi Rusty, I've gotten some good traction on the changes in the following patch. About 30% of the kernel is compiling right now and I'm picking up errors and warnings as I'm going along. I think it's doing most of what we need. Attempting to hide the cpumask struct definition caused all kinds of problems with the inline functions and statically declaring cpumask's. (The following patch is a combination of all the changes to cpumask.h with the header from the first patch. I'll send you a complete copy in separate email.) could you please send whatever .c changes you have already, so that we can have a look at how the end result will look like? Doesnt have to build, i'm just curious about how it looks like in practice, semantically. Ingo -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Rusty Russell wrote: On Friday 26 September 2008 01:42:13 Linus Torvalds wrote: On Thu, 25 Sep 2008, Rusty Russell wrote: This turns out to be awful in practice, mainly due to const. Consider: #ifdef CONFIG_CPUMASK_OFFSTACK typedef unsigned long *cpumask_t; #else typedef unsigned long cpumask_t[1]; #endif cpumask_t returns_cpumask(void); No. That's already broken. You cannot return a cpumask_t, regardless of interface. We must not do it regardless of how we pass those things around, since it generates _yet_ another temporary on the stack for the return slot for any kind of structure. No, for large NR_CPUS, cpumask_t is a pointer as shown. And we have numerous basic functions which return a cpumask_t. Yes, this is part of the problem. What _is_ relevant is how we allocate them when we need temporary CPU masks. And _that_ is where my suggestion comes in. For small NR_CPUS, we really do want to allocate them on the stack, because calling kmalloc for a 4- or 8-byte allocation is just _stupid_. Right, but cpumask_t is used for far more than stack decls, thus the problems. I can make a separate cpumask_stack_t and use your method tho. I think that might even reduce churn and allow us to do this in parts. which has to be converted some way. And I think it needs to be converted in a way that does *not* force us to call kmalloc() for idiotically small values. Yeah, got that. But your suggestion to change cpumask_t turned out horribly ugly. Cheers, Rusty. Hi Rusty, I've gotten some good traction on the changes in the following patch. About 30% of the kernel is compiling right now and I'm picking up errors and warnings as I'm going along. I think it's doing most of what we need. Attempting to hide the cpumask struct definition caused all kinds of problems with the inline functions and statically declaring cpumask's. (The following patch is a combination of all the changes to cpumask.h with the header from the first patch. I'll send you a complete copy in separate email.) Thanks, Mike -- Subject: [RFC 1/1] cpumask: Provide new cpumask API Provide new cpumask interface API. The relevant change is basically cpumask_t becomes an opaque object. I believe this results in the minimum amount of editing while still allowing the inline cpumask functions, and the ability to declare static cpumask objects. /* raw declaration */ struct __cpumask_data_s { DECLARE_BITMAP(bits, NR_CPUS); }; /* cpumask_map_t used for declaring static cpumask maps */ typedef struct __cpumask_data_s cpumask_map_t[1]; /* cpumask_t used for function args and return pointers */ typedef struct __cpumask_data_s *cpumask_t; /* cpumask_var_t used for local variable, definition follows */ typedef struct __cpumask_data_s cpumask_var_t[1]; /* SMALL NR_CPUS */ typedef struct __cpumask_data_s *cpumask_var_t; /* LARGE NR_CPUS */ /* replaces cpumask_t dst = (cpumask_t)src */ void cpus_copy(cpumask_t dst, const cpumask_t src); Remove the '*' indirection in all references to cpumask_t objects. You can change the reference to the cpumask object but not the cpumask object itself without using the functions that operate on cpumask objects (f.e. the cpu_* operators). Functions can return a cpumask_t (which is a pointer to the cpumask object) and only be passed a cpumask_t. All uses of cpumask_t on the stack are changed to be cpumask_var_t except for pointers to static cpumask objects. Allocation of local (temp) cpumask objects will follow... All cpumask operators now operate using nr_cpu_ids instead of NR_CPUS. All variants of the cpumask operators which used nr_cpu_ids instead of NR_CPUS are deleted. All variants of functions which use the (old cpumask_t *) pointer are deleted (f.e. set_cpus_allowed_ptr()). Based on code from Rusty Russell [EMAIL PROTECTED] (THANKS!!) Signed-of-by: Mike Travis [EMAIL PROTECTED] --- struct-cpumasks.orig/include/linux/cpumask.h2008-09-25 20:40:59.303546951 -0700 +++ struct-cpumasks/include/linux/cpumask.h 2008-09-25 22:41:00.764472541 -0700 @@ -3,7 +3,8 @@ /* * Cpumasks provide a bitmap suitable for representing the - * set of CPU's in a system, one bit position per CPU number. + * set of CPU's in a system, one bit position per CPU number up to + * nr_cpu_ids (= NR_CPUS). * * See detailed comments in the file linux/bitmap.h describing the * data type on which these cpumasks are based. @@ -18,18 +19,6 @@ * For details of cpus_fold(), see bitmap_fold in lib/bitmap.c. * * . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . - * Note: The alternate operations with the suffix _nr are used - * to limit the range of the loop to nr_cpu_ids instead of - * NR_CPUS when NR_CPUS 64 for performance reasons. - * If NR_CPUS is = 64 then most assembler bitmask - * operators execute faster with a constant range, so - *
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Wednesday 27 August 2008 02:51:46 Linus Torvalds wrote: On Tue, 26 Aug 2008, Yinghai Lu wrote: wonder if could use unsigned long * directly. I would actually suggest something like this: - we continue to have a magic cpumask_t. - we do different cases for big and small NR_CPUS: #if NR_CPUS = BITS_PER_LONG /* * Make it an array - that way passing it as an argument will * always pass it as a pointer! */ typedef unsigned long cpumask_t[1]; static inline void create_cpumask(cpumask_t *p) { *p = 0; } static inline void free_cpumask(cpumask_t *p) { } #else typedef unsigned long *cpumask_t; static inline void create_cpumask(cpumask_t *p) { *p = kcalloc(..); } static inline void free_cpumask(cpumask_t *p) { kfree(*p); } #endif and now after you do this, you can just do something like cpumask_t mycpu; create_cpumask(mycpu); .. free_cpumask(mycpu); and in between, you can use 'cpumask' as a pointer, because even when it is an array directly allocated on the stack, the array can always degenerate into a pointer by C type rules! Hi Linus, This turns out to be awful in practice, mainly due to const. Consider: #ifdef CONFIG_CPUMASK_OFFSTACK typedef unsigned long *cpumask_t; #else typedef unsigned long cpumask_t[1]; #endif cpumask_t returns_cpumask(void); That's obviously illegal if cpumask_t is an array. So we need a typedef which says really always a pointer. typedef unsigned long *cpumask_return_t; cpumask_return_t returns_cpumask(void); But we usually want it to return a const ptr, and this doesn't work: const cpumask_return_t returns_cpumask(void); foo.c:12: warning: type qualifiers ignored on function return type So now we need: typedef const unsigned long *cpumask_const_return_t; cpumask_const_return_t returns_cpumask(void); OK, now consider a function which wants to take a const cpu bitmap: void cpus_copy(cpumask_t dst, const cpumask_t src); ... cpus_copy(cpus, returns_cpumask()); foo.c:34: warning: passing argument 2 of ‘cpus_copy’ discards qualifiers from pointer target type Oops, that didn't work with the pointer version. So we need another typedef: #ifdef CONFIG_CPUMASK_OFFSTACK typedef const unsigned long *cpumask_const_t; #else typedef const unsigned long cpumask_const_t[1]; #endif void cpus_copy(cpumask_t dst, cpumask_const_t src); We end up with this: #ifdef CONFIG_CPUMASK_OFFSTACK typedef unsigned long *cpumask_t; typedef const unsigned long *cpumask_const_t; #else typedef unsigned long cpumask_t[1]; typedef const unsigned long cpumask_const_t[1]; #endif typedef unsigned long *cpumask_return_t; typedef const unsigned long *cpumask_const_return_t; typedef unsigned long cpumask_data_t[1]; I can't see a neater way down this path, and I don't want to lose const. I can see three alternatives: 1) An ONSTACK_CPUMASK(name) macro which declares struct cpumask name[1] or struct cpumask *name. Same idea as yours, without the typedef. 2) Use a normal struct for cpumask, make everyone use pointers, but have an struct cpumask *alloc_stack_cpumask() which uses alloca() for small NR_CPUS. 3) Same, but just use kmalloc everywhere. Optimize important cases by hand. Anyone see a better way? Rusty. -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Linus Torvalds wrote: On Fri, 29 Aug 2008, Jes Sorensen wrote: I have only tested this on ia64, but it boots, so it's obviously perfecttm :-) Well, it probably boots because it doesn't really seem to _change_ much of anything. Hi Linus, I realize that, but as I have been doing this work on ia64, I didn't want to mess too much with the x86 code. The ia64 part of the patch does change things :-) If someone with more x86 knowledge would want to try and make that part of the patch better, and more importantly test it, I'd be quite keen on helping out. Cheers, Jes Things like this: -static inline void arch_send_call_function_ipi(cpumask_t mask) +static inline void arch_send_call_function_ipi(cpumask_t *mask) { - smp_ops.send_call_func_ipi(mask); + smp_ops.send_call_func_ipi(*mask); } will still do that stack allocation at the time of the call. You'd have to pass the thing all the way down as a pointer.. Linus -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
David Miller wrote: Otherwise you have to modify cpumask_t objects and thus pluck them onto the stack where they take up silly amounts of space. Yes, I had proposed either modifying, or supplementing a new smp_call function to pass the cpumask_t as a pointer (similar to set_cpus_allowed_ptr.) But an ABI change such as this was not well received at the time. What it seems to come down to is that any cpumask_t not inside of a dynamically allocated object should be marked const. And that is something we can enforce at compile time. Linus has just suggested dynamically allocating cpumask_t's for such cases but I don't see that as the fix either. Just mark them const and enforce that cpumask_t objects can only be modified when they appear in dynamically allocated objects. Dave and others, Sorry if I jump into the middle of the thread. Stopped subscribing to lkml for a while, so this is through the archives. Ran into some of these issues with KVM too, and noticed just how much we pass cpumask_t around in the smp_call functions :-( In fact, the only arch that did pretty well on this front was sparc64. I totally agree, that marking them const makes a ton of sense, but at the same time I suggest we convert smp_call_function_mask() to take a pointer to the cpumask_t. I whipped up the following patch, which cuts down the amont of memcpy calls emitted quite a fair bit. I have only tested this on ia64, but it boots, so it's obviously perfecttm :-) Comments, suggestions welcome. I have a followup patch that makes virt/kvm/kvm_main.c use the new interface. Cheers, Jes Change smp_call_function_mask() to take a pointer to the cpumask_t rather than passing it by value. This avoids recursive copies of the cpumask_t on the stack in the IPI call. For large NR_CPUS, this is particularly bad, and the cost of doing this for NR_CPUS bits_per_long is negligeble. Signed-off-by: Jes Sorensen [EMAIL PROTECTED] --- arch/alpha/include/asm/smp.h|2 +- arch/alpha/kernel/smp.c |4 ++-- arch/arm/include/asm/smp.h |2 +- arch/arm/kernel/smp.c |4 ++-- arch/ia64/include/asm/smp.h |2 +- arch/ia64/kernel/smp.c |6 +++--- arch/m32r/kernel/smp.c |4 ++-- arch/mips/kernel/smp.c |4 ++-- arch/parisc/kernel/smp.c|6 +++--- arch/powerpc/include/asm/smp.h |2 +- arch/powerpc/kernel/smp.c |4 ++-- arch/sh/include/asm/smp.h |2 +- arch/sh/kernel/smp.c|4 ++-- arch/sparc/include/asm/smp_64.h |2 +- arch/sparc64/kernel/smp.c |4 ++-- include/asm-m32r/smp.h |2 +- include/asm-mips/smp.h |2 +- include/asm-parisc/smp.h|2 +- include/asm-x86/smp.h |4 ++-- include/linux/smp.h |2 +- kernel/smp.c| 15 --- virt/kvm/kvm_main.c |4 ++-- 22 files changed, 42 insertions(+), 41 deletions(-) Index: linux-2.6.git/arch/alpha/include/asm/smp.h === --- linux-2.6.git.orig/arch/alpha/include/asm/smp.h +++ linux-2.6.git/arch/alpha/include/asm/smp.h @@ -48,7 +48,7 @@ #define cpu_possible_map cpu_present_map extern void arch_send_call_function_single_ipi(int cpu); -extern void arch_send_call_function_ipi(cpumask_t mask); +extern void arch_send_call_function_ipi(cpumask_t *mask); #else /* CONFIG_SMP */ Index: linux-2.6.git/arch/alpha/kernel/smp.c === --- linux-2.6.git.orig/arch/alpha/kernel/smp.c +++ linux-2.6.git/arch/alpha/kernel/smp.c @@ -637,9 +637,9 @@ send_ipi_message(to_whom, IPI_CPU_STOP); } -void arch_send_call_function_ipi(cpumask_t mask) +void arch_send_call_function_ipi(cpumask_t *mask) { - send_ipi_message(mask, IPI_CALL_FUNC); + send_ipi_message(*mask, IPI_CALL_FUNC); } void arch_send_call_function_single_ipi(int cpu) Index: linux-2.6.git/arch/arm/include/asm/smp.h === --- linux-2.6.git.orig/arch/arm/include/asm/smp.h +++ linux-2.6.git/arch/arm/include/asm/smp.h @@ -102,7 +102,7 @@ extern void platform_cpu_enable(unsigned int cpu); extern void arch_send_call_function_single_ipi(int cpu); -extern void arch_send_call_function_ipi(cpumask_t mask); +extern void arch_send_call_function_ipi(cpumask_t *mask); /* * Local timer interrupt handling function (can be IPI'ed). Index: linux-2.6.git/arch/arm/kernel/smp.c === --- linux-2.6.git.orig/arch/arm/kernel/smp.c +++ linux-2.6.git/arch/arm/kernel/smp.c @@ -356,9 +356,9 @@ local_irq_restore(flags); } -void arch_send_call_function_ipi(cpumask_t mask) +void arch_send_call_function_ipi(cpumask_t *mask) { - send_ipi_message(mask, IPI_CALL_FUNC); + send_ipi_message(*mask, IPI_CALL_FUNC); } void
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Fri, 29 Aug 2008, Jes Sorensen wrote: I have only tested this on ia64, but it boots, so it's obviously perfecttm :-) Well, it probably boots because it doesn't really seem to _change_ much of anything. Things like this: -static inline void arch_send_call_function_ipi(cpumask_t mask) +static inline void arch_send_call_function_ipi(cpumask_t *mask) { - smp_ops.send_call_func_ipi(mask); + smp_ops.send_call_func_ipi(*mask); } will still do that stack allocation at the time of the call. You'd have to pass the thing all the way down as a pointer.. Linus -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
From: Linus Torvalds [EMAIL PROTECTED] Date: Fri, 29 Aug 2008 09:14:44 -0700 (PDT) Well, it probably boots because it doesn't really seem to _change_ much of anything. Things like this: -static inline void arch_send_call_function_ipi(cpumask_t mask) +static inline void arch_send_call_function_ipi(cpumask_t *mask) { - smp_ops.send_call_func_ipi(mask); + smp_ops.send_call_func_ipi(*mask); } will still do that stack allocation at the time of the call. You'd have to pass the thing all the way down as a pointer.. True, but we have to get there one step at a time. BTW, sparc64 already wants a pointer here, so it's completely ready for this: void arch_send_call_function_ipi(cpumask_t mask) { xcall_deliver((u64) xcall_call_function, 0, 0, mask); } -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Thu, Aug 28, 2008 at 09:32:13AM +0900, Paul Mundt wrote: On Wed, Aug 27, 2008 at 08:35:44PM +0300, Adrian Bunk wrote: On Thu, Aug 28, 2008 at 01:00:52AM +0900, Paul Mundt wrote: On Wed, Aug 27, 2008 at 02:58:30PM +0300, Adrian Bunk wrote: In addition to that, debugging the runaway stack users on 4k tends to be easier anyways since you end up blowing the stack a lot sooner. On sh we've had pretty good luck with it, though most of our users are using fairly deterministic workloads and continually profiling the footprint. Anything that runs away or uses an insane amount of stack space needs to be fixed well before that anyways, so catching it sooner is always preferable. I imagine the same case is true for m68knommu (even sans IRQ stacks). CONFIG_DEBUG_STACKOVERFLOW should give you the same information, and if wanted with an arbitrary limit. In some cases, yes. In the CONFIG_DEBUG_STACKOVERFLOW case the check is only performed from do_IRQ(), which is sporadic at best, especially on tickless. While it catches some things, it's not a complete solution in and of iteslf. In addition to this, there are even fewer platforms that support it than there are platforms that do 4k stacks. At first glance, it looks like it's only m32r, powerpc, sh, x86, and xtensa. ... As far as I can see the only architectures that optionally offer 4kB stacks today are m68knommu, s390, sh and x86. Did I miss some architectures or is 5 4 ;) ? Others support the Kconfig option, but don't seem to realize that it's not an option that the kernel does anything with by itself, and so don't actually do anything (ie, FRV). Unless I miss anything these others include only FRV. IMHO there seems to currently be a mismatch between it's maintainance cost and the actual number of users. That's in my opinion the main problem with it, no matter in which direction it gets resolved. Perhaps that's true on x86, but in general I take issue with that. On sh we've had to do very little maintenance for it and most shipping products are using it today (at least on MMU-Linux, we don't bother with it on nommu). Most of the problems we ran in to with 4k stacks tended to be stuff that we wanted to fix for 8k anyways. I suspect that this case is true for the other embedded platforms also. ... Most stack issues are not platform or architecture specific. The maintainance effort therefore mostly depends on whether a non-zero number of architectures uses 4kB stacks. And if something is considered to be important for small embedded systems, but not supported on ARM, MIPS or PowerPC, then that's a bit strange. cu Adrian -- Is there not promise of rain? Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. Only a promise, Lao Er said. Pearl S. Buck - Dragon Seed -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Linus Torvalds writes: 4kB used to be the _only_ choice. And no, there weren't even irq stacks. So that 4kB was not just the whole kernel call-chain, it was also all the irq nesting above it. I think your memory is failing you. In 2.4 and earlier, the kernel stack was 8kB minus the size of the task_struct, which sat at the start of the 8kB. For instance, from include/asm-i386/processor.h for 2.4.29: #define THREAD_SIZE (2*PAGE_SIZE) #define alloc_task_struct() ((struct task_struct *) __get_free_pages(GFP_KERNEL,1)) #define free_task_struct(p) free_pages((unsigned long) (p), 1) Paul. -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Wednesday 27 August 2008 06:01, Mike Travis wrote: Dave Jones wrote: ... But yes, for this to be even remotely feasible, there has to be a negligable performance cost associated with it, which right now, we clearly don't have. Given that the number of people running 4096 CPU boxes even in a few years time will still be tiny, punishing the common case is obviously absurd. Dave I did do some fairly extensive benchmarking between configs of NR_CPUS = 128 and 4096 and most performance hits were in the neighborhood of 5% on systems with 8 cpus and 4GB of memory (our most common test system). 5% is a pretty nasty performance hit... what sort of benchmarks are we talking about here? I just made some pretty crazy changes to the VM to get only around 5 or so % performance improvement in some workloads. What places are making heavy use of cpumasks that causes such a slowdown? Hopefully callers can mostly be improved so they don't need to use cpumasks for common cases. Until then, it would be kind of sad for a distro to ship a generic x86 kernel and lose 5% performance because it is set to 4096 CPUs... But if I misunderstand and you're talking about specific microbenchmarks to find the worst case for huge cpumasks, then I take that back. [But changing cpumask_t's to be pointers instead of values will likely increase this.] I've tried to be very sensitive to this issue with all my previous changes, so convincing the distros to set NR_CPUS=4096 would be as painless for them as possible. ;-) Btw, huge count cpu systems I don't think are that far away. I believe the nextgen Larabbee chips will be geared towards HPC applications [instead of just GFX apps], and putting 4 of these chips on a motherboard would add up to 512 cpu threads (1024 if they support hyperthreading.) It would be quite interesting if they make them cache coherent / MP capable. Will they be? -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Wednesday 27 August 2008 17:05, David Miller wrote: From: Nick Piggin [EMAIL PROTECTED] Date: Wed, 27 Aug 2008 16:54:32 +1000 5% is a pretty nasty performance hit... what sort of benchmarks are we talking about here? I just made some pretty crazy changes to the VM to get only around 5 or so % performance improvement in some workloads. What places are making heavy use of cpumasks that causes such a slowdown? Hopefully callers can mostly be improved so they don't need to use cpumasks for common cases. It's almost certainly from the cross-call dispatch call chain. As just one example, just to do a TLB flush mm-cpu_vm_mask probably gets passed around as an aggregate two or three times on the way down to the APIC programming code on x86. That's two or three 512 byte copies on the stack :) Yeah, I see. That's stupid isn't it? (Well, I guess it was completely sane when cpumasks were word sized ;)) Hopefully that accounts for a significant chunk... -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, 2008-08-26 at 18:54 -0400, Parag Warudkar wrote: On Tue, Aug 26, 2008 at 5:04 PM, Linus Torvalds [EMAIL PROTECTED] wrote: And embedded people (the ones that might care about 1% code size) are the ones that would also want smaller stacks even more! This is something I never understood - embedded devices are not going to run more than a few processes and 4K*(Few Processes) IMHO is not worth a saving now a days even in embedded world given falling memory prices. Or do I misunderstand? Falling prices are no reason to increase the amount of available RAM (or other hardware). Especially if you (intend to) build 1E5 devices - where every Euro counts. Bernd -- Firmix Software GmbH http://www.firmix.at/ mobil: +43 664 4416156 fax: +43 1 7890849-55 Embedded Linux Development and Services -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
You have a good point that aiming at 4kB makes 8kB a very safe choice. Not really no - we use separate IRQ stacks in 4K but not 8K mode on x86-32. That means you've actually got no more space if you are unlucky with the timing of events. The 8K mode is merely harder to debug. If 4K stacks really are not safe then x86-32 really really needs to switch to using IRQ stacks in 8K stack mode as well. Alan -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, 2008-08-26 at 22:16 -0400, Parag Warudkar wrote: [...] Well, sure - but the industry as a whole seems to have gone the other The industry as a whole doesn't exist on that low level. You can't compare the laptop and/or desktop computer market (where one may buy today hardware that runs in 3 years with the next generation/release of the OS and applications) with the e.g. WLAN router market where - from the commercial point of view - every Euro counts (and where the requirements for the lifetime of the device are long frozen before the thing gets in a shop). way - do more with more at the similar or lower price points! By that definition of less is better we should try and make the kernel memory pageable (or has someone already done that?) - Windows does it, That doesn't help as in really small devices (like WLAN routers, cable modems, etc.) you run without any means of paging/swapping. And even binaries/read-only files are not necessarily executable in place (but must be loaded into RAM). So you can't flush these pages. And pageable kernel memory doesn't come for free - even if one only counts the increased code and it's complexity. by default ;) Which is more a sign that it is probably a very bad idea. Bernd -- Firmix Software GmbH http://www.firmix.at/ mobil: +43 664 4416156 fax: +43 1 7890849-55 Embedded Linux Development and Services -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
From: Nick Piggin [EMAIL PROTECTED] Date: Wed, 27 Aug 2008 17:47:14 +1000 Yeah, I see. That's stupid isn't it? (Well, I guess it was completely sane when cpumasks were word sized ;)) Hopefully that accounts for a significant chunk... There is a lot of indirect costs that are hard to see as well. Two things a lot of these cross-call dispatch paths do is: 1) Clear self-cpu 2) AND with cpus_online #1 can normally be a simple bit clear, but some places can also implement this with something like cpus_andn(X, cpumask_of_cpu(cpu)) It's simply easier to move those two things down to the bottom of the APIC programming code, they just loop over the cpumask doing an expensive APIC I/O operation anyways, might as well overlap it with these skip self-cpu and skip not-online cpus checks. And oh yeah we get the stack wastage fixed too, isn't what what we were talking about? :-) -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
What about deep call chains? The problem with the uptake of 4K stacks seems to be that is not reliably provable that it will work under all circumstances. On x86-32 with 8K stacks your IRQ paths share them so that is even harder to prove (not that you can prove any of them) and the bugs are more obscure and random. -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, Aug 26, 2008 at 05:28:37PM -0700, Linus Torvalds wrote: On Wed, 27 Aug 2008, Adrian Bunk wrote: When did we get callpaths like like nfs+xfs+md+scsi reliably working with 4kB stacks on x86-32? XFS may never have been usable, but the rest, sure. And you seem to be making this whole argument an excuse to SUCK, adn an excuse to let gcc crap even more on our stack space. Why? Why aren't you saying that we should be able to do better? Instead, you seem to asking us to do even worse than we do now? My main point is: - getting 4kB stacks working reliably is a hard task - having an eye on gcc increasing the stack usage, and fixing it if required, is relatively easy If we should be able to do better at getting (and keeping) 4kB stacks working, then coping with possible inlining problems caused by gcc should not be a big problem for us. Linus cu Adrian -- Is there not promise of rain? Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. Only a promise, Lao Er said. Pearl S. Buck - Dragon Seed -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Wed, Aug 27, 2008 at 4:25 AM, Alan Cox [EMAIL PROTECTED] wrote: You have a good point that aiming at 4kB makes 8kB a very safe choice. Not really no - we use separate IRQ stacks in 4K but not 8K mode on x86-32. That means you've actually got no more space if you are unlucky with the timing of events. The 8K mode is merely harder to debug. By your logic though, XFS on x86 should work fine with 4K stacks - many will attest that it does not and blows up due to stack issues. I have first hand experiences of things blowing up with deep call chains when using 4K stacks where 8K worked just fine on same workload. So there is definitely some other problem with 4K stacks. Thanks Parag -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Wed, Aug 27, 2008 at 5:00 AM, Bernd Petrovitsch [EMAIL PROTECTED] wrote: They probably gave the idea pretty soon because you need to rework/improve large parts of the kernel + drivers (and that has two major problems - it consumes a lot of man power for no new features and everything must be completely tested again[0] and it adds new risks). And that is practically impossible if one sells stable driver APIs for 3rd party (commercial) drivers because these must be changed too. But not many embedded Linux arches support 4K stacks like Adrian pointed out earlier. So the same (lot of man power requirement) would apply to Linux. Sure it will be good - but how reasonable it is to attempt it and how reliably it will work under all conceived loads - those are the questions. Thanks Parag -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Wed, 2008-08-27 at 08:56 -0400, Parag Warudkar wrote: On Wed, Aug 27, 2008 at 5:00 AM, Bernd Petrovitsch [EMAIL PROTECTED] wrote: They probably gave the idea pretty soon because you need to rework/improve large parts of the kernel + drivers (and that has two major problems - it consumes a lot of man power for no new features and everything must be completely tested again[0] and it adds new risks). And that is practically impossible if one sells stable driver APIs for 3rd party (commercial) drivers because these must be changed too. But not many embedded Linux arches support 4K stacks like Adrian What is an embedded Linux arch? Personally I encountered i386, ARM, MIPS and PPC in the embedded world. pointed out earlier. So the same (lot of man power requirement) would apply to Linux. Of course. Look at the amount of work done by lots of people in that area (including stack frame size reductions) and on-going discussions. Sure it will be good - but how reasonable it is to attempt it and how reliably it will work under all conceived loads - those are the questions. If you develop an embedded system (which is partly system integration of existing apps) to be installed in the field, you don't have that many conceivable work loads compared to a desktop/server system. And you have a fixed list of drivers and applications. A usual approach is to run stress tests on several (or all) subsystems/services/... in parallel and if the device survives it functioning correctly, it is at least good enough. Bernd -- Firmix Software GmbH http://www.firmix.at/ mobil: +43 664 4416156 fax: +43 1 7890849-55 Embedded Linux Development and Services -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
By your logic though, XFS on x86 should work fine with 4K stacks - many will attest that it does not and blows up due to stack issues. I have first hand experiences of things blowing up with deep call chains when using 4K stacks where 8K worked just fine on same workload. So there is definitely some other problem with 4K stacks. Nothing of the sort. If it blows up with a 4K stack it will almost certainly blow up with an 8K stack *eventually* - when a heavy stack usage coincides with a heavy stack using IRQ handler. You won't catch it in simple testing, you won't catch it in trivial simulation and it'll be incredibly hard to reproduce. Not the kind of bug you want in a production system really. IRQ stacks make things much more predictable. Alan -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
David Miller wrote: From: Nick Piggin [EMAIL PROTECTED] Date: Wed, 27 Aug 2008 16:54:32 +1000 5% is a pretty nasty performance hit... what sort of benchmarks are we talking about here? I just made some pretty crazy changes to the VM to get only around 5 or so % performance improvement in some workloads. What places are making heavy use of cpumasks that causes such a slowdown? Hopefully callers can mostly be improved so they don't need to use cpumasks for common cases. It's almost certainly from the cross-call dispatch call chain. As just one example, just to do a TLB flush mm-cpu_vm_mask probably gets passed around as an aggregate two or three times on the way down to the APIC programming code on x86. That's two or three 512 byte copies on the stack :) Look at the sparc64 SMP code for how I solved the problem there. I will, thanks! Mike -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
David Miller wrote: From: Nick Piggin [EMAIL PROTECTED] Date: Wed, 27 Aug 2008 17:47:14 +1000 Yeah, I see. That's stupid isn't it? (Well, I guess it was completely sane when cpumasks were word sized ;)) Hopefully that accounts for a significant chunk... There is a lot of indirect costs that are hard to see as well. Two things a lot of these cross-call dispatch paths do is: 1) Clear self-cpu 2) AND with cpus_online #1 can normally be a simple bit clear, but some places can also implement this with something like cpus_andn(X, cpumask_of_cpu(cpu)) It's simply easier to move those two things down to the bottom of the APIC programming code, they just loop over the cpumask doing an expensive APIC I/O operation anyways, might as well overlap it with these skip self-cpu and skip not-online cpus checks. And oh yeah we get the stack wastage fixed too, isn't what what we were talking about? :-) Yes, the most time consuming part was determining whether a kmalloc could safely be used in the context of the function, and what to do about the out-of-memory problem. Pushing that down to something like: for_each_cpu_thats_online(cpu, *maskptr) would remove the need for many of the temp masks. A simple if (cpu != me) would take care of excluding self. It might have better interaction with cpu hotplug as well, since the online map would be checked just before the call to that cpu is made. Thanks, Mike -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Wed, 27 Aug 2008, Paul Mackerras wrote: I think your memory is failing you. In 2.4 and earlier, the kernel stack was 8kB minus the size of the task_struct, which sat at the start of the 8kB. Yup, you're right. Linus -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Bernd Petrovitsch wrote: If you develop an embedded system (which is partly system integration of existing apps) to be installed in the field, you don't have that many conceivable work loads compared to a desktop/server system. And you have a fixed list of drivers and applications. Hah! Not in my line of embedded device. 32MB no-MMU ARM boards which people run new things and attach new devices to rather often - without making new hardware. Volume's too low per individual application to get new hardware designed and made. I'm seriously thinking of forwarding porting the 4 year old firmware from 2.4.26 to 2.6.current, just to get new drivers and capabilities. Backporting is tedious, so's feeling wretchedly far from the mainline world. A usual approach is to run stress tests on several (or all) subsystems/services/... in parallel and if the device survives it functioning correctly, it is at least good enough. Per application. Some little devices run hundreds of different applications and customers expect to customise, script themselves, and attach different devices (over USB). The next customer in the chain expects the bits you supplied to work in a variety of unexpected situations, even when you advise that it probably won't do that. Much like desktop/server Linux, but on a small device where silly little things like 'create a process' are a stress for the dear little thing. (My biggest lesson: insist on an MMU next time!) -- Jamie -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Wed, Aug 27, 2008 at 02:58:30PM +0300, Adrian Bunk wrote: On Tue, Aug 26, 2008 at 05:28:37PM -0700, Linus Torvalds wrote: On Wed, 27 Aug 2008, Adrian Bunk wrote: When did we get callpaths like like nfs+xfs+md+scsi reliably working with 4kB stacks on x86-32? XFS may never have been usable, but the rest, sure. And you seem to be making this whole argument an excuse to SUCK, adn an excuse to let gcc crap even more on our stack space. Why? Why aren't you saying that we should be able to do better? Instead, you seem to asking us to do even worse than we do now? My main point is: - getting 4kB stacks working reliably is a hard task - having an eye on gcc increasing the stack usage, and fixing it if required, is relatively easy If we should be able to do better at getting (and keeping) 4kB stacks working, then coping with possible inlining problems caused by gcc should not be a big problem for us. Out of the architectures you've mentioned for 4k stacks, they also tend to do IRQ stacks, which is something you seem to have overlooked. In addition to that, debugging the runaway stack users on 4k tends to be easier anyways since you end up blowing the stack a lot sooner. On sh we've had pretty good luck with it, though most of our users are using fairly deterministic workloads and continually profiling the footprint. Anything that runs away or uses an insane amount of stack space needs to be fixed well before that anyways, so catching it sooner is always preferable. I imagine the same case is true for m68knommu (even sans IRQ stacks). Things might be more sensitive on x86, but it's certainly not something that's a huge problem for the various embedded platforms to wire up, whether they want to go the IRQ stack route or not. In any event, lack of support for something on embedded architectures in the kernel is more often due to apathy/utter indifference on the part of the architecture maintainer rather than being indicative of any intrinsic difficulty in supporting the thing in question. Most new features on the lesser maintained architectures tend to end up there either out of peer pressure or copying-and-pasting accidents rather than any sort of design. ;-) -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Linus Torvalds wrote: Most LOCs of the kernel are not written by people like you or Al Viro or David Miller, and the average kernel developer is unlikely to do it as good as gcc. Sure. But we do have tools. We do have checkstack.pl, it's just that it hasn't been an issue in a long time, so I suspect many people didn't even _realize_ we have it, and I certainly can attest to the fact that even people who remember it - like me - don't actually tend to run it all that often. Sounds like what's really desired here isn't more worry and unpredictability, but for GCC+Binutils to gain the ability to calculate the stack depth over all callchains (doesn't have to be exact, just an upper bound; annotate recursions) in a way that's good enough to do on every compile, complain if a depth is exceeded statically (or it can't be proven), and to gain the architecture-independent option optimise to reduce stack usage. BTW: I just ran checkstack on a (roughly) allyesconfig kernel, and we have a new driver that allocates unsigned char recvbuf[1500]; on the stack... Yeah, it's _way_ too easy to do bad things. In my userspace code, I have macros tmp_alloc and tmp_free. They must be matched in the same function: unsigned char * recvbuf = tmp_alloc(1500); tmp_free(recvbuf); When stack is plentiful, it maps to alloca() which is roughly equivalent to using a stack variable. When stack is constrained (as it is on my little devices), that maps to xmalloc/free. The kernel equivalent would be kmalloc GFP_ATOMIC (perhaps). With different macros to mine, it may be possible to map small fixed-size requests exactly onto local variables, and large ones to kmalloc(). A stab at it (not tested): #define LOCAL_ALLOC_THRESHOLD 128 #define LOCAL_ALLOC(type, ptr)\ __typeof__(type) __attribute__((__unused__)) ptr##_local_struct; \ __typeof__(type) * ptr = \ ((__builtin_constant_p(sizeof(type))\ sizeof(type) = LOCAL_ALLOC_THRESHOLD) \ ? ptr##_local_struct : kmalloc(sizeof(type), GFP_ATOMIC)) #define LOCAL_FREE(ptr) \ ((__builtin_constant_p(sizeof (*(ptr))) \ sizeof(*(ptr)) = LOCAL_ALLOC_THRESHOLD) \ ? (void) 0 : kfree(ptr)) Would that be useful in the kernel? I'm thinking if it were a commonly used pattern for temporary buffers, unknown structures and arrays of macro-determined size, the new driver author would be less likely to accidentally drop a big object on the stack. Obviously it would be nicer for GCC to code such a thing automatically, but that really is wishful thinking. -- Jamie -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Wed, Aug 27, 2008 at 9:21 AM, Alan Cox [EMAIL PROTECTED] wrote: By your logic though, XFS on x86 should work fine with 4K stacks - many will attest that it does not and blows up due to stack issues. I have first hand experiences of things blowing up with deep call chains when using 4K stacks where 8K worked just fine on same workload. So there is definitely some other problem with 4K stacks. Nothing of the sort. If it blows up with a 4K stack it will almost certainly blow up with an 8K stack *eventually* - when a heavy stack usage coincides with a heavy stack using IRQ handler. You won't catch it in simple testing, you won't catch it in trivial simulation and it'll be incredibly hard to reproduce. Not the kind of bug you want in a production system really. IRQ stacks make things much more predictable. I see - so if I end up having a workload on 8k where heavy stack using IRQs and deep kernel call chains come at the same time - even 8K will blow up. So 4K will blow too except that it doesn't require IRQs also to use heavy stack, just XFS is good enough :) It then seems like the IRQs using lot of stack is not so much of a problem in the current kernel as much as deeper call chains and stack usage of normal non-irq path code is. So 8k makes it possible for the deeper call chains of non-irq path to survive since they get better part of the 8K to themselves and IRQs can do with less almost always. At least that's what I can derive from the fact that we do not have lots of reports of 8K stack blowing up. Thanks Parag -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Wed, 2008-08-27 at 16:48 +0100, Jamie Lokier wrote: Bernd Petrovitsch wrote: If you develop an embedded system (which is partly system integration of existing apps) to be installed in the field, you don't have that many conceivable work loads compared to a desktop/server system. And you have a fixed list of drivers and applications. Hah! Not in my line of embedded device. 32MB no-MMU ARM boards which people run new things and attach new devices to rather often - without making new hardware. Volume's too low per individual application to get new hardware designed and made. Yes, you may have several products on the same hardware with somewhat differing requirements (or not). But that is much less than a general purpose system IMHO. I'm seriously thinking of forwarding porting the 4 year old firmware from 2.4.26 to 2.6.current, just to get new drivers and capabilities. That sounds reasonable (and I never meant maintaining the old system infinitely. Actually once the thing is shipped it usually enters deep maintenance mode and the next is more a fork from the old). Backporting is tedious, so's feeling wretchedly far from the mainline world. ACK. But that also depends on amount local changes (and sorry, but not all locally necessary patches would be accepted in mainline in any way). A usual approach is to run stress tests on several (or all) subsystems/services/... in parallel and if the device survives it functioning correctly, it is at least good enough. Per application. Some little devices run hundreds of different applications and customers expect to customise, script themselves, and attach different devices (over USB). The next customer in the chain expects the bits you supplied to work in a variety of unexpected situations, even when you advise that it probably won't do that. Basically their problem. Yes, they actually think they get a Linux system where they can do everything and it simply works. Oh, that's obviously not a usual WLAN-router style of product (where you are not expected to actually login on a console or per ssh). Much like desktop/server Linux, but on a small device where silly little things like 'create a process' are a stress for the dear little thing. (My biggest lesson: insist on an MMU next time!) ACK. We avoid MMU-less hardware too - especially since there is enough hardware with a MMU around. Bernd -- Firmix Software GmbH http://www.firmix.at/ mobil: +43 664 4416156 fax: +43 1 7890849-55 Embedded Linux Development and Services -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Bernd Petrovitsch wrote: 32MB no-MMU ARM boards which people run new things and attach new devices to rather often - without making new hardware. Volume's too low per individual application to get new hardware designed and made. Yes, you may have several products on the same hardware with somewhat differing requirements (or not). But that is much less than a general purpose system IMHO. It is, but the idea that small embedded systems go through a 'all components are known, drivers are known, test and if it passes it's shippable' does not always apply. I'm seriously thinking of forwarding porting the 4 year old firmware from 2.4.26 to 2.6.current, just to get new drivers and capabilities. That sounds reasonable (and I never meant maintaining the old system infinitely. Sounds reasonable, but it's vetoed for anticipated time and cost, compared with backporting on demand. Fair enough, since 2.6.current doesn't support ARM no-MMU last I heard ('soon'?). On the other hand, the 2.6 anti-fragmentation patches, including latest SLUB stuff, ironically meant to help big machines, sound really appealing for my current problem and totally unrealistic to backport... ACK. We avoid MMU-less hardware too - especially since there is enough hardware with a MMU around. I can't emphasise enough how much difference MMU makes to Linux userspace. It's practically: MMU = standard Linux (with less RAM), have everything. No-MMU = lots of familiar 'Linux' things not available or break. -- Jamie -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Mit, 2008-08-27 at 18:51 +0100, Jamie Lokier wrote: Bernd Petrovitsch wrote: [...] It is, but the idea that small embedded systems go through a 'all components are known, drivers are known, test and if it passes it's shippable' does not always apply. Not always but often enough. And yes, there is ARM-based embedded hardware with 1GB Flash-RAM and 128MB RAM. I'm seriously thinking of forwarding porting the 4 year old firmware from 2.4.26 to 2.6.current, just to get new drivers and capabilities. That sounds reasonable (and I never meant maintaining the old system infinitely. Sounds reasonable, but it's vetoed for anticipated time and cost, That is to be expected;-) [] ACK. We avoid MMU-less hardware too - especially since there is enough hardware with a MMU around. I can't emphasise enough how much difference MMU makes to Linux userspace. It's practically: MMU = standard Linux (with less RAM), have everything. No-MMU = lots of familiar 'Linux' things not available or break. ACK. And tell that a customer that everything is more effort and more risk and not just simply cross-compile it as it runs on my desktop too. Bernd -- Firmix Software GmbH http://www.firmix.at/ mobil: +43 664 4416156 fax: +43 1 7890849-55 Embedded Linux Development and Services -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Wed, Aug 27, 2008 at 05:46:05PM -0700, David Miller wrote: From: Paul Mundt [EMAIL PROTECTED] Date: Thu, 28 Aug 2008 09:32:13 +0900 On Wed, Aug 27, 2008 at 08:35:44PM +0300, Adrian Bunk wrote: CONFIG_DEBUG_STACKOVERFLOW should give you the same information, and if wanted with an arbitrary limit. In some cases, yes. In the CONFIG_DEBUG_STACKOVERFLOW case the check is only performed from do_IRQ(), which is sporadic at best, especially on tickless. While it catches some things, it's not a complete solution in and of iteslf. BTW, on sparc64 we have a stack overflow checker that runs via the profiling _mcount hook. So every function call we check if the stack is getting overused. If so, we jump onto a special static debugging stack and print the stack overflow message. And yes it works with IRQ stacks which is all that sparc64 uses nowadays. Perhaps this is useful enough to make generic. Thanks for the pointer, I'll take a look at it! -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Paul Mundt wrote: On Wed, Aug 27, 2008 at 02:58:30PM +0300, Adrian Bunk wrote: On Tue, Aug 26, 2008 at 05:28:37PM -0700, Linus Torvalds wrote: On Wed, 27 Aug 2008, Adrian Bunk wrote: When did we get callpaths like like nfs+xfs+md+scsi reliably working with 4kB stacks on x86-32? XFS may never have been usable, but the rest, sure. And you seem to be making this whole argument an excuse to SUCK, adn an excuse to let gcc crap even more on our stack space. Why? Why aren't you saying that we should be able to do better? Instead, you seem to asking us to do even worse than we do now? My main point is: - getting 4kB stacks working reliably is a hard task - having an eye on gcc increasing the stack usage, and fixing it if required, is relatively easy If we should be able to do better at getting (and keeping) 4kB stacks working, then coping with possible inlining problems caused by gcc should not be a big problem for us. Out of the architectures you've mentioned for 4k stacks, they also tend to do IRQ stacks, which is something you seem to have overlooked. In addition to that, debugging the runaway stack users on 4k tends to be easier anyways since you end up blowing the stack a lot sooner. On sh we've had pretty good luck with it, though most of our users are using fairly deterministic workloads and continually profiling the footprint. Anything that runs away or uses an insane amount of stack space needs to be fixed well before that anyways, so catching it sooner is always preferable. I imagine the same case is true for m68knommu (even sans IRQ stacks). Yep, definitely true for m68knommu in my experience. I haven't had any problems with 4k stacks recently. But yes the workloads do tend to be constrained - and almost never use any of the more exotic filesystems or drivers. Things might be more sensitive on x86, but it's certainly not something that's a huge problem for the various embedded platforms to wire up, whether they want to go the IRQ stack route or not. In any event, lack of support for something on embedded architectures in the kernel is more often due to apathy/utter indifference on the part of the architecture maintainer rather than being indicative of any intrinsic difficulty in supporting the thing in question. Most new features on the lesser maintained architectures tend to end up there either out of peer pressure or copying-and-pasting accidents rather than any sort of design. ;-) Indeed :-) Regards Greg Greg Ungerer -- Chief Software Dude EMAIL: [EMAIL PROTECTED] Secure Computing CorporationPHONE: +61 7 3435 2888 825 Stanley St, FAX: +61 7 3891 3630 Woolloongabba, QLD, 4102, Australia WEB: http://www.SnapGear.com -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
* Linus Torvalds [EMAIL PROTECTED] wrote: On Mon, 25 Aug 2008, Linus Torvalds wrote: checkstack.pl shows these things as the top problems: 0x80266234 smp_call_function_mask [vmlinux]:2736 0x80234747 __build_sched_domains [vmlinux]: 2232 0x8023523f __build_sched_domains [vmlinux]: 2232 Anyway, the reason smp_call_function_mask and friends have such _huge_ stack usages for you is that they contain a 'cpumask_t' on the stack. In fact, they contain multiple CPU-masks, each 4k-bits - 512 bytes - in size. And they tend to call each other. Quite frankly, I don't think we were really ready for 4k CPU's. I'm going to commit this patch to make sure others don't do that many CPU's by mistake. It marks MAXCPU's as being 'broken' so you cannot select it, and also limits the number of CPU's that you _can_ select to just 512. yeah, that's OK i guess - distros can still enable 4K support if they wish to. Someone interested in improving the stack footprint situation should dust off the max-stack-footprint tracer so that we can catch these things in a more structured way. And i guess the next generation of 4K CPUs support should just get away from cpumask_t-on-kernel-stack model altogether, as the current model is not maintainable. We tried the on-kernel-stack variant, and it really does not work reliably. We can fix this in v2.6.28. Ingo -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
From: Ingo Molnar [EMAIL PROTECTED] Date: Tue, 26 Aug 2008 09:22:20 +0200 And i guess the next generation of 4K CPUs support should just get away from cpumask_t-on-kernel-stack model altogether, as the current model is not maintainable. We tried the on-kernel-stack variant, and it really does not work reliably. We can fix this in v2.6.28. I recenetly did some work on sparc64 to use cpumask pointers as much as possible. The only case that didn't work was due to a limitation in arch interfaces for the new generic smp_call_function() code. It passes a cpumask_t instead of a pointer to one via arch_send_call_function_ipi(). But other than that, the whole sparc64 SMP stuff uses cpumask_t pointers only. What it comes down to is that you have to do the self cpu and other tests in the cross-call dispatch routines themselves, instead of at the top-level working on cpumask_t objects. Otherwise you have to modify cpumask_t objects and thus pluck them onto the stack where they take up silly amounts of space. -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, Aug 26, 2008 at 12:53 AM, Ingo Molnar [EMAIL PROTECTED] wrote: * David Miller [EMAIL PROTECTED] wrote: From: Ingo Molnar [EMAIL PROTECTED] Date: Tue, 26 Aug 2008 09:22:20 +0200 And i guess the next generation of 4K CPUs support should just get away from cpumask_t-on-kernel-stack model altogether, as the current model is not maintainable. We tried the on-kernel-stack variant, and it really does not work reliably. We can fix this in v2.6.28. I recenetly did some work on sparc64 to use cpumask pointers as much as possible. The only case that didn't work was due to a limitation in arch interfaces for the new generic smp_call_function() code. It passes a cpumask_t instead of a pointer to one via arch_send_call_function_ipi(). But other than that, the whole sparc64 SMP stuff uses cpumask_t pointers only. wonder if could use unsigned long * directly. so could dyn_array directly like int cpumask_size; unsigned long *online_cpu_map; DEFINE_DYN_ARRAY(online_cpu_map, sizeof(unsigned long), cpumask_size, PAGE_SIZE, NULL); and after nr_cpu_ids is assigned, have cpumask_size = (nr_cpu_ids + sizeof(unsigned long) - 1)/sizeof(unsigned long); then we could NR_CPUS=4096 kernel to small system. ... YH -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, 26 Aug 2008, Yinghai Lu wrote: wonder if could use unsigned long * directly. I would actually suggest something like this: - we continue to have a magic cpumask_t. - we do different cases for big and small NR_CPUS: #if NR_CPUS = BITS_PER_LONG /* * Make it an array - that way passing it as an argument will * always pass it as a pointer! */ typedef unsigned long cpumask_t[1]; static inline void create_cpumask(cpumask_t *p) { *p = 0; } static inline void free_cpumask(cpumask_t *p) { } #else typedef unsigned long *cpumask_t; static inline void create_cpumask(cpumask_t *p) { *p = kcalloc(..); } static inline void free_cpumask(cpumask_t *p) { kfree(*p); } #endif and now after you do this, you can just do something like cpumask_t mycpu; create_cpumask(mycpu); .. free_cpumask(mycpu); and in between, you can use 'cpumask' as a pointer, because even when it is an array directly allocated on the stack, the array can always degenerate into a pointer by C type rules! And for the small-NR_CPUS case there is zero overhead. Linus -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, Aug 26, 2008 at 9:51 AM, Linus Torvalds [EMAIL PROTECTED] wrote: On Tue, 26 Aug 2008, Yinghai Lu wrote: wonder if could use unsigned long * directly. I would actually suggest something like this: - we continue to have a magic cpumask_t. - we do different cases for big and small NR_CPUS: #if NR_CPUS = BITS_PER_LONG /* * Make it an array - that way passing it as an argument will * always pass it as a pointer! */ typedef unsigned long cpumask_t[1]; static inline void create_cpumask(cpumask_t *p) { *p = 0; } static inline void free_cpumask(cpumask_t *p) { } #else typedef unsigned long *cpumask_t; static inline void create_cpumask(cpumask_t *p) { *p = kcalloc(..); } static inline void free_cpumask(cpumask_t *p) { kfree(*p); } #endif and now after you do this, you can just do something like cpumask_t mycpu; create_cpumask(mycpu); .. free_cpumask(mycpu); and in between, you can use 'cpumask' as a pointer, because even when it is an array directly allocated on the stack, the array can always degenerate into a pointer by C type rules! that is good for local variables. for global variables, need to allocate them in some point. may need one int cpumask_size; cpumask_t online_cpu_map; DEFINE_DYN_ARRAY(online_cpu_map, sizeof(unsigned long), cpumask_size, PAGE_SIZE, NULL); or something like that. YH -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, 26 Aug 2008, Rusty Russell wrote: Your workaround is very random, and that scares me. I think a huge number of CPUs needs a real solution (an actual cpumask allocator, then do something clever if we come across an actual fastpath). The thing is, the inlining thing is a separate issue. Yes, the cpumasks were what made stack pressure so critical to begin with, but no, a 400-byte stack frame in a deep callchain isn't acceptable _regardless_ of any cpumask_t issues. Gcc inlining is a total and utter pile of shit. And _that_ is the problem. I seriously think we shouldn't allow gcc to inline anything at all unless we tell it to. That's how it used to work, and quite frankly, that's how it _should_ work. The downsides of inlining are big enough from both a debugging and a real code generation angle (eg stack usage like this), that the upsides (_somesimes_ smaller kernel, possibly slightly faster code) simply aren't relevant. So the noinline was random, yes, but this is a real issue. Looking at checkstack output for a saner config (NR_CPUS=16), the top entries for me are things like ide_generic_init [vmlinux]: 1384 idefloppy_ioctl [vmlinux]: 1208 e1000_check_options [vmlinux]: 1152 ... which are leaf functions. They are broken as hell (the e1000 is apparently because it builds structs on the stack that should all be static const, for example), but they are different from something like the module init sequence in that they are not going to affect anything else. It would be interesting to see what -fno-default-inline does to the kernel. It really would get rid of a _lot_ of gcc version issues too. Inlining behavior of gcc has long been a problem for us. Linus -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, Aug 26, 2008 at 10:35:05AM -0700, Linus Torvalds wrote: On Tue, 26 Aug 2008, Rusty Russell wrote: Your workaround is very random, and that scares me. I think a huge number of CPUs needs a real solution (an actual cpumask allocator, then do something clever if we come across an actual fastpath). The thing is, the inlining thing is a separate issue. Yes, the cpumasks were what made stack pressure so critical to begin with, but no, a 400-byte stack frame in a deep callchain isn't acceptable _regardless_ of any cpumask_t issues. Gcc inlining is a total and utter pile of shit. And _that_ is the problem. I seriously think we shouldn't allow gcc to inline anything at all unless we tell it to. That's how it used to work, and quite frankly, that's how it _should_ work. The downsides of inlining are big enough from both a debugging and a real code generation angle (eg stack usage like this), that the upsides (_somesimes_ smaller kernel, possibly slightly faster code) simply aren't relevant. ... It would be interesting to see what -fno-default-inline does to the kernel. It really would get rid of a _lot_ of gcc version issues too. Inlining behavior of gcc has long been a problem for us. I added -fno-inline-functions-called-once -fno-early-inlining to KBUILD_CFLAGS, and (with gcc 4.3) that increased the size of my kernel image by 2%. And when David's -fwhole-program --combine will become ready the cost of disallowing gcc to inline functions will most likely increase. A debugging option (for better traces) to disallow gcc some inlining might make sense (and might even make sense for distributions to enable in their kernels), but when you go to use cases that require really small kernels the cost is too high. But if you don't trust gcc's inlining you should revert commit 3f9b5cc018566ad9562df0648395649aebdbc5e0 that increases gcc's freedom regarding what to inline in 2.6.27 - what gcc 4.2 does in the case of the regression tracked as Bugzilla #11276 is really not funny (two callers - function not inlined; gcc seems to emit the function although both callers later get removed (or at least should be removed) by dead code elimination). Linus cu Adrian -- Is there not promise of rain? Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. Only a promise, Lao Er said. Pearl S. Buck - Dragon Seed -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, 26 Aug 2008, Adrian Bunk wrote: A debugging option (for better traces) to disallow gcc some inlining might make sense (and might even make sense for distributions to enable in their kernels), but when you go to use cases that require really small kernels the cost is too high. You ignore the fact that it's really not just about debugging. Inlining really isn't the great tool some people think it is. Especially not since gcc stack allocation is so horrid that it won't re-use stack slots etc (which I don't disagree with per se - it's _hard_ to re-use stack slots while still allowing code scheduling). NOTE! I also would never claim that _our_ choices of inline are all that great, and we've often inlined too much or not inlined things that really could be inlined. But at least when a developer says inline (or forgets to say it), we have somebody to blame. When the compiler does insane things that doesn't suit us, we're just screwed. But if you don't trust gcc's inlining you should revert commit 3f9b5cc018566ad9562df0648395649aebdbc5e0 that increases gcc's freedom regarding what to inline in 2.6.27 Actually, that just allows gcc to _not_ inline. Which is probably ok. (Well, it would be ok if gcc did it well enough, it obviously has some problems at times). Linus -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, 26 Aug 2008, Adrian Bunk wrote: I added -fno-inline-functions-called-once -fno-early-inlining to KBUILD_CFLAGS, and (with gcc 4.3) that increased the size of my kernel image by 2%. Btw, did you check with just -fno-inline-functions-called-once? The -fearly-inlining decisions _should_ be mostly right. If gcc sees early that a function is so small (even without any constant propagation etc) that it can be inlined, it's probably right. The inline-functions-called-once thing is what causes even big functions to be inlined, and that's where you find the big downsides too (eg the stack usage). Linus -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Linus Torvalds wrote: On Mon, 25 Aug 2008, Linus Torvalds wrote: checkstack.pl shows these things as the top problems: 0x80266234 smp_call_function_mask [vmlinux]:2736 0x80234747 __build_sched_domains [vmlinux]: 2232 0x8023523f __build_sched_domains [vmlinux]: 2232 Anyway, the reason smp_call_function_mask and friends have such _huge_ stack usages for you is that they contain a 'cpumask_t' on the stack. In fact, they contain multiple CPU-masks, each 4k-bits - 512 bytes - in size. And they tend to call each other. Quite frankly, I don't think we were really ready for 4k CPU's. I'm going to commit this patch to make sure others don't do that many CPU's by mistake. It marks MAXCPU's as being 'broken' so you cannot select it, and also limits the number of CPU's that you _can_ select to just 512. Right now, 4k cpu's is known broken because of the stack usage. I'm not willing to debug more of these kinds of stack smashers, they're really nasty to work with. I wonder how many other random failures these have been involved with? This patch also makes the ifdef mess in Kconfig much cleaner and avoids duplicate definitions by just conditionally suppressing the question and giving higher defaults. We can enable MAXSMP and raise the CPU limits some time in the future. But that future is not going to be before 2.6.27 - the code simply isn't ready for it. The reason I picked 512 CPU's as the limit is that we _used_ to limit things to 255. So it's higher than it used to be, but low enough to still feel safe. Considering that a 4k-bit CPU mask (512 bytes) _almost_ worked, the 512-bit (64 bytes) masks are almost certainly fine. Still, sane people should limit their NR_CPUS to 8 or 16 or something like that. Very very few people really need the pain of big NR_CPUS. Not even just 512 CPU's. Travis, Ingo and Thomas cc'd, since they were involved in the original commit (1184dc2ffe2c8fb9afb766d870850f2c3165ef25) that raised the limit. Linus Hi Linus, [Sorry for the long winded response, but I felt that sufficient background is needed to address this issue YOMV :-)] The need to allow distros to set NR_CPUS=4096 (and NODES_SHIFT=9) is critical to our upcoming SGI systems using what we have been calling UV. This system will be capable of supporting 4096 cpu threads in a single system image (and 16k cpus/2k nodes is right around the corner). While obviously I cannot divulge too many details, it's sufficient to say there are customers who not only require this extended capability, but are extremely excited about it. But the nature of some of these system environments is that they will not accept a specially built kernel, but only a kernel that has been built and certified (both from the application standpoint as well as the security standpoint) by standard distributions. And you probably know how extensively these distributions test and certify for many known defects and absolutely require that incoming source changes come from the community supported source bases, primarily yours. Due to the lead time required to accomplish these certifications, the version of the distributions that will be available when this system releases will be based on 2.6.27. (They will allow patches post-2.6.27-rc.final as long as those are committed in the source base.) The two distributions that SGI supports for our customers is SLES (SUSE Linux Enterprise Server) and RHEL (Red Hat Enterprise Linux). [They, of course, are free to run any OS of their choosing, but SGI only provides front line support for those two.] I started last August to begin analyzing how to accomplish the above goals and where exactly are the hot spots in the kernel that would require attention. It quickly became clear that cpumask_t and nodemask_t are two variables that are very casually used (along with NR_CPUS), because the assumption was that 64 was more than sufficient for an upper limit and even extending it to 128 or 255 (254 was the maximum IPI broadcast ID until x2apic), only added a few more bytes here and there. I chose not to introduce too many dramatic changes and instead analyzed every instance where cpumask_t and NR_CPUS was being used (along with the node counterparts.) An initial proposal was to allow the default stack size to be increased, but this was met with a lot of objections because of the extensive work that was done to bring it to it's current size. So in summary, the goals of the changes that I have been making since last October are: 1. Allow a typical distro configured kernel with NR_CPUS=4096 and NODES_SHIFT=9 to be booted on an x86_64 system with 2GB of memory. (Some thought was given to use a 512Mb laptop as the base, but because of other memory bloat from using a 64-bit kernel, that was considered not very useful.) [Note I
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Linus Torvalds wrote: The inline-functions-called-once thing is what causes even big functions to be inlined, and that's where you find the big downsides too (eg the stack usage). That's a bit bizarre, though, isn't it? A function which is only called from one place should, if everything made sense, _never_ use more stack through being inlined. Inlining should just increase the opportunities that the called function's local variables can share the same stack slots are the caller's dead locals. Whereas not inlining guarantees they occupy separate, immediately adjacent regions of the stack, and shouldn't be increasing the total numbers of local variables. -- Jamie -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Ingo Molnar wrote: * Linus Torvalds [EMAIL PROTECTED] wrote: On Mon, 25 Aug 2008, Linus Torvalds wrote: checkstack.pl shows these things as the top problems: 0x80266234 smp_call_function_mask [vmlinux]:2736 0x80234747 __build_sched_domains [vmlinux]: 2232 0x8023523f __build_sched_domains [vmlinux]: 2232 Anyway, the reason smp_call_function_mask and friends have such _huge_ stack usages for you is that they contain a 'cpumask_t' on the stack. In fact, they contain multiple CPU-masks, each 4k-bits - 512 bytes - in size. And they tend to call each other. Quite frankly, I don't think we were really ready for 4k CPU's. I'm going to commit this patch to make sure others don't do that many CPU's by mistake. It marks MAXCPU's as being 'broken' so you cannot select it, and also limits the number of CPU's that you _can_ select to just 512. yeah, that's OK i guess - distros can still enable 4K support if they wish to. Someone interested in improving the stack footprint situation should dust off the max-stack-footprint tracer so that we can catch these things in a more structured way. And i guess the next generation of 4K CPUs support should just get away from cpumask_t-on-kernel-stack model altogether, as the current model is not maintainable. We tried the on-kernel-stack variant, and it really does not work reliably. We can fix this in v2.6.28. Ingo I would be most interested in any tools to analyze call-trees and accumulated stack usages. My current method of using kdb is really time consuming. Thanks! Mike -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, 26 Aug 2008, Mike Travis wrote: The need to allow distros to set NR_CPUS=4096 (and NODES_SHIFT=9) is critical to our upcoming SGI systems using what we have been calling UV. That's fine. You can do it. The default kernel will not, because it's clearly not safe. I really don't care what you do to _your_ images. But I will not distribute a known-broken kernel, and I will not debug random stack overflows that happen in it. If you want the default kernel to support 4k cores, we'll need to fix the stack usage. I don't think that is impossible, but IT IS NOT GOING TO HAPPEN for 2.6.27. And quite frankly, if some vendor like RedHat enables NR_CPUS=4096 by default, they are totally and utterly crazy. But some SGI-specific binary that is meant for SGI machines only, and has been extensively tested with the setup used on SGI machines is a different thing. Linus -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, 26 Aug 2008, Jamie Lokier wrote: A function which is only called from one place should, if everything made sense, _never_ use more stack through being inlined. But that's simply not true. See the whole discussion. The problem is that if you inline that function, the stack usage of the newly inlined function is now added to ALL THE OTHER paths too! So the case we had in module loading was that yes, we had a function with a big stack footprint, but it was NOT in the deep path. But by inlining it, it now moved the stack footprint up one level to another function, and now the big stack footprint really _was_ in the deep path, because the caller was involved in a much deeper chain. So inlining moves the code up the callchain, and that is a problem for the backtrace, but that's just a debugging issue. But it also moves the stack footprint up the callchain, and that can actually be a correctness issue. Of course, a compiler doesn't _have_ to do that. A compiler _could_ have multiple different stack footprints for a single function, and do liveness analysis etc. But no sane compiler probably does that, because it's very painful indeed, and it's not even an issue if you aren't stack-limited (and being stack-limited is really just a kernel thing). (Yeah, it can be an issue even if you have a big stack, in that you get worse cache behaviour, so a dense stack footprint _would_ help. But the complexity of stack liveness analysis is almost certainly not worth the relatively small gains it would get on some odd cases). Linus -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Linus Torvalds wrote: On Tue, 26 Aug 2008, Mike Travis wrote: The need to allow distros to set NR_CPUS=4096 (and NODES_SHIFT=9) is critical to our upcoming SGI systems using what we have been calling UV. That's fine. You can do it. The default kernel will not, because it's clearly not safe. I really don't care what you do to _your_ images. But I will not distribute a known-broken kernel, and I will not debug random stack overflows that happen in it. If you want the default kernel to support 4k cores, we'll need to fix the stack usage. I don't think that is impossible, but IT IS NOT GOING TO HAPPEN for 2.6.27. And quite frankly, if some vendor like RedHat enables NR_CPUS=4096 by default, they are totally and utterly crazy. But some SGI-specific binary that is meant for SGI machines only, and has been extensively tested with the setup used on SGI machines is a different thing. Linus Ok, thanks for the reply, and looking into this issue. We will strongly encourage our distros to base the relevant releases on 2.6.28. :-) [Supplying an SGI-specific kernel would not be acceptable to many of our customers because of the certification issues I mentioned.] Mike -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, Aug 26, 2008 at 12:09:46PM -0700, Linus Torvalds wrote: If you want the default kernel to support 4k cores, we'll need to fix the stack usage. I don't think that is impossible, but IT IS NOT GOING TO HAPPEN for 2.6.27. And quite frankly, if some vendor like RedHat enables NR_CPUS=4096 by default, they are totally and utterly crazy. heh. *picks through Fedora changelog* * Thu Aug 14 2008 Dave Jones [EMAIL PROTECTED] - Bump max cpus supported on x86-64 to 4096. Just to see what happens. I never did get to find out unfortunatly, because of the security fiasco in Fedora infrastructure the last week or two. But some SGI-specific binary that is meant for SGI machines only, and has been extensively tested with the setup used on SGI machines is a different thing. Every extra kernel image a distro vendor ends up shipping has an associated cost. * build time: It currently takes about 2 hours for a set of Fedora RPMs. For RHEL it'll be even worse due to the extra archs). Killing off -smp specific builds was a big win for us in this regard. Adding extra flavours is always painful. * diskspace (distro kernels aren't small. With the associated debugging symbols, they take up a shitload of disk space really fast). * Having everyone running the same kernel makes it much easier to test/debug. Our QA guys hate adding extra columns to their test matrix. But yes, for this to be even remotely feasible, there has to be a negligable performance cost associated with it, which right now, we clearly don't have. Given that the number of people running 4096 CPU boxes even in a few years time will still be tiny, punishing the common case is obviously absurd. Dave -- http://www.codemonkey.org.uk -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, 26 Aug 2008, Mike Travis wrote: I would be most interested in any tools to analyze call-trees and accumulated stack usages. My current method of using kdb is really time consuming. Well, even just scripts/checkstack.pl is quite relevant. The fact is, anything with a stack footprint of more than a hundred bytes is suspect. We _do_ have a lot of cases of several hundred bytes, and some of them are even very intentional. For an example of _intentional_ and valid large stacks, look at do_sys_poll and do_select. They both have a big stack footprint in a normal kernel, and that's on purpose - it's not pretty, but they are very common and performance-sensitive functions, and using a big stack allows some basic allocations to be much cheaper by default. Same goes for early_printk(), although I don't think the reasons are really very strong in that case. Sadly, while those functions are _fairly_ high up, they aren't at the top, and we do have a lot of other functions that have huge stack footprints for totally bogus reasons. But the intentional ones are at least in the top ten. But the kernel that Alan had problems with was different. The _intentional_ ones were way down in the noise. do_sys_poll wasn't in the top ten, it was barely even in the top 50! (It was in fact #49, to be exact). So look at the top ten in my kernel: 1 ide_generic_init [vmlinux]: 1384 2 idefloppy_ioctl [vmlinux]: 1208 3 e1000_check_options [vmlinux]: 1152 4 do_sys_poll [vmlinux]: 904 5 ide_floppy_get_capacity [vmlinux]: 872 6 do_select [vmlinux]:744 7 early_printk [vmlinux]: 720 8 do_task_stat [vmlinux]: 680 9 mmc_ioctl [vmlinux]:648 10 elf_kcore_store_hdr [vmlinux]: 576 .. and in Alan's kernel: 1 smp_call_function_mask [vmlinux]: 2736 2 __build_sched_domains [vmlinux]:2232 3 setup_IO_APIC_irq [vmlinux]:1616 4 arch_setup_ht_irq [vmlinux]:1600 5 arch_setup_msi_irq [vmlinux]: 1600 6 __assign_irq_vector [vmlinux]: 1592 7 move_task_off_dead_cpu [vmlinux]: 1592 8 tick_handle_oneshot_broadcast [vmlinux]:1544 9 store_scaling_governor [vmlinux]: 1376 10 cpuset_write_resmask [vmlinux]: 1360 That's a big difference. The top #1 in my kernel would just _barely_ be in the top 10 in Alan's kernel (he doesn't have it at all, because he didn't compile the drives I did into the kernel). And the top three in my kernel are just because of crap code. That e1000_check_options thing is there just because it creates multiple struct e1000_option structures. I wrote an ugly but totally trivial patch to get it down to ~600 bytes, and it would be less if I had bothered to waste any more time on it. The others are similar issues of people just didn't think. But look at the top ones in Alan's kernel. Not only are they _much_ bigger than the top ones in a sane kernel, they are _all_ due to cpumask_t, I think. Linus -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Linus Torvalds wrote: The downsides of inlining are big enough from both a debugging and a real code generation angle (eg stack usage like this), that the upsides (_somesimes_ smaller kernel, possibly slightly faster code) simply aren't relevant. So the noinline was random, yes, but this is a real issue. Looking at checkstack output for a saner config (NR_CPUS=16), the top entries for me are things like ide_generic_init [vmlinux]: 1384 idefloppy_ioctl [vmlinux]: 1208 e1000_check_options [vmlinux]: 1152 ... which are leaf functions. They are broken as hell (the e1000 is apparently because it builds structs on the stack that should all be static const, for example), but they are different from something like the module init sequence in that they are not going to affect anything else. e1000_check_options builds a struct (singular) on the stack, really... struct e1000_option is reasonably small. The problem, which has also shown itself in large ioctl-style switch{} statements, is that gcc will generate code such that the stack usage from independent code branches if {cond1} { char buster1[1000]; foo(buster1); } else if (cond2) { char buster2[1000]; foo(buster2); } are added together, not noticed as mutually exclusive. Of course, adding 'static const' as you noted is a reasonable workaround, but gcc is really annoying WRT stack allocation in this manner. I had problems in the past, before struct ethtool_ops, with like ethtool ioctl switch statements using gobs of stack. In fact, that was a big motivation for struct ethtool_ops. Jeff -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Dave Jones wrote: ... But yes, for this to be even remotely feasible, there has to be a negligable performance cost associated with it, which right now, we clearly don't have. Given that the number of people running 4096 CPU boxes even in a few years time will still be tiny, punishing the common case is obviously absurd. Dave I did do some fairly extensive benchmarking between configs of NR_CPUS = 128 and 4096 and most performance hits were in the neighborhood of 5% on systems with 8 cpus and 4GB of memory (our most common test system). [But changing cpumask_t's to be pointers instead of values will likely increase this.] I've tried to be very sensitive to this issue with all my previous changes, so convincing the distros to set NR_CPUS=4096 would be as painless for them as possible. ;-) Btw, huge count cpu systems I don't think are that far away. I believe the nextgen Larabbee chips will be geared towards HPC applications [instead of just GFX apps], and putting 4 of these chips on a motherboard would add up to 512 cpu threads (1024 if they support hyperthreading.) Thanks, Mike -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: e1000 horridness (was Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected)
Linus Torvalds wrote: On Tue, 26 Aug 2008, Jeff Garzik wrote: e1000_check_options builds a struct (singular) on the stack, really... struct e1000_option is reasonably small. No it doesn't. Look a bit more closely. It builds a struct (singular) MANY MANY times. It also then builds up a huge e1000_opt_list[] array, even though it is const and should be static (and const). I know. I wrote a patch to FIX it. totally cool patch afaics - if I still maintained this driver I'd have this tested and merged right away :) I suppose Jeff Kirsher is already doing so right now. I suppose that he'll have to look at the other Intel ethernet drivers as well :) Jeff, please add my: Reveiewed-by: Auke Kok [EMAIL PROTECTED] Cheers, Auke Here's the patch. It shrinks the stack from 1152 bytes to 192 bytes (the first version, that only did the e1000_option part, got it down to 600 bytes). About half comes from not using multiple e1000_option structures, the other half comes from turning the e1000_opt_list[] arrays into static const instead, so that gcc doesn't copy them onto the stack. Most of the patch is actually doing things like turning struct struct e1000_option opt = { (which declares a _new_ e1000_option variable each time) into opt = (struct e1000_option) { which just re-uses the single variable. It becomes slightly larger than that, because some places the opt = .. had to be moved around, since it's no longer a variable declaration, but a regular assignment. The rest is just adding const to the right places, and turning struct e1000_opt_list speed_list[] = .. into static const struct e1000_opt_list speed_list[] = .. instead, and fixing the indentation to be more straightforward. I have not tested the dang thing, but I think it's correct. And it turns stack usage from totally horrible and broken into pretty reasonable. Linus --- drivers/net/e1000/e1000_param.c | 81 +- 1 files changed, 45 insertions(+), 36 deletions(-) diff --git a/drivers/net/e1000/e1000_param.c b/drivers/net/e1000/e1000_param.c index b9f90a5..213437d 100644 --- a/drivers/net/e1000/e1000_param.c +++ b/drivers/net/e1000/e1000_param.c @@ -208,7 +208,7 @@ struct e1000_option { } r; struct { /* list_option info */ int nr; - struct e1000_opt_list { int i; char *str; } *p; + const struct e1000_opt_list { int i; char *str; } *p; } l; } arg; }; @@ -242,7 +242,7 @@ static int __devinit e1000_validate_option(unsigned int *value, break; case list_option: { int i; - struct e1000_opt_list *ent; + const struct e1000_opt_list *ent; for (i = 0; i opt-arg.l.nr; i++) { ent = opt-arg.l.p[i]; @@ -279,7 +279,9 @@ static void e1000_check_copper_options(struct e1000_adapter *adapter); void __devinit e1000_check_options(struct e1000_adapter *adapter) { + struct e1000_option opt; int bd = adapter-bd_number; + if (bd = E1000_MAX_NIC) { DPRINTK(PROBE, NOTICE, Warning: no configuration for board #%i\n, bd); @@ -287,19 +289,21 @@ void __devinit e1000_check_options(struct e1000_adapter *adapter) } { /* Transmit Descriptor Count */ - struct e1000_option opt = { + struct e1000_tx_ring *tx_ring = adapter-tx_ring; + int i; + e1000_mac_type mac_type = adapter-hw.mac_type; + + opt = (struct e1000_option) { .type = range_option, .name = Transmit Descriptors, .err = using default of __MODULE_STRING(E1000_DEFAULT_TXD), .def = E1000_DEFAULT_TXD, - .arg = { .r = { .min = E1000_MIN_TXD }} + .arg = { .r = { + .min = E1000_MIN_TXD, + .max = mac_type e1000_82544 ? E1000_MAX_TXD : E1000_MAX_82544_TXD + }} }; - struct e1000_tx_ring *tx_ring = adapter-tx_ring; - int i; - e1000_mac_type mac_type = adapter-hw.mac_type; - opt.arg.r.max = mac_type e1000_82544 ? - E1000_MAX_TXD : E1000_MAX_82544_TXD; if (num_TxDescriptors bd) { tx_ring-count = TxDescriptors[bd]; @@ -313,19 +317,21 @@ void __devinit e1000_check_options(struct e1000_adapter *adapter) tx_ring[i].count = tx_ring-count; } { /* Receive Descriptor Count */ - struct e1000_option opt = { + struct e1000_rx_ring *rx_ring = adapter-rx_ring; +
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, Aug 26, 2008 at 11:40:10AM -0700, Linus Torvalds wrote: On Tue, 26 Aug 2008, Adrian Bunk wrote: A debugging option (for better traces) to disallow gcc some inlining might make sense (and might even make sense for distributions to enable in their kernels), but when you go to use cases that require really small kernels the cost is too high. You ignore the fact that it's really not just about debugging. I had in mind that we anyway have to support it for tiny kernels. I simply don't see that we add kconfig options for 5kB of code for tiny kernels but remove something like this that can cause size increases 1%. Inlining really isn't the great tool some people think it is. Especially not since gcc stack allocation is so horrid that it won't re-use stack slots etc (which I don't disagree with per se - it's _hard_ to re-use stack slots while still allowing code scheduling). gcc's stack allocation has become better (that's why we disable unit-at-a-time only for gcc 3.4 on i386). NOTE! I also would never claim that _our_ choices of inline are all that great, and we've often inlined too much or not inlined things that really could be inlined. But at least when a developer says inline (or forgets to say it), we have somebody to blame. When the compiler does insane things that doesn't suit us, we're just screwed. Most LOCs of the kernel are not written by people like you or Al Viro or David Miller, and the average kernel developer is unlikely to do it as good as gcc. For the average driver the choice is realistically between inline's randomly sprinkled across the driver and no inline's, leave it to gcc. And code evolves during the years from tiny with 1 caller to huge with many callers. BTW: I just ran checkstack on a (roughly) allyesconfig kernel, and we have a new driver that allocates unsigned char recvbuf[1500]; on the stack... But if you don't trust gcc's inlining you should revert commit 3f9b5cc018566ad9562df0648395649aebdbc5e0 that increases gcc's freedom regarding what to inline in 2.6.27 Actually, that just allows gcc to _not_ inline. Which is probably ok. (Well, it would be ok if gcc did it well enough, it obviously has some problems at times). With the gcc inline's static functions you complain about we have 4-5 years of experience. Suddenly allowing 4 release series of gcc to ignore any inline's is a completely new area for us. I'd generally agree with giving gcc more freedom here, but I'd rather do it right by removing tons of wrong inline's than doing one global change hoping that it will make things better. And whether the optimized inlining actually makes the kernel bigger or smaller depends in my experience on the .config and the gcc version. Linus cu Adrian [1] there are some rare exceptions -- Is there not promise of rain? Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. Only a promise, Lao Er said. Pearl S. Buck - Dragon Seed -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, 26 Aug 2008, Adrian Bunk wrote: I had in mind that we anyway have to support it for tiny kernels. I actually don't think that is true. If we really were to decide to be stricter about it, and it makes a big size difference, we can probably also add a tool to warn about functions that really should be inline. Inlining really isn't the great tool some people think it is. Especially not since gcc stack allocation is so horrid that it won't re-use stack slots etc (which I don't disagree with per se - it's _hard_ to re-use stack slots while still allowing code scheduling). gcc's stack allocation has become better (that's why we disable unit-at-a-time only for gcc 3.4 on i386). I agree that it has become better. But it still absolutely *sucks*. For example, see the patch I just posted about e1000 stack usage. Even though the variables were all in completely separate scopes, they all got individual space on the stack over the whole lifetime of the function, causing an explosion of stack-space. As such, gcc used 500 bytes too much of stack, just because it didn't re-use the stackspace. That was with gcc-4.3.0, and no, there were hardly any inlining issues involevd, although it is true that inlining actually did make it slightly worse in that case too (but since it was essentially a leaf function, that had little real life impact, since there were no deep callchains below it to care). So the fact is, better simply is not good enough. We still need to do a lot of optimizations _manually_, because gcc cannot see that it can re-use the stack-slots. And sometimes those optimizations are actually performance pessimizations, because in order to make gcc not use all the stack at the same time, you simply have to break things out and force-disable inlining. Most LOCs of the kernel are not written by people like you or Al Viro or David Miller, and the average kernel developer is unlikely to do it as good as gcc. Sure. But we do have tools. We do have checkstack.pl, it's just that it hasn't been an issue in a long time, so I suspect many people didn't even _realize_ we have it, and I certainly can attest to the fact that even people who remember it - like me - don't actually tend to run it all that often. For the average driver the choice is realistically between inline's randomly sprinkled across the driver and no inline's, leave it to gcc. And neither is likely to be a big problem. BTW: I just ran checkstack on a (roughly) allyesconfig kernel, and we have a new driver that allocates unsigned char recvbuf[1500]; on the stack... Yeah, it's _way_ too easy to do bad things. With the gcc inline's static functions you complain about we have 4-5 years of experience. Sure. And most of it isn't all that great. But I do agree that lettign gcc make more decisions is _dangerous_. However, in this case, at least, the decisions it makes would at least make for less inlining, and thus less stack space explosion. Linus -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
From: Mike Travis [EMAIL PROTECTED] Date: Tue, 26 Aug 2008 12:06:18 -0700 David Miller wrote: The only case that didn't work was due to a limitation in arch interfaces for the new generic smp_call_function() code. It passes a cpumask_t instead of a pointer to one via arch_send_call_function_ipi(). But other than that, the whole sparc64 SMP stuff uses cpumask_t pointers only. What it comes down to is that you have to do the self cpu and other tests in the cross-call dispatch routines themselves, instead of at the top-level working on cpumask_t objects. Otherwise you have to modify cpumask_t objects and thus pluck them onto the stack where they take up silly amounts of space. Yes, I had proposed either modifying, or supplementing a new smp_call function to pass the cpumask_t as a pointer (similar to set_cpus_allowed_ptr.) But an ABI change such as this was not well received at the time. What it seems to come down to is that any cpumask_t not inside of a dynamically allocated object should be marked const. And that is something we can enforce at compile time. Linus has just suggested dynamically allocating cpumask_t's for such cases but I don't see that as the fix either. Just mark them const and enforce that cpumask_t objects can only be modified when they appear in dynamically allocated objects. You really don't need to modify the ones that passed around functions anyways. The only code that wants to change bits in these things is the cpu cross-call dispatch stuff, and that cpu choice logic can just live where it belongs down in the cross-call dispatch code. -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, Aug 26, 2008 at 11:47:01AM -0700, Linus Torvalds wrote: On Tue, 26 Aug 2008, Adrian Bunk wrote: I added -fno-inline-functions-called-once -fno-early-inlining to KBUILD_CFLAGS, and (with gcc 4.3) that increased the size of my kernel image by 2%. Btw, did you check with just -fno-inline-functions-called-once? The -fearly-inlining decisions _should_ be mostly right. If gcc sees early that a function is so small (even without any constant propagation etc) that it can be inlined, it's probably right. The inline-functions-called-once thing is what causes even big functions to be inlined, and that's where you find the big downsides too (eg the stack usage). -fno-inline-functions-called-once alone costs me nearly 1% in code size. And I'd expect it to become more with -fwhole-program --combine. If you think we have too many stacksize problems I'd suggest to consider removing the choice of 4k stacks on i386, sh and m68knommu instead of using -fno-inline-functions-called-once: Now that 32bit x86 is no longer used for extreme highend configurations the only serious usecase for 4k stacks are AFAIK space savings on embedded archs. 4k stacks have caused us much pain [1], and the cases where gcc inlined too much were the easy ones. I'm not saying that I'd like removing the choice of 4k stacks, but if we want to reduce the number of stack related problems that's IMHO the better alternative. Linus cu Adrian [1] AFAIR some callpaths in the kernel are still too big BTW: In case anyone wonders about why I suggest removing 4k stacks: My position is that 4k stacks should either be enabled unconditionally or no longer offered at all. And if we remove 4k stacks from 32bit x86 it's no longer realistically maintainable for other architectures. -- Is there not promise of rain? Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. Only a promise, Lao Er said. Pearl S. Buck - Dragon Seed -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, 26 Aug 2008, Adrian Bunk wrote: If you think we have too many stacksize problems I'd suggest to consider removing the choice of 4k stacks on i386, sh and m68knommu instead of using -fno-inline-functions-called-once: Don't be silly. That makes the problem _worse_. We're much better off with a 1% code-size reduction than forcing big stacks on people. The 4kB stack option is also a good way of saying if it works with this, then 8kB is certainly safe. And embedded people (the ones that might care about 1% code size) are the ones that would also want smaller stacks even more! Linus -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: e1000 horridness (was Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected)
On Tue, Aug 26, 2008 at 1:14 PM, Kok, Auke [EMAIL PROTECTED] wrote: Linus Torvalds wrote: On Tue, 26 Aug 2008, Jeff Garzik wrote: e1000_check_options builds a struct (singular) on the stack, really... struct e1000_option is reasonably small. No it doesn't. Look a bit more closely. It builds a struct (singular) MANY MANY times. It also then builds up a huge e1000_opt_list[] array, even though it is const and should be static (and const). I know. I wrote a patch to FIX it. totally cool patch afaics - if I still maintained this driver I'd have this tested and merged right away :) I suppose Jeff Kirsher is already doing so right now You suppose correctly. . I suppose that he'll have to look at the other Intel ethernet drivers as well :) Jeff, please add my: Reveiewed-by: Auke Kok [EMAIL PROTECTED] Will do. -- Cheers, Jeff -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, Aug 26, 2008 at 5:04 PM, Linus Torvalds [EMAIL PROTECTED] wrote: And embedded people (the ones that might care about 1% code size) are the ones that would also want smaller stacks even more! This is something I never understood - embedded devices are not going to run more than a few processes and 4K*(Few Processes) IMHO is not worth a saving now a days even in embedded world given falling memory prices. Or do I misunderstand? Parag -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Parag Warudkar wrote: On Tue, Aug 26, 2008 at 5:04 PM, Linus Torvalds [EMAIL PROTECTED] wrote: And embedded people (the ones that might care about 1% code size) are the ones that would also want smaller stacks even more! This is something I never understood - embedded devices are not going to run more than a few processes and 4K*(Few Processes) IMHO is not worth a saving now a days even in embedded world given falling memory prices. Or do I misunderstand? Embedded applications span a huge range of sizes, from the very small devices to which you refer, to quite complex devices. The cable settop boxes we develop have over a hundred interrupt sources, typically run 250-300 threads, and have 192+ MiB of memory. For all that, we are very cost sensitive and are under constant pressure to come up with reliable ways to save memory. Parag -- David VomLehn -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, Aug 26, 2008 at 04:00:33PM -0700, David VomLehn wrote: Parag Warudkar wrote: On Tue, Aug 26, 2008 at 5:04 PM, Linus Torvalds [EMAIL PROTECTED] wrote: And embedded people (the ones that might care about 1% code size) are the ones that would also want smaller stacks even more! This is something I never understood - embedded devices are not going to run more than a few processes and 4K*(Few Processes) IMHO is not worth a saving now a days even in embedded world given falling memory prices. Or do I misunderstand? Embedded applications span a huge range of sizes, from the very small devices to which you refer, to quite complex devices. The cable settop boxes we develop have over a hundred interrupt sources, typically run 250-300 threads, and have 192+ MiB of memory. For all that, we are very cost sensitive and are under constant pressure to come up with reliable ways to save memory. As you say correctly the term embedded gets used for many different devices. And if you have 192+ MiB of memory you have so much that all these kernel size discussions don't really matter. David VomLehn cu Adrian -- Is there not promise of rain? Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. Only a promise, Lao Er said. Pearl S. Buck - Dragon Seed -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Wed, 27 Aug 2008, Adrian Bunk wrote: We're much better off with a 1% code-size reduction than forcing big stacks on people. The 4kB stack option is also a good way of saying if it works with this, then 8kB is certainly safe. You implicitely assume both would solve the same problem. I'm just saying that your logic doesn't hold water. If we can save kernel stack usage, then a 1% increase in kernel size is more than worth it. While 4kB stacks are something we anyway never got 100% working What? Don't be silly. Linux _historically_ always used 4kB stacks. No, they are likely not usable on x86-64, but dammit, they should be more than usable on x86-32 still. But I do not think the problem you'd solve with -fno-inline-functions-called-once is big enough to warrant the size increase it causes. You continually try to see the inlining as a single solution to one problem (debuggability, stack, whatever). The biggest problem with gcc inlining has always been that it has been _unpredictable_. It causes problems in many different ways. It has caused stability issues due to gcc versions doing random things. It causes the stack expansion. It makes stack traces harder for debugging, etc. If it was any one thing, I wouldn't care. But it's exactly the fact that it causes all these problems in different areas. Linus -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, Aug 26, 2008 at 04:51:52PM -0700, Linus Torvalds wrote: On Wed, 27 Aug 2008, Adrian Bunk wrote: We're much better off with a 1% code-size reduction than forcing big stacks on people. The 4kB stack option is also a good way of saying if it works with this, then 8kB is certainly safe. You implicitely assume both would solve the same problem. I'm just saying that your logic doesn't hold water. If we can save kernel stack usage, then a 1% increase in kernel size is more than worth it. From some tests the size increase seems to become bigger for smaller kernels, but I don't have any really good data. An interesting question is why most of our architectures for embedded devices only offer bigger stacks: The only architectures offering a 4kB stacks option are: - m68knommu - sh - 32bit x86 The following architectures that are used in embedded devices always use 8kB stacks (or bigger) in your tree: - arm - avr32 - blackfin - cris - frv - h8300 - m32r - m68k - mips - mn10300 (has an #ifdef CONFIG_4KSTACKS but no kconfig option) - powerpc - xtensa While 4kB stacks are something we anyway never got 100% working What? Don't be silly. Linux _historically_ always used 4kB stacks. No, they are likely not usable on x86-64, but dammit, they should be more than usable on x86-32 still. When did we get callpaths like like nfs+xfs+md+scsi reliably working with 4kB stacks on x86-32? ... Linus cu Adrian -- Is there not promise of rain? Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. Only a promise, Lao Er said. Pearl S. Buck - Dragon Seed -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Wed, 27 Aug 2008, Adrian Bunk wrote: When did we get callpaths like like nfs+xfs+md+scsi reliably working with 4kB stacks on x86-32? XFS may never have been usable, but the rest, sure. And you seem to be making this whole argument an excuse to SUCK, adn an excuse to let gcc crap even more on our stack space. Why? Why aren't you saying that we should be able to do better? Instead, you seem to asking us to do even worse than we do now? Linus -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, Aug 26, 2008 at 8:53 PM, Greg Ungerer [EMAIL PROTECTED] wrote: I have some simple devices (network access/routers) with 8MB of RAM, at power up not really being configured to do anything running 25 processes. (Heck there is over 10 kernel processes running!). Configure some interfaces and services and that will easily push past 40. I'd be happy with a 160k saving :-) So you really need to run all 25 processes on that 8Mb box? (For reference even the NGW100 development board comes with 16Mb RAM). Even if you do need those all 25 processes on the 8Mb box, fixing the memory usage of those user space hogs is lot better than trying to save 160Kb in kernel stacks. Last I looked, user space wasn't particularly frugal with memory usage. Parag -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, 26 Aug 2008, Parag Warudkar wrote: And although you said in your later reply that Linux x86 with 4K stacks should be more than usable - my experiences running a untainted desktop/file server with 4K stack have been always disastrous XFS or not. It _might_ work for some well defined workloads but you would not want to risk 4K stacks otherwise. Umm. How long? 4kB used to be the _only_ choice. And no, there weren't even irq stacks. So that 4kB was not just the whole kernel call-chain, it was also all the irq nesting above it. And yes, we've gotten much worse over time, and no, I can't really suggest going back to that in general. The code bloat has certainly been accompanied by a stack bloat too. But part of it is definitely gcc. Some versions of gcc used to be absolutely _horrid_ when it came to stack usage, especially with some flags, and especially with the crazy inlining that module-at-a-time caused. But I'd be really happy if some embedded people tried to take some of that bloat back, and aim for 4kB stacks. Because it's definitely not unrealistic. At least it _shouldn't_ be. And a lot of the cases of us having structures on the stack is actually not worth it, and tends to be about being lazy rather than anything else. Linus -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Tue, 26 Aug 2008, Parag Warudkar wrote: What about deep call chains? The problem with the uptake of 4K stacks seems to be that is not reliably provable that it will work under all circumstances. Umm. Neither is 8k stacks. Nobody proved anything. But yes, some subsystems have insanely deep call chains. And yes, things like the VFS recursion (for symlinks) makes that deeper yet for filesystems, although only on the lookup path. And that is exactly the kind of thing that can exacerbate the problem of the compiler artificially making for a bigger stack footprint of a function (*). For things like the VFS layer, right now we allow a nesting level of 8, I think. If I remember correctly, it was 5 historically. Part of raising that depth, though, was that we actually moved the recursive part into fs/namei.c, and the nesting stack-depth was something pretty damn small when the filesystem used follow_link properly and let the VFS do it for it (ie the callchain to actually look up the link could be deep, but it would not recurse back, and instead just return a pointer, so that the actual _recursive_ part was just __do_follow_link() and is just a few words on the stack). So yes, we do have some deep callchains, but they tend to be pretty well managed for _good_ code. The problems tend to be the areas with lots of indirection layers, and yeah, XFS, MD and ACPI all have those kinds of things. In an embdedded world, many of those should be a non-issue, though. Linus (*) ie the function that _is_ on the deep chain doesn't actually need much of a stack footprint at all itself, but it may call a helper function that is _not_ in the deep chain, and if it gets inlined it may give its excessive stack footprint to the deep chain - and this is _exactly_ the problem that happened with inlining load_module(). -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Linus Torvalds wrote: On Sat, 23 Aug 2008, Linus Torvalds wrote: This one makes no sense. It's triggering a BUG_ON(in_interrupt()), but then the call chain shows that there is no interrupt going on. Ahh, later in that thread there's another totally unrelated oops in debug_mutex_add_waiter(). I'd guess that it is really wild pointer corrupting memory, quite possibly due to a double free or something like that. Alan - it would be good to run with DEBUG_PAGE_ALLOC and SLUB debugging etc if you don't already do that? Linus I'll add those in - as to the repeatability: The bad kernels seem to repeat quite reliably - not only in terms of counts (5 or 6 times in a row before trying something else), but also in terms of the what - either the original issue () or the other kernel with the later issue (debug_mutex_add_waiter). That's /goodness/ in that it should help narrow it down. I'll make sure the kernel is still failing this morning, and then add in DEBUG_PAGE_ALLOC and if that doesn't help, SLUB debugging... Alan -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Arjan van de Ven wrote: Wonder what gcc is in use? (newer ones tend to be a ton better... but maybe Alex is using a really old one) I'm running Ubuntu 8.04 w/ gcc: gcc (GCC) 4.2.3 (Ubuntu 4.2.3-2ubuntu7) Alan -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Before adding any more debugging, this is the status of my kernel boots: 3 times in a row w/ this same error. (Primary problem is the same, secondary stacks differ of course.) Alan Loading, please [6.482953] busybox used greatest stack depth: 4840 bytes left wait... [6.521876] all_generic_ide used greatest stack depth: 4784 bytes left Begin: Loading essential drivers... ... [6.625509] fuse init (API version 7.9) [6.625509] modprobe used greatest stack depth: 1720 bytes left [6.644854] ACPI: SSDT CFFD0D0A, 08C4 (r1 HPQOEM CPU_TM21 MSFT 10E) [6.651489] BUG: unable to handle kernel NULL pointer dereference at 0858 [6.655631] IP: [8025e302] debug_mutex_add_waiter+0x32/0x80 [6.655631] PGD 21a0a4067 PUD 21a4bd067 PMD 0 [6.655631] Oops: 0002 [1] SMP [6.655631] CPU 1 [6.655631] Modules linked in: processor(+) fan thermal_sys fuse [6.655631] Pid: 1259, comm: modprobe Not tainted 2.6.27-rc3 #29 [6.655631] RIP: 0010:[8025e302] [8025e302] debug_mutex_add_waiter+0x32/0x80 [6.655631] RSP: 0018:88021a4e7998 EFLAGS: 00010002 [6.655631] RAX: RBX: 88021a4e79d8 RCX: [6.655631] RDX: 0001 RSI: 88021a4e79d8 RDI: a0091a60 [6.655631] RBP: 88021a4e79b8 R08: 811deff0 R09: 8800a6fdb000 [6.655631] R10: a008f524 R11: R12: a0091a60 [6.655631] R13: 88021a4e6000 R14: 88021a9c40a0 R15: a0091a98 [6.655631] FS: 7f233f11d6e0() GS:88022fc02a00() knlGS: [6.655631] CS: 0010 DS: ES: CR0: 8005003b [6.655631] CR2: 0858 CR3: 00021a07e000 CR4: 06e0 [6.655631] DR0: DR1: DR2: [6.655631] DR3: DR6: 0ff0 DR7: 0400 [6.655631] Process modprobe (pid: 1259, threadinfo 88021a4e6000, task 88021a9c40a0) [6.655631] Stack: a0091a60 0246 a008f524 [6.655631] 88021a4e7a38 8049f596 a008f524 a0091a18 [6.655631] 88021a4e79d8 88021a4e79d8 [6.655631] Call Trace: [6.655631] [a008f524] ? get_idr+0x44/0xa0 [thermal_sys] [6.655631] [8049f596] mutex_lock_nested+0xa6/0x250 [6.655631] [a008f524] ? get_idr+0x44/0xa0 [thermal_sys] [6.655631] [803635c4] ? idr_pre_get+0x44/0x90 [6.655631] [a008f524] get_idr+0x44/0xa0 [thermal_sys] [6.655631] [a008fe43] thermal_cooling_device_register+0x83/0x250 [thermal_sys] [6.655631] [a019b2a3] acpi_processor_start+0x64b/0x774 [processor] [6.655631] [8031a94b] ? __sysfs_add_one+0x6b/0xa0 [6.655631] [8031ba3c] ? sysfs_do_create_link+0xbc/0x150 [6.655631] [803a7f5e] acpi_start_single_object+0x2d/0x52 [6.655631] [803a9556] acpi_device_probe+0x7e/0x92 [6.655631] [803dd3eb] driver_probe_device+0x9b/0x1a0 [6.655631] [803dd576] __driver_attach+0x86/0x90 [6.655631] [803dd4f0] ? __driver_attach+0x0/0x90 [6.655631] [803dc93d] bus_for_each_dev+0x5d/0x90 [6.655631] [803dd22c] driver_attach+0x1c/0x20 [6.655631] [803dcf79] bus_add_driver+0x1e9/0x260 [6.655631] [a0222000] ? acpi_processor_init+0x0/0x107 [processor] [6.655631] [803dd74f] driver_register+0x5f/0x140 [6.655631] [a0222000] ? acpi_processor_init+0x0/0x107 [processor] [6.655631] [803a9866] acpi_bus_register_driver+0x3e/0x40 [6.655631] [a0222094] acpi_processor_init+0x94/0x107 [processor] [6.655631] [80209040] _stext+0x40/0x180 [6.655631] [802a8911] ? __vunmap+0xa1/0x110 [6.655631] [802676c2] sys_init_module+0x142/0x1dc0 [6.655631] [80367b16] ? __up_read+0x46/0xb0 [6.655631] [8048e570] ? cpu_down+0x0/0x70 [6.655631] [8020c34b] system_call_fastpath+0x16/0x1b [6.655631] [6.655631] [6.655631] Code: 20 48 89 5d e8 4c 89 65 f0 48 89 f3 4c 89 6d f8 8b 47 08 49 89 d5 49 89 fc 89 c2 25 ff ff 00 00 c1 ea 10 39 c2 74 1d 49 8b 4 [6.655631] RIP [8025e302] debug_mutex_add_waiter+0x32/0x80 [6.655631] RSP 88021a4e7998 [6.655631] CR2: 0858 [6.655631] ---[ end trace 8bbd31df1403e48e ]--- [7.024992] modprobe used greatest stack depth: 408 bytes left [7.030988] BUG: unable to handle kernel NULL pointer dereference at 0048 [7.031053] IP: [8023f39c] do_exit+0x28c/0xa10 [7.031053] PGD 0 [7.031053] Oops: [2] SMP [7.031053] CPU 1 [7.031053] Modules linked in: processor(+) fan thermal_sys fuse [7.031053] Pid: 1259, comm: modprobe Tainted: G D 2.6.27-rc3
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Linus Torvalds wrote: On Sat, 23 Aug 2008, Linus Torvalds wrote: This one makes no sense. It's triggering a BUG_ON(in_interrupt()), but then the call chain shows that there is no interrupt going on. Ahh, later in that thread there's another totally unrelated oops in debug_mutex_add_waiter(). I'd guess that it is really wild pointer corrupting memory, quite possibly due to a double free or something like that. Alan - it would be good to run with DEBUG_PAGE_ALLOC and SLUB debugging etc if you don't already do that? With /just/ DEBUG_PAGE_ALLOC defined, I have seen two general panic types: o A new double fault w/ SMP_DEBUG_PAGEALLOC problem (prob4.txt) o The NULL pointer dereference @ 0x858 (prob4a.txt) Enabling SLUB debugging to see what that shows Alan Begin: Loading essential drivers... ... [6.680626] fuse init (API version 7.9) [6.680626] modprobe used greatest stack depth: 1720 bytes left [6.704224] double fault: [1] SMP DEBUG_PAGEALLOC [6.704224] CPU 1 [6.704224] Modules linked in: processor(+) fan thermal_sys fuse [6.710629] Pid: 1259, comm: modprobe Not tainted 2.6.27-rc3 #30 [6.710629] RIP: 0010:[802214dc] [802214dc] flat_send_IPI_allbutself+0x2c/0x80 [6.710629] RSP: 0018:88021a513ff8 EFLAGS: 00010282 [6.710629] RAX: 805f5520 RBX: 88021a5141f8 RCX: 003f [6.710629] RDX: 0200 RSI: 80bf7920 RDI: 88021a5141f8 [6.710629] RBP: 88021a514408 R08: 0040 R09: 0040 [6.710629] R10: 1000 R11: R12: 00fc [6.710629] R13: 88021a514eb8 R14: 0001 R15: 8021cba0 [6.710629] FS: 7f39acc136e0() GS:88022fc81a00() knlGS: [6.710629] CS: 0010 DS: ES: CR0: 8005003b [6.710629] CR2: 88021a513fe8 CR3: 00021a469000 CR4: 06e0 [6.710629] DR0: DR1: DR2: [6.710629] DR3: DR6: 0ff0 DR7: 0400 [6.710629] Process modprobe (pid: 1259, threadinfo 88021a514000, task 88021a9da050) [6.710629] Stack: 1BUG: unable to handle kernel paging request at 88021a513ff8 [6.710629] IP: [8020dd22] show_stack_log_lvl+0x82/0x130 [6.710629] PGD 202063 PUD 10067 PMD 22f176163 PTE 80021a513160 [6.710629] Oops: [2] SMP DEBUG_PAGEALLOC [6.710629] CPU 1 [6.710629] Modules linked in: processor(+) fan thermal_sys fuse [6.710629] Pid: 1259, comm: modprobe Not tainted 2.6.27-rc3 #30 [6.710629] RIP: 0010:[8020dd22] [8020dd22] show_stack_log_lvl+0x82/0x130 [6.710629] RSP: 0018:88022f12ce28 EFLAGS: 00010046 [6.710629] RAX: 88022fc81a00 RBX: 88021a513ff8 RCX: 000c [6.710629] RDX: 88021a513ff8 RSI: 88022f12cf58 RDI: [6.710629] RBP: 88022f12ce78 R08: 8059aea9 R09: 0001 [6.710629] R10: R11: R12: [6.710629] R13: 88022f127fc0 R14: 88022f12bfc0 R15: [6.710629] FS: 7f39acc136e0() GS:88022fc81a00() knlGS: [6.710629] CS: 0010 DS: ES: CR0: 8005003b [6.710629] CR2: 88021a513ff8 CR3: 00021a469000 CR4: 06e0 [6.710629] DR0: DR1: DR2: [6.710629] DR3: DR6: 0ff0 DR7: 0400 [6.710629] Process modprobe (pid: 1259, threadinfo 88021a514000, task 88021a9da050) [6.710629] Stack: 88022f12ce78 8059aea9 88021a514408 88022f12cf58 [6.710629] 88021a513ff8 88021a9da050 88022f12cf58 [6.710629] 0040 002b 88022f12ceb8 8020dea8 [6.710629] Call Trace: [6.710629] #DF [8020dea8] show_registers+0xd8/0x260 [6.710629] [8021cba0] ? do_flush_tlb_all+0x0/0x40 [6.710629] [804a2083] __die+0xa3/0x120 [6.710629] [8020e073] die+0x43/0x90 [6.710629] [8020e1e3] do_double_fault+0x63/0x70 [6.710629] [8020d4fd] double_fault+0x7d/0x90 [6.710629] [8021cba0] ? do_flush_tlb_all+0x0/0x40 [6.710629] [802214dc] ? flat_send_IPI_allbutself+0x2c/0x80 [6.710629] EOE [6.710629] [6.710629] Code: 55 d0 85 c9 48 89 d3 7e 5a 45 31 e4 eb 44 4c 39 f3 77 44 66 0f 1f 44 00 00 74 7e 45 85 e4 74 0b 41 f6 c4 03 0f 1f 44 00 00 7 [6.720629] RIP [8020dd22] show_stack_log_lvl+0x82/0x130 [6.720629] RSP 88022f12ce28 [6.720629] CR2: 88021a513ff8 [6.720629] ---[ end trace d4fdf12ff3e07cc3 ]--- [7.052961] modprobe used greatest stack depth: 920 bytes left Killed [6.551876] all_generic_ide used greatest stack depth: 4784
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Adding in SLUB debugging doesn't show anything new (I think). Example boot log (w/ initcall_debug enabled) is at: http://free.linux.hp.com/~adb/bug.11342/prob5.txt This has happened 3 times in a row as well. Whilst this is being looked at, I'm going to fast-forward ahead to the latest in Linus' tree, and see if the problem is still occurring (I think Linus' point earlier about some sort of rogue timing and/or corruption bug is spot on, but it's probably better to see how close to today's tree I can reproduce this). I'll also try kernels w/ the problematic merge patch backed out to see if that still fixes (or more likely(?) just patches over the real problem). Alan -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
I built a kernel @ commit 83097aca8567a0bd593534853b71fe0fa9a75d69 Author: Arjan van de Ven [EMAIL PROTECTED] Date: Sat Aug 23 21:45:21 2008 -0700 And it fails like the others do o http://free.linux.hp.com/~adb/bug.11342/prob6.txt SMP_DEBUG_PAGEALLOC o http://free.linux.hp.com/~adb/bug.11342/prob6a.txt [7.591198] BUG: unable to handle kernel NULL pointer dereference at 0858 I then backed out /just/ the merge for commit 1c89ac55017f982355c7761e1c912c88c941483d Merge: 88fa08f... b1b135c... Author: Linus Torvalds [EMAIL PROTECTED] Date: Tue Aug 12 08:40:19 2008 -0700 And the machine has booted fine 5 times in a row. I've put the latest .config up at http://free.linux.hp.com/~adb/bug.11342/config.txt Is there /some/ way to break down the patches within the merged patch, and I could by-hand bisect through those? Here's what I did to take the latest tree, and back out that merge (to get booting kernels): git-diff 88fa08f67bee1a0c765237bdac106a32872f57d2..1c89ac55017f982355c7761e1c912c88c941483d | patch -p1 -R patching file Documentation/lguest/lguest.c patching file arch/powerpc/Kconfig patching file arch/x86/Kconfig patching file arch/x86/mm/Makefile patching file drivers/char/hvc_console.c patching file drivers/lguest/page_tables.c patching file include/linux/Kbuild Hunk #1 succeeded at 358 (offset 2 lines). patching file include/linux/init.h patching file include/linux/mm.h patching file init/main.c patching file kernel/module.c patching file kernel/stop_machine.c patching file mm/Kconfig patching file mm/util.c Alan -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Mon, 25 Aug 2008, Alan D. Brunelle wrote: Before adding any more debugging, this is the status of my kernel boots: 3 times in a row w/ this same error. (Primary problem is the same, secondary stacks differ of course.) Ok, so I took a closer look, and the oops really is suggestive.. [6.482953] busybox used greatest stack depth: 4840 bytes left Ok, 4840 bytes left out of 8kB. [6.521876] all_generic_ide used greatest stack depth: 4784 bytes left .. and this one is 4784 bytes left.. Begin: Loading essential drivers... ... [6.625509] fuse init (API version 7.9) [6.625509] modprobe used greatest stack depth: 1720 bytes left Uhhuh! The previous modprobe uses stack like mad. It could be fuse_init() that has done it, but looking at fuse, I seriously doubt it. It doesn't seem to do anything particularly bad. So something has used over 6kB of stack, and it may well be the module loading code itself. The next stage is the actual oops itself: [6.644854] ACPI: SSDT CFFD0D0A, 08C4 (r1 HPQOEM CPU_TM21 MSFT 10E) [6.651489] BUG: unable to handle kernel NULL pointer dereference at 0858 This really looks like ti-task-blocked_on = waiter; where ti-task is NULL. You probably have almost everything enabled in order to turn struct task_struct that big, but judging by your register state it's really an offset off a NULL pointer, not some small integer. Now, there is no way ti-task can _possibly_ be NULL. No way. Well, except that ti is just below the stack, and if you had a stack overflow that overwrote it. So I seriously do believe that you have run out of stack. If that is true, then it's quite likely that with DEBUG_PAGE_ALLOC you'll actually get a double fault, which in turn is fairly hard to debug (you look at it wrong and it turns into a triple fault which is going to just reboot your machine immediately). Now, the stack oveflow probably happened a few calls earlier (and just left your thread_info corrupted), but there is more reason to believe you have stack overflow and thread_info corruption later in your output: [7.024992] modprobe used greatest stack depth: 408 bytes left [7.030988] BUG: unable to handle kernel NULL pointer dereference at 0048 [7.031053] IP: [8023f39c] do_exit+0x28c/0xa10 Here there is only 408 bytes left, which is _way_ too little, but it's also an optimistic measure. What the stack code usage code does is to just see how many zeroes it can find on the stack. If you have a big stack frame somewhere, it's quite possible that it actually used all your stack and then some, but left a bunch of zeroes around. And the do_exit() oops is simply because once the thread_info is corrupted, all the basic thread data structures are crap, and yes, you're almost guaranteed to oops at that point. Could you make your kernel image available somewhere, and we can take a look at it? Some versions of gcc are total pigs when it comes to stack usage, and your exact configuration matters too. But yes, module loading is a bad case, for me sys_init_module() contains subq$392, %rsp #, which is probably mostly because of the insane inlining gcc does (ie it will likely have inlined every single function in that file that is only called once, and then it will make all local variables of all those functions alive over the whole function and allocate stack-space for them ALL AT THE SAME TIME). Gcc sometimes drives me mad. It's inlining decisions are almost always pure and utter sh*t. But clearly something changed for you to start triggering this, and I think that also explains why you bisected things to the merge commit rather than to any individual change - because it was probably not any individual change that pushed it over the limit, but two different changes that made for bigger stack pressure, and _together_ they pushed you over the limit. So it also explains why the merge you found had no possible merge errors on a source level - there were no actual clashes anywhere. Just a slow growth of stack that combined to something that overflowed. And yes, I bet the change by Arjan to use do_one_initcall() was _part_ of it. It adds roughly 112 bytes of stack pressure to that module loading path, because of the 64-byte array and the extra function call (8 bytes for return address) with at least 5 quad-words saved (40 bytes) for register spills. But there were probably other things happening too that made things worse. So if there is some place where you can upload your 'vmlinux' binary, it would be good. Linus -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Mon, 25 Aug 2008, Alan D. Brunelle wrote: With /just/ DEBUG_PAGE_ALLOC defined, I have seen two general panic types: o A new double fault w/ SMP_DEBUG_PAGEALLOC problem (prob4.txt) Yeah, that's a stack overflow. Confirmed. Linus -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Mon, 25 Aug 2008, Linus Torvalds wrote: Could you make your kernel image available somewhere, and we can take a look at it? Some versions of gcc are total pigs when it comes to stack usage, and your exact configuration matters too. But yes, module loading is a bad case, for me sys_init_module() contains subq$392, %rsp #, which is probably mostly because of the insane inlining gcc does (ie it will likely have inlined every single function in that file that is only called once, and then it will make all local variables of all those functions alive over the whole function and allocate stack-space for them ALL AT THE SAME TIME). I bet this one-liner will probably make your kernel work. It's not a full solution, but it will make the module-loading path lose _all_ of the above stack slots by just not inlining load_module() - the stack slots will still be used when the module is _loaded_, but by the time we actually callt he -init function they will have been released since it's not all in the same crazy function any more. I _seriously_ believe that we were better off back when gcc only inlined what we told it to inline, and never inlined on its own. The gcc inlining logic is pure and utter sh*t in an environment like the kernel where stack space is a valuable resource. Anyway, Alan, even if this solves your particular problem, I'd still like to see your kernel image, so that I can hunt for other problems like this.. Linus --- kernel/module.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/kernel/module.c b/kernel/module.c index 08864d2..9db1191 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -1799,7 +1799,7 @@ static void *module_alloc_update_bounds(unsigned long size) /* Allocate and load the module: note that size of section 0 is always zero, and we rely on this for optional sections. */ -static struct module *load_module(void __user *umod, +static noinline struct module *load_module(void __user *umod, unsigned long len, const char __user *uargs) { -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Linus Torvalds wrote: On Mon, 25 Aug 2008, Linus Torvalds wrote: Could you make your kernel image available somewhere, and we can take a look at it? Some versions of gcc are total pigs when it comes to stack usage, and your exact configuration matters too. But yes, module loading is a bad case, for me sys_init_module() contains subq$392, %rsp #, which is probably mostly because of the insane inlining gcc does (ie it will likely have inlined every single function in that file that is only called once, and then it will make all local variables of all those functions alive over the whole function and allocate stack-space for them ALL AT THE SAME TIME). Mine has: Dump of assembler code for function sys_init_module: 0x802688c0 sys_init_module+0: push %rbp 0x802688c1 sys_init_module+1: mov%rsp,%rbp 0x802688c4 sys_init_module+4: sub$0x1c0,%rsp 0x802688cb sys_init_module+11:mov%r12,-0x20(%rbp) 0x802688cf sys_init_module+15:mov%rdi,%r12 so 448 bytes. The kernel is up at: http://free.linux.hp.com/~adb/bug.11342/vmlinux (if you would let me know when you are through with it so I can free up some space there I'd appreciate it...) By doing the patch you provided, sys_init_module now looks like: Dump of assembler code for function sys_init_module: 0x8026aa20 sys_init_module+0: push %rbp 0x8026aa21 sys_init_module+1: mov%rsp,%rbp 0x8026aa24 sys_init_module+4: sub$0x20,%rsp 0x8026aa28 sys_init_module+8: mov%r14,0x18(%rsp) 0x8026aa2d sys_init_module+13:mov%rdi,%r14 So only 32 bytes. (But of course, load_module() exists, and now has 0x1d0 (464) bytes...) With the patch you provide, I /was/ able to repeatedly boot OK (latest tree, and I also ran the patch against the 26.27.rc3-based kernel I was having problems with initially, and that booted OK as well). Alan I bet this one-liner will probably make your kernel work. It's not a full solution, but it will make the module-loading path lose _all_ of the above stack slots by just not inlining load_module() - the stack slots will still be used when the module is _loaded_, but by the time we actually callt he -init function they will have been released since it's not all in the same crazy function any more. I _seriously_ believe that we were better off back when gcc only inlined what we told it to inline, and never inlined on its own. The gcc inlining logic is pure and utter sh*t in an environment like the kernel where stack space is a valuable resource. Anyway, Alan, even if this solves your particular problem, I'd still like to see your kernel image, so that I can hunt for other problems like this.. Linus --- kernel/module.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/kernel/module.c b/kernel/module.c index 08864d2..9db1191 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -1799,7 +1799,7 @@ static void *module_alloc_update_bounds(unsigned long size) /* Allocate and load the module: note that size of section 0 is always zero, and we rely on this for optional sections. */ -static struct module *load_module(void __user *umod, +static noinline struct module *load_module(void __user *umod, unsigned long len, const char __user *uargs) { -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Mon, 25 Aug 2008, Alan D. Brunelle wrote: Mine has: Dump of assembler code for function sys_init_module: 0x802688c4 sys_init_module+4: sub$0x1c0,%rsp so 448 bytes. Yeah, your build seems to have consistently bigger stack usage, and that may be due to some config option, but most likely it's a compiler version issue. But I think part of the reason is that you have frame pointers enabled: that makes the stack frames bigger not only because of the frame pointer save/restore, but also because you have more register pressure and thus spills. The kernel is up at: http://free.linux.hp.com/~adb/bug.11342/vmlinux (if you would let me know when you are through with it so I can free up some space there I'd appreciate it...) I'm downloading it now, I'll probably be done by the time you get this email. [ Update. Done. You can remove it ] By doing the patch you provided, sys_init_module now looks like: Dump of assembler code for function sys_init_module: 0x8026aa24 sys_init_module+4: sub$0x20,%rsp So only 32 bytes. (But of course, load_module() exists, and now has 0x1d0 (464) bytes...) Right - the stack usage didn't go away, but the _lifetimes_ changed. So now load_module() will still use almost 500 bytes of stack, and it will call other routines that use stack too, but the lifetime of that stack usage is no longer over the whole module loading and initialization part, it's purely over just the loading thing. And since the deep callchain came much later (in the actual -init routines), by the time we do that, we no longer now have the load_module stack usage active any more. With the patch you provide, I /was/ able to repeatedly boot OK (latest tree, and I also ran the patch against the 26.27.rc3-based kernel I was having problems with initially, and that booted OK as well). I had actually already committed it, because it was correct regardless (and gcc really is a total ass for doing that inlining to begin with), but it's good to have verification that the behaviour you saw was literally about this thing. I'll look at your vmlinux binary to see what else sucks from a stack depth standpoint, but one of the problems in this whole thing is that the stack usage is obviously both a static thing (with some functions using _way_ too much stack!) _and_ a dynamic thing (with the total stack use being not about any individual function, but the whole chain). My patch obviously doesn't change the static stack usage, it just moves it around a bit so that it's no longer on that same deep path, so the dynamic stack usage is much less. But I'll look at your vmlinux, see what stands out. Linus -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Linus Torvalds wrote: On Mon, 25 Aug 2008, Alan D. Brunelle wrote: Mine has: Dump of assembler code for function sys_init_module: 0x802688c4 sys_init_module+4: sub$0x1c0,%rsp so 448 bytes. Yeah, your build seems to have consistently bigger stack usage, and that may be due to some config option, but most likely it's a compiler version issue. I wonder if we ought to have a light version of make checkstack always run, but in such a way that we make a file with limits on the stack usage for key functions (and we can grow this list over time when we learn about critical ones).. and either warn very loudly or even fail the build if we're way over what could work. -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Mon, 25 Aug 2008, Linus Torvalds wrote: But I'll look at your vmlinux, see what stands out. Oops. I already see the problem. Your .config has soem _huge_ CPU count, doesn't it? checkstack.pl shows these things as the top problems: 0x80266234 smp_call_function_mask [vmlinux]:2736 0x80234747 __build_sched_domains [vmlinux]: 2232 0x8023523f __build_sched_domains [vmlinux]: 2232 0x8021e884 setup_IO_APIC_irq [vmlinux]: 1616 0x8021ee24 arch_setup_ht_irq [vmlinux]: 1600 0x8021f144 arch_setup_msi_irq [vmlinux]:1600 0x8021e3b0 __assign_irq_vector [vmlinux]: 1592 0x8021e626 __assign_irq_vector [vmlinux]: 1592 0x8023257e move_task_off_dead_cpu [vmlinux]:1592 0x802326e8 move_task_off_dead_cpu [vmlinux]:1592 0x8025dbc5 tick_handle_oneshot_broadcast [vmlinux]:1544 0x8025dcb4 tick_handle_oneshot_broadcast [vmlinux]:1544 0x803f3dc4 store_scaling_governor [vmlinux]:1376 0x80279ef4 cpuset_write_resmask [vmlinux]: 1360 0x803f465d cpufreq_add_dev [vmlinux]: 1352 0x803f495b cpufreq_add_dev [vmlinux]: 1352 0x803f3fc4 store_scaling_max_freq [vmlinux]:1328 0x803f4064 store_scaling_min_freq [vmlinux]:1328 0x803f44c4 cpufreq_update_policy [vmlinux]: 1328 .. and sys_init_module is actually way way down the list. I bet the only reason it showed up at all was because dynamically it was such a deep callchain, and part of that callchain probably called some of those really nasty things. Anyway, the reason smp_call_function_mask and friends have such _huge_ stack usages for you is that they contain a 'cpumask_t' on the stack. For example, for me, usign a sane NR_CPU, the size of the stack frame for smp_call_function_mask is under 200 bytes. For you, it's 2736 bytes. How about you make CONFIG_NR_CPU's something _sane_? Like 16? Or do you really have four thousand CPU's in that system? Oh, I guess you have the MAXSMP config enabled? I really think that was a bit too aggressive. Linus -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Mon, 25 Aug 2008, Linus Torvalds wrote: checkstack.pl shows these things as the top problems: 0x80266234 smp_call_function_mask [vmlinux]:2736 0x80234747 __build_sched_domains [vmlinux]: 2232 0x8023523f __build_sched_domains [vmlinux]: 2232 Anyway, the reason smp_call_function_mask and friends have such _huge_ stack usages for you is that they contain a 'cpumask_t' on the stack. In fact, they contain multiple CPU-masks, each 4k-bits - 512 bytes - in size. And they tend to call each other. Quite frankly, I don't think we were really ready for 4k CPU's. I'm going to commit this patch to make sure others don't do that many CPU's by mistake. It marks MAXCPU's as being 'broken' so you cannot select it, and also limits the number of CPU's that you _can_ select to just 512. Right now, 4k cpu's is known broken because of the stack usage. I'm not willing to debug more of these kinds of stack smashers, they're really nasty to work with. I wonder how many other random failures these have been involved with? This patch also makes the ifdef mess in Kconfig much cleaner and avoids duplicate definitions by just conditionally suppressing the question and giving higher defaults. We can enable MAXSMP and raise the CPU limits some time in the future. But that future is not going to be before 2.6.27 - the code simply isn't ready for it. The reason I picked 512 CPU's as the limit is that we _used_ to limit things to 255. So it's higher than it used to be, but low enough to still feel safe. Considering that a 4k-bit CPU mask (512 bytes) _almost_ worked, the 512-bit (64 bytes) masks are almost certainly fine. Still, sane people should limit their NR_CPUS to 8 or 16 or something like that. Very very few people really need the pain of big NR_CPUS. Not even just 512 CPU's. Travis, Ingo and Thomas cc'd, since they were involved in the original commit (1184dc2ffe2c8fb9afb766d870850f2c3165ef25) that raised the limit. Linus --- arch/x86/Kconfig | 30 -- 1 files changed, 8 insertions(+), 22 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 68d91c8..ed92864 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -577,35 +577,29 @@ config SWIOTLB config IOMMU_HELPER def_bool (CALGARY_IOMMU || GART_IOMMU || SWIOTLB || AMD_IOMMU) + config MAXSMP bool Configure Maximum number of SMP Processors and NUMA Nodes - depends on X86_64 SMP + depends on X86_64 SMP BROKEN default n help Configure maximum number of CPUS and NUMA Nodes for this architecture. If unsure, say N. -if MAXSMP -config NR_CPUS - int - default 4096 -endif - -if !MAXSMP config NR_CPUS - int Maximum number of CPUs (2-4096) - range 2 4096 + int Maximum number of CPUs (2-512) if !MAXSMP + range 2 512 depends on SMP + default 4096 if MAXSMP default 32 if X86_NUMAQ || X86_SUMMIT || X86_BIGSMP || X86_ES7000 default 8 help This allows you to specify the maximum number of CPUs which this - kernel will support. The maximum supported value is 4096 and the + kernel will support. The maximum supported value is 512 and the minimum value which makes sense is 2. This is purely to save memory - each supported CPU adds approximately eight kilobytes to the kernel image. -endif config SCHED_SMT bool SMT (Hyperthreading) scheduler support @@ -996,17 +990,10 @@ config NUMA_EMU into virtual nodes when booted with numa=fake=N, where N is the number of nodes. This is only useful for debugging. -if MAXSMP - config NODES_SHIFT - int - default 9 -endif - -if !MAXSMP -config NODES_SHIFT - int Maximum NUMA Nodes (as a power of 2) + int Maximum NUMA Nodes (as a power of 2) if !MAXSMP range 1 9 if X86_64 + default 9 if MAXSMP default 6 if X86_64 default 4 if X86_NUMAQ default 3 @@ -1014,7 +1001,6 @@ config NODES_SHIFT help Specify the maximum number of NUMA Nodes available on the target system. Increases memory reserved to accomodate various tables. -endif config HAVE_ARCH_BOOTMEM_NODE def_bool y -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Alan D. Brunelle wrote: I think you're right: the kernel as a whole may not be ready for 4,096 CPUs apparently... Mike has been working diligently on getting all these cpumasks off the stack for the last months and has created an infrastructure to do this. So I think we are close. It might just be a matter of merging some more patches that are still left in Ingo's tree. -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Sat, 23 Aug 2008, Rafael J. Wysocki wrote: The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11342 Subject : Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected Submitter : Alan D. Brunelle [EMAIL PROTECTED] Date : 2008-08-13 23:03 (11 days old) References: http://marc.info/?l=linux-kernelm=121866876027629w=4 Handled-By: Andrew Morton [EMAIL PROTECTED] This one makes no sense. It's triggering a BUG_ON(in_interrupt()), but then the call chain shows that there is no interrupt going on. Also, the bisection is senseless - there's a trivial change wrt do_one_initcall() that got merged, but everything else is trivial about lguest and has nothing to do with the whole CPU-init thing. But if it was that initcall one, then git bisect woul have pointed to it, not the merge. And the merge itself had no conflicts or anything else going on.. The fact that it came and went later also implies that it's probably just some timing-dependent thing or some subtle memory corruption, making the bisection result even less likely to be exact. But I'm adding Arjan and Rusty to the Cc, because that merge was takign Rusty's branch, and the do_one_initcall() is Arjan's commit. Since undoing that merge apparently does fix it, I'm wondering if something there just does end up triggering the problem. The do_one_commit() thing _is_ in the path of sys_init_module(), so it _is_ at least somewhat relevant from an oops standpoint. One thing the do_one_commit() thing does is to put more pressure on the stack due to that whole buffer for the printk's going on. Alan, can you try - seeing how consistent it is with one kernel (ie boot a known-bad kernel a few times just to see if it really is 100% consistent) - try enabling 'initcall_debug' on the kernel command line, to (a) see the new code actually do something and (b) see what it is actually calling just before. Hmm.. Linus -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
On Sat, 23 Aug 2008, Linus Torvalds wrote: This one makes no sense. It's triggering a BUG_ON(in_interrupt()), but then the call chain shows that there is no interrupt going on. Ahh, later in that thread there's another totally unrelated oops in debug_mutex_add_waiter(). I'd guess that it is really wild pointer corrupting memory, quite possibly due to a double free or something like that. Alan - it would be good to run with DEBUG_PAGE_ALLOC and SLUB debugging etc if you don't already do that? Linus -- To unsubscribe from this list: send the line unsubscribe kernel-testers in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html