Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-09-30 Thread Ingo Molnar

* Mike Travis [EMAIL PROTECTED] wrote:

  could you please send whatever .c changes you have already, so that 
  we can have a look at how the end result will look like? Doesnt have 
  to build, i'm just curious about how it looks like in practice, 
  semantically.
 
 
 I will, and the full allyesconfig does compile.  And it's basically 
 a benign change in that the functionality is still the same.  I'm 
 currently reordering it a bit to clean it up.

btw., are the resulting instructions also expected to be the same? If 
yes then you might want to verify it all by making sure the md5's of the 
.o's do not change.

(If that's not possible (gcc decides to compile it a bit differently) 
then no big deal, just wanted to mention the possibility.)

Ingo
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-09-30 Thread Mike Travis
Ingo Molnar wrote:
 * Mike Travis [EMAIL PROTECTED] wrote:
 
 could you please send whatever .c changes you have already, so that 
 we can have a look at how the end result will look like? Doesnt have 
 to build, i'm just curious about how it looks like in practice, 
 semantically.

 I will, and the full allyesconfig does compile.  And it's basically 
 a benign change in that the functionality is still the same.  I'm 
 currently reordering it a bit to clean it up.
 
 btw., are the resulting instructions also expected to be the same? If 
 yes then you might want to verify it all by making sure the md5's of the 
 .o's do not change.
 
 (If that's not possible (gcc decides to compile it a bit differently) 
 then no big deal, just wanted to mention the possibility.)
 
   Ingo

Well, not exactly... ;-)  It does institute the new API change that specifies
only pointers to cpumask's can be passed to functions and returned from
functions.  I really wanted the default cpumask_t to be a constant so those
instances where the passed in cpumask is used as a read/write temp variable
would be caught.  But it started getting messy.

One pain is:

typedef struct __cpumask_s *cpumask_t;
const cpumask_t xxx;

is not the same as:

typedef const struct __cpumask_s *const_cpumask_t;
const_cpumask_t xxx;

and I'm not exactly sure why.  It came up when I tried to declare
functions that returned a constant cpumask_t pointer (node_to_cpumask,
cpumask_of_cpu, etc.)

The other major change I'm contemplating is to remove cpumask_t completely
(maybe cpumask_ptr_t?).  This would force every instance of cpumask_t to be
examined.  (I found quite a few I had missed in my original edits when I
added the task struct temp cpumask's.)

Oh yeah, one question ... is current always valid?

Thanks,
Mike

--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-09-30 Thread Linus Torvalds


On Tue, 30 Sep 2008, Mike Travis wrote:
 
 One pain is:
 
   typedef struct __cpumask_s *cpumask_t;
   const cpumask_t xxx;
 
 is not the same as:
 
   typedef const struct __cpumask_s *const_cpumask_t;
   const_cpumask_t xxx;
 
 and I'm not exactly sure why.

Umm. The const has different 

One is

typedef const struct __cpumask_s *const_cpumask_t;

which becomes

(const struct __cpumask_s) *

while the other is

const cpumask_t xxx

which is

const (struct __cpumask_s *)

and if you look a bit more closely, you'll see that they are _obviously_ 
not the same thing at all.

Quite frankly, I personally do hate typedefs that end up being pointers, 
and used as pointers, without showing that in the source code.

When you do

type_t a;

fn(a);

I expect the code to essentially do a pass-by-value. But when the type_t 
is a pointer, that doesn't really work.

Your issue with 'const' is just another version of the same. You don't 
want the _pointer_ to be const, you want what it points _to_ to be const. 
But because you hid the pointerness inside the typedef, you simply cannot 
do that.

The problem with cpumask's, of course, is that for the small mask case, 
we really don't want it to be a pointer. So now it's sometimes a pointer 
and sometimes not. The typedef hides that, and I understand why it's a 
good idea, but I'm surprised you didn't understand what the implications 
were for 'const', and I'm now a bit more leery about this whole thing just 
because the typedef ends up hiding so much - it doesn't just hide the 
basic type, it hides a very basic *code* issue.

Linus
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-09-30 Thread Mike Travis
Linus Torvalds wrote:
 
 On Tue, 30 Sep 2008, Mike Travis wrote:
 One pain is:

  typedef struct __cpumask_s *cpumask_t;
  const cpumask_t xxx;

 is not the same as:

  typedef const struct __cpumask_s *const_cpumask_t;
  const_cpumask_t xxx;

 and I'm not exactly sure why.
 
 Umm. The const has different 
 
 One is
 
   typedef const struct __cpumask_s *const_cpumask_t;
 
 which becomes
 
   (const struct __cpumask_s) *
 
 while the other is
 
   const cpumask_t xxx
 
 which is
 
   const (struct __cpumask_s *)
 
 and if you look a bit more closely, you'll see that they are _obviously_ 
 not the same thing at all.

Thanks, yes that explains what I should have figured out.  (Nice way to
explain it btw... ;-)

 
 Quite frankly, I personally do hate typedefs that end up being pointers, 
 and used as pointers, without showing that in the source code.
 
 When you do
 
   type_t a;
 
   fn(a);
 
 I expect the code to essentially do a pass-by-value. But when the type_t 
 is a pointer, that doesn't really work.

I agree, and as I mentioned, Rusty was working towards an alternative
method of declaring cpumask's which does not hide this.

My goal was to create an (apparent) opaque handle for cpumask_t and modify
the code so all changes to the contents of the cpumask are via functions.

 
 Your issue with 'const' is just another version of the same. You don't 
 want the _pointer_ to be const, you want what it points _to_ to be const. 
 But because you hid the pointerness inside the typedef, you simply cannot 
 do that.
 
 The problem with cpumask's, of course, is that for the small mask case, 
 we really don't want it to be a pointer. So now it's sometimes a pointer 
 and sometimes not. The typedef hides that, and I understand why it's a 
 good idea, but I'm surprised you didn't understand what the implications 
 were for 'const', and I'm now a bit more leery about this whole thing just 
 because the typedef ends up hiding so much - it doesn't just hide the 
 basic type, it hides a very basic *code* issue.

Perhaps then defining the cpumask as a list of unsigned longs (remove the
outer struct) would play more naturally.  Lists by definition are always
referenced by pointers.  I have another version of the patchset that shows
this and I'll post just the cpumask.h and doc files.

 
   Linus

Thanks!
Mike
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-09-30 Thread Rusty Russell
On Wednesday 01 October 2008 02:46:59 Linus Torvalds wrote:
 Quite frankly, I personally do hate typedefs that end up being pointers,
 and used as pointers, without showing that in the source code.
...
 I'm now a bit more leery about this whole thing just
 because the typedef ends up hiding so much - it doesn't just hide the
 basic type, it hides a very basic *code* issue.

Yes, this is why my version of the rework moved away from typedefs, except for 
the special case of cpumask_var_t for stack vars where this trick is really 
desired.

Everywhere else, the code becomes nice and clear: struct cpumask *.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-09-29 Thread Mike Travis
Ingo Molnar wrote:
 * Mike Travis [EMAIL PROTECTED] wrote:
 
 Hi Rusty,

 I've gotten some good traction on the changes in the following patch.  
 About 30% of the kernel is compiling right now and I'm picking up 
 errors and warnings as I'm going along.  I think it's doing most of 
 what we need.  Attempting to hide the cpumask struct definition caused 
 all kinds of problems with the inline functions and statically 
 declaring cpumask's.

 (The following patch is a combination of all the changes to cpumask.h 
 with the header from the first patch.  I'll send you a complete copy 
 in separate email.)
 
 could you please send whatever .c changes you have already, so that we 
 can have a look at how the end result will look like? Doesnt have to 
 build, i'm just curious about how it looks like in practice, 
 semantically.
 
   Ingo


I will, and the full allyesconfig does compile.  And it's basically a
benign change in that the functionality is still the same.  I'm currently
reordering it a bit to clean it up.

Thanks,
Mike
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-09-27 Thread Ingo Molnar

* Mike Travis [EMAIL PROTECTED] wrote:

 Hi Rusty,
 
 I've gotten some good traction on the changes in the following patch.  
 About 30% of the kernel is compiling right now and I'm picking up 
 errors and warnings as I'm going along.  I think it's doing most of 
 what we need.  Attempting to hide the cpumask struct definition caused 
 all kinds of problems with the inline functions and statically 
 declaring cpumask's.
 
 (The following patch is a combination of all the changes to cpumask.h 
 with the header from the first patch.  I'll send you a complete copy 
 in separate email.)

could you please send whatever .c changes you have already, so that we 
can have a look at how the end result will look like? Doesnt have to 
build, i'm just curious about how it looks like in practice, 
semantically.

Ingo
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-09-25 Thread Mike Travis
Rusty Russell wrote:
 On Friday 26 September 2008 01:42:13 Linus Torvalds wrote:
 On Thu, 25 Sep 2008, Rusty Russell wrote:
 This turns out to be awful in practice, mainly due to const. 
 Consider:

 #ifdef CONFIG_CPUMASK_OFFSTACK
 typedef unsigned long *cpumask_t;
 #else
 typedef unsigned long cpumask_t[1];
 #endif

 cpumask_t returns_cpumask(void);
 No. That's already broken. You cannot return a cpumask_t, regardless of
 interface. We must not do it regardless of how we pass those things
 around, since it generates _yet_ another temporary on the stack for the
 return slot for any kind of structure.
 
 No, for large NR_CPUS, cpumask_t is a pointer as shown.  And we have numerous 
 basic functions which return a cpumask_t.  Yes, this is part of the problem.
 
 What _is_ relevant is how we allocate them when we need temporary CPU
 masks. And _that_ is where my suggestion comes in. For small NR_CPUS, we
 really do want to allocate them on the stack, because calling kmalloc for
 a 4- or 8-byte allocation is just _stupid_.
 
 Right, but cpumask_t is used for far more than stack decls, thus the problems.
 
 I can make a separate cpumask_stack_t and use your method tho.  I think 
 that 
 might even reduce churn and allow us to do this in parts.
 
 which has to be converted some way. And I think it needs to be converted
 in a way that does *not* force us to call kmalloc() for idiotically small
 values.
 
 Yeah, got that.  But your suggestion to change cpumask_t turned out horribly 
 ugly.
 
 Cheers,
 Rusty.

Hi Rusty,

I've gotten some good traction on the changes in the following patch.  About 30%
of the kernel is compiling right now and I'm picking up errors and warnings as
I'm going along.  I think it's doing most of what we need.  Attempting to hide
the cpumask struct definition caused all kinds of problems with the inline
functions and statically declaring cpumask's.

(The following patch is a combination of all the changes to cpumask.h with the
header from the first patch.  I'll send you a complete copy in separate email.)

Thanks,
Mike
--

Subject: [RFC 1/1] cpumask: Provide new cpumask API

Provide new cpumask interface API.  The relevant change is basically
cpumask_t becomes an opaque object.  I believe this results in the
minimum amount of editing while still allowing the inline cpumask
functions, and the ability to declare static cpumask objects.


/* raw declaration */
struct __cpumask_data_s { DECLARE_BITMAP(bits, NR_CPUS); };

/* cpumask_map_t used for declaring static cpumask maps */
typedef struct __cpumask_data_s cpumask_map_t[1];

/* cpumask_t used for function args and return pointers */
typedef struct __cpumask_data_s *cpumask_t;

/* cpumask_var_t used for local variable, definition follows */
typedef struct __cpumask_data_s cpumask_var_t[1]; /* SMALL NR_CPUS */
typedef struct __cpumask_data_s *cpumask_var_t;   /* LARGE NR_CPUS */

/* replaces cpumask_t dst = (cpumask_t)src */
void cpus_copy(cpumask_t dst, const cpumask_t src);

Remove the '*' indirection in all references to cpumask_t objects.  You can
change the reference to the cpumask object but not the cpumask object itself
without using the functions that operate on cpumask objects (f.e. the cpu_*
operators).  Functions can return a cpumask_t (which is a pointer to the
cpumask object) and only be passed a cpumask_t.

All uses of cpumask_t on the stack are changed to be cpumask_var_t except
for pointers to static cpumask objects.  Allocation of local (temp) cpumask
objects will follow...

All cpumask operators now operate using nr_cpu_ids instead of NR_CPUS.  All
variants of the cpumask operators which used nr_cpu_ids instead of NR_CPUS
are deleted.

All variants of functions which use the (old cpumask_t *) pointer are deleted
(f.e. set_cpus_allowed_ptr()).

Based on code from Rusty Russell [EMAIL PROTECTED] (THANKS!!)

Signed-of-by: Mike Travis [EMAIL PROTECTED]

--- struct-cpumasks.orig/include/linux/cpumask.h2008-09-25 
20:40:59.303546951 -0700
+++ struct-cpumasks/include/linux/cpumask.h 2008-09-25 22:41:00.764472541 
-0700
@@ -3,7 +3,8 @@
 
 /*
  * Cpumasks provide a bitmap suitable for representing the
- * set of CPU's in a system, one bit position per CPU number.
+ * set of CPU's in a system, one bit position per CPU number up to
+ * nr_cpu_ids (= NR_CPUS).
  *
  * See detailed comments in the file linux/bitmap.h describing the
  * data type on which these cpumasks are based.
@@ -18,18 +19,6 @@
  * For details of cpus_fold(), see bitmap_fold in lib/bitmap.c.
  *
  * . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
- * Note: The alternate operations with the suffix _nr are used
- *   to limit the range of the loop to nr_cpu_ids instead of
- *   NR_CPUS when NR_CPUS  64 for performance reasons.
- *   If NR_CPUS is = 64 then most assembler bitmask
- *   operators execute faster with a constant range, so
- *   

Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-09-24 Thread Rusty Russell
On Wednesday 27 August 2008 02:51:46 Linus Torvalds wrote:
 On Tue, 26 Aug 2008, Yinghai Lu wrote:
  wonder if could use unsigned long * directly.

 I would actually suggest something like this:

  - we continue to have a magic cpumask_t.

  - we do different cases for big and small NR_CPUS:

   #if NR_CPUS = BITS_PER_LONG

   /*
* Make it an array - that way passing it as an argument will
* always pass it as a pointer!
*/
   typedef unsigned long cpumask_t[1];

   static inline void create_cpumask(cpumask_t *p)
   {
   *p = 0;
   }
   static inline void free_cpumask(cpumask_t *p)
   {
   }

   #else

   typedef unsigned long *cpumask_t;

   static inline void create_cpumask(cpumask_t *p)
   {
   *p = kcalloc(..);
   }

   static inline void free_cpumask(cpumask_t *p)
   {
   kfree(*p);
   }

   #endif

 and now after you do this, you can just do something like

   cpumask_t mycpu;

   create_cpumask(mycpu);
   ..
   free_cpumask(mycpu);

 and in between, you can use 'cpumask' as a pointer, because even when it
 is an array directly allocated on the stack, the array can always
 degenerate into a pointer by C type rules!

Hi Linus,

This turns out to be awful in practice, mainly due to const.  Consider:

#ifdef CONFIG_CPUMASK_OFFSTACK
typedef unsigned long *cpumask_t;
#else
typedef unsigned long cpumask_t[1];
#endif

cpumask_t returns_cpumask(void);

That's obviously illegal if cpumask_t is an array.  So we need a typedef which 
says really always a pointer.

typedef unsigned long *cpumask_return_t;
cpumask_return_t returns_cpumask(void);

But we usually want it to return a const ptr, and this doesn't work:

const cpumask_return_t returns_cpumask(void);
foo.c:12: warning: type qualifiers ignored on function return type

So now we need:
typedef const unsigned long *cpumask_const_return_t;
cpumask_const_return_t returns_cpumask(void);

OK, now consider a function which wants to take a const cpu bitmap:

void cpus_copy(cpumask_t dst, const cpumask_t src);
...
cpus_copy(cpus, returns_cpumask());
foo.c:34: warning: passing argument 2 of ‘cpus_copy’ discards 
qualifiers from 
pointer target type

Oops, that didn't work with the pointer version.  So we need another typedef:
#ifdef CONFIG_CPUMASK_OFFSTACK
typedef const unsigned long *cpumask_const_t;
#else
typedef const unsigned long cpumask_const_t[1];
#endif

void cpus_copy(cpumask_t dst, cpumask_const_t src);

We end up with this:
#ifdef CONFIG_CPUMASK_OFFSTACK
typedef unsigned long *cpumask_t;
typedef const unsigned long *cpumask_const_t;
#else
typedef unsigned long cpumask_t[1];
typedef const unsigned long cpumask_const_t[1];
#endif
typedef unsigned long *cpumask_return_t;
typedef const unsigned long *cpumask_const_return_t;
typedef unsigned long cpumask_data_t[1];

I can't see a neater way down this path, and I don't want to lose const.

I can see three alternatives:
1) An ONSTACK_CPUMASK(name) macro which declares struct cpumask name[1] or
   struct cpumask *name.  Same idea as yours, without the typedef.
2) Use a normal struct for cpumask, make everyone use pointers, but have an
   struct cpumask *alloc_stack_cpumask() which uses alloca() for small
   NR_CPUS.
3) Same, but just use kmalloc everywhere.  Optimize important cases by hand.

Anyone see a better way?
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-09-01 Thread Jes Sorensen

Linus Torvalds wrote:


On Fri, 29 Aug 2008, Jes Sorensen wrote:

I have only tested this on ia64, but it boots, so it's obviously
perfecttm :-)


Well, it probably boots because it doesn't really seem to _change_ much of 
anything.


Hi Linus,

I realize that, but as I have been doing this work on ia64, I didn't
want to mess too much with the x86 code. The ia64 part of the patch does
change things :-)

If someone with more x86 knowledge would want to try and make that part
of the patch better, and more importantly test it, I'd be quite keen on
helping out.

Cheers,
Jes



Things like this:

-static inline void arch_send_call_function_ipi(cpumask_t mask)
+static inline void arch_send_call_function_ipi(cpumask_t *mask)
 {
-   smp_ops.send_call_func_ipi(mask);
+   smp_ops.send_call_func_ipi(*mask);
 }

will still do that stack allocation at the time of the call. You'd have to 
pass the thing all the way down as a pointer..


Linus



--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-29 Thread Jes Sorensen

David Miller wrote:

Otherwise you have to modify cpumask_t objects and thus pluck
them onto the stack where they take up silly amounts of space.

Yes, I had proposed either modifying, or supplementing a new
smp_call function to pass the cpumask_t as a pointer (similar
to set_cpus_allowed_ptr.)  But an ABI change such as this was
not well received at the time.


What it seems to come down to is that any cpumask_t not inside of
a dynamically allocated object should be marked const.

And that is something we can enforce at compile time.

Linus has just suggested dynamically allocating cpumask_t's
for such cases but I don't see that as the fix either.

Just mark them const and enforce that cpumask_t objects can only
be modified when they appear in dynamically allocated objects.


Dave and others,

Sorry if I jump into the middle of the thread. Stopped subscribing to
lkml for a while, so this is through the archives.

Ran into some of these issues with KVM too, and noticed just how much
we pass cpumask_t around in the smp_call functions :-( In fact, the
only arch that did pretty well on this front was sparc64.

I totally agree, that marking them const makes a ton of sense, but at
the same time I suggest we convert smp_call_function_mask() to take a
pointer to the cpumask_t. I whipped up the following patch, which cuts
down the amont of memcpy calls emitted quite a fair bit.

I have only tested this on ia64, but it boots, so it's obviously
perfecttm :-)

Comments, suggestions welcome.

I have a followup patch that makes virt/kvm/kvm_main.c use the new
interface.

Cheers,
Jes

Change smp_call_function_mask() to take a pointer to the cpumask_t
rather than passing it by value. This avoids recursive copies of the
cpumask_t on the stack in the IPI call. For large NR_CPUS, this is
particularly bad, and the cost of doing this for
NR_CPUS  bits_per_long is negligeble.

Signed-off-by: Jes Sorensen [EMAIL PROTECTED]

---
 arch/alpha/include/asm/smp.h|2 +-
 arch/alpha/kernel/smp.c |4 ++--
 arch/arm/include/asm/smp.h  |2 +-
 arch/arm/kernel/smp.c   |4 ++--
 arch/ia64/include/asm/smp.h |2 +-
 arch/ia64/kernel/smp.c  |6 +++---
 arch/m32r/kernel/smp.c  |4 ++--
 arch/mips/kernel/smp.c  |4 ++--
 arch/parisc/kernel/smp.c|6 +++---
 arch/powerpc/include/asm/smp.h  |2 +-
 arch/powerpc/kernel/smp.c   |4 ++--
 arch/sh/include/asm/smp.h   |2 +-
 arch/sh/kernel/smp.c|4 ++--
 arch/sparc/include/asm/smp_64.h |2 +-
 arch/sparc64/kernel/smp.c   |4 ++--
 include/asm-m32r/smp.h  |2 +-
 include/asm-mips/smp.h  |2 +-
 include/asm-parisc/smp.h|2 +-
 include/asm-x86/smp.h   |4 ++--
 include/linux/smp.h |2 +-
 kernel/smp.c|   15 ---
 virt/kvm/kvm_main.c |4 ++--
 22 files changed, 42 insertions(+), 41 deletions(-)

Index: linux-2.6.git/arch/alpha/include/asm/smp.h
===
--- linux-2.6.git.orig/arch/alpha/include/asm/smp.h
+++ linux-2.6.git/arch/alpha/include/asm/smp.h
@@ -48,7 +48,7 @@
 #define cpu_possible_map   cpu_present_map
 
 extern void arch_send_call_function_single_ipi(int cpu);
-extern void arch_send_call_function_ipi(cpumask_t mask);
+extern void arch_send_call_function_ipi(cpumask_t *mask);
 
 #else /* CONFIG_SMP */
 
Index: linux-2.6.git/arch/alpha/kernel/smp.c
===
--- linux-2.6.git.orig/arch/alpha/kernel/smp.c
+++ linux-2.6.git/arch/alpha/kernel/smp.c
@@ -637,9 +637,9 @@
send_ipi_message(to_whom, IPI_CPU_STOP);
 }
 
-void arch_send_call_function_ipi(cpumask_t mask)
+void arch_send_call_function_ipi(cpumask_t *mask)
 {
-   send_ipi_message(mask, IPI_CALL_FUNC);
+   send_ipi_message(*mask, IPI_CALL_FUNC);
 }
 
 void arch_send_call_function_single_ipi(int cpu)
Index: linux-2.6.git/arch/arm/include/asm/smp.h
===
--- linux-2.6.git.orig/arch/arm/include/asm/smp.h
+++ linux-2.6.git/arch/arm/include/asm/smp.h
@@ -102,7 +102,7 @@
 extern void platform_cpu_enable(unsigned int cpu);
 
 extern void arch_send_call_function_single_ipi(int cpu);
-extern void arch_send_call_function_ipi(cpumask_t mask);
+extern void arch_send_call_function_ipi(cpumask_t *mask);
 
 /*
  * Local timer interrupt handling function (can be IPI'ed).
Index: linux-2.6.git/arch/arm/kernel/smp.c
===
--- linux-2.6.git.orig/arch/arm/kernel/smp.c
+++ linux-2.6.git/arch/arm/kernel/smp.c
@@ -356,9 +356,9 @@
local_irq_restore(flags);
 }
 
-void arch_send_call_function_ipi(cpumask_t mask)
+void arch_send_call_function_ipi(cpumask_t *mask)
 {
-   send_ipi_message(mask, IPI_CALL_FUNC);
+   send_ipi_message(*mask, IPI_CALL_FUNC);
 }
 
 void 

Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-29 Thread Linus Torvalds


On Fri, 29 Aug 2008, Jes Sorensen wrote:
 
 I have only tested this on ia64, but it boots, so it's obviously
 perfecttm :-)

Well, it probably boots because it doesn't really seem to _change_ much of 
anything.

Things like this:

-static inline void arch_send_call_function_ipi(cpumask_t mask)
+static inline void arch_send_call_function_ipi(cpumask_t *mask)
 {
-   smp_ops.send_call_func_ipi(mask);
+   smp_ops.send_call_func_ipi(*mask);
 }

will still do that stack allocation at the time of the call. You'd have to 
pass the thing all the way down as a pointer..

Linus


--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-29 Thread David Miller
From: Linus Torvalds [EMAIL PROTECTED]
Date: Fri, 29 Aug 2008 09:14:44 -0700 (PDT)

 Well, it probably boots because it doesn't really seem to _change_ much of 
 anything.
 
 Things like this:
 
   -static inline void arch_send_call_function_ipi(cpumask_t mask)
   +static inline void arch_send_call_function_ipi(cpumask_t *mask)
{
   -   smp_ops.send_call_func_ipi(mask);
   +   smp_ops.send_call_func_ipi(*mask);
}
 
 will still do that stack allocation at the time of the call. You'd have to 
 pass the thing all the way down as a pointer..

True, but we have to get there one step at a time.

BTW, sparc64 already wants a pointer here, so it's completely ready for
this:

void arch_send_call_function_ipi(cpumask_t mask)
{
xcall_deliver((u64) xcall_call_function, 0, 0, mask);
}
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-28 Thread Adrian Bunk
On Thu, Aug 28, 2008 at 09:32:13AM +0900, Paul Mundt wrote:
 On Wed, Aug 27, 2008 at 08:35:44PM +0300, Adrian Bunk wrote:
  On Thu, Aug 28, 2008 at 01:00:52AM +0900, Paul Mundt wrote:
   On Wed, Aug 27, 2008 at 02:58:30PM +0300, Adrian Bunk wrote:
   In addition to that, debugging the runaway stack users on 4k tends to be
   easier anyways since you end up blowing the stack a lot sooner. On sh
   we've had pretty good luck with it, though most of our users are using
   fairly deterministic workloads and continually profiling the footprint.
   Anything that runs away or uses an insane amount of stack space needs to
   be fixed well before that anyways, so catching it sooner is always
   preferable. I imagine the same case is true for m68knommu (even sans IRQ
   stacks).
  
  CONFIG_DEBUG_STACKOVERFLOW should give you the same information, and if
  wanted with an arbitrary limit.
  
 In some cases, yes. In the CONFIG_DEBUG_STACKOVERFLOW case the check is
 only performed from do_IRQ(), which is sporadic at best, especially on
 tickless. While it catches some things, it's not a complete solution in
 and of iteslf.
 
 In addition to this, there are even fewer platforms that support it than
 there are platforms that do 4k stacks. At first glance, it looks like
 it's only m32r, powerpc, sh, x86, and xtensa.
...

As far as I can see the only architectures that optionally offer 4kB 
stacks today are m68knommu, s390, sh and x86.

Did I miss some architectures or is 5  4 ;) ?

 Others support the Kconfig
 option, but don't seem to realize that it's not an option that the kernel
 does anything with by itself, and so don't actually do anything (ie,
 FRV).

Unless I miss anything these others include only FRV.

  IMHO there seems to currently be a mismatch between it's maintainance 
  cost and the actual number of users. That's in my opinion the main 
  problem with it, no matter in which direction it gets resolved.
  
 Perhaps that's true on x86, but in general I take issue with that. On sh
 we've had to do very little maintenance for it and most shipping products
 are using it today (at least on MMU-Linux, we don't bother with it on
 nommu). Most of the problems we ran in to with 4k stacks tended to be
 stuff that we wanted to fix for 8k anyways. I suspect that this case is
 true for the other embedded platforms also.
...

Most stack issues are not platform or architecture specific.

The maintainance effort therefore mostly depends on whether a non-zero 
number of architectures uses 4kB stacks.

And if something is considered to be important for small embedded 
systems, but not supported on ARM, MIPS or PowerPC, then that's 
a bit strange.

cu
Adrian

-- 

   Is there not promise of rain? Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   Only a promise, Lao Er said.
   Pearl S. Buck - Dragon Seed

--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread Paul Mackerras
Linus Torvalds writes:

 4kB used to be the _only_ choice. And no, there weren't even irq stacks. 
 So that 4kB was not just the whole kernel call-chain, it was also all the 
 irq nesting above it.

I think your memory is failing you.  In 2.4 and earlier, the kernel
stack was 8kB minus the size of the task_struct, which sat at the
start of the 8kB.  For instance, from include/asm-i386/processor.h for
2.4.29:

#define THREAD_SIZE (2*PAGE_SIZE)
#define alloc_task_struct() ((struct task_struct *) 
__get_free_pages(GFP_KERNEL,1))
#define free_task_struct(p) free_pages((unsigned long) (p), 1)

Paul.
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread Nick Piggin
On Wednesday 27 August 2008 06:01, Mike Travis wrote:
 Dave Jones wrote:
 ...

  But yes, for this to be even remotely feasible, there has to be a
  negligable performance cost associated with it, which right now, we
  clearly don't have. Given that the number of people running 4096 CPU
  boxes even in a few years time will still be tiny, punishing the common
  case is obviously absurd.
 
  Dave

 I did do some fairly extensive benchmarking between configs of NR_CPUS =
 128 and 4096 and most performance hits were in the neighborhood of  5% on
 systems with 8 cpus and 4GB of memory (our most common test system).

5% is a pretty nasty performance hit... what sort of benchmarks are we
talking about here?

I just made some pretty crazy changes to the VM to get only around 5
or so % performance improvement in some workloads.

What places are making heavy use of cpumasks that causes such a slowdown?
Hopefully callers can mostly be improved so they don't need to use cpumasks
for common cases.

Until then, it would be kind of sad for a distro to ship a generic x86
kernel and lose 5% performance because it is set to 4096 CPUs...

But if I misunderstand and you're talking about specific microbenchmarks to
find the worst case for huge cpumasks, then I take that back.


 [But 
 changing cpumask_t's to be pointers instead of values will likely increase
 this.]  I've tried to be very sensitive to this issue with all my previous
 changes, so convincing the distros to set NR_CPUS=4096 would be as painless
 for them as possible. ;-)

 Btw, huge count cpu systems I don't think are that far away.  I believe the
 nextgen Larabbee chips will be geared towards HPC applications [instead of
 just GFX apps], and putting 4 of these chips on a motherboard would add up
 to 512 cpu threads (1024 if they support hyperthreading.)

It would be quite interesting if they make them cache coherent / MP capable.
Will they be?
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread Nick Piggin
On Wednesday 27 August 2008 17:05, David Miller wrote:
 From: Nick Piggin [EMAIL PROTECTED]
 Date: Wed, 27 Aug 2008 16:54:32 +1000

  5% is a pretty nasty performance hit... what sort of benchmarks are we
  talking about here?
 
  I just made some pretty crazy changes to the VM to get only around 5
  or so % performance improvement in some workloads.
 
  What places are making heavy use of cpumasks that causes such a slowdown?
  Hopefully callers can mostly be improved so they don't need to use
  cpumasks for common cases.

 It's almost certainly from the cross-call dispatch call chain.

 As just one example, just to do a TLB flush mm-cpu_vm_mask probably
 gets passed around as an aggregate two or three times on the way down
 to the APIC programming code on x86.  That's two or three 512 byte
 copies on the stack :)

Yeah, I see. That's stupid isn't it? (Well, I guess it was completely
sane when cpumasks were word sized ;))

Hopefully that accounts for a significant chunk...
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread Bernd Petrovitsch
On Tue, 2008-08-26 at 18:54 -0400, Parag Warudkar wrote:
 On Tue, Aug 26, 2008 at 5:04 PM, Linus Torvalds
 [EMAIL PROTECTED] wrote:
 
  And embedded people (the ones that might care about 1% code size) are the
  ones that would also want smaller stacks even more!
 
 This is something I never understood - embedded devices are not going
 to run more than a few processes and 4K*(Few Processes)
  IMHO is not worth a saving now a days even in embedded world given
 falling memory prices. Or do I misunderstand?

Falling prices are no reason to increase the amount of available RAM (or
other hardware).
Especially if you (intend to) build 1E5 devices - where every Euro
counts.

Bernd
-- 
Firmix Software GmbH   http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
  Embedded Linux Development and Services


--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread Alan Cox
 You have a good point that aiming at 4kB makes 8kB a very safe choice.

Not really no - we use separate IRQ stacks in 4K but not 8K mode on
x86-32. That means you've actually got no more space if you are unlucky
with the timing of events. The 8K mode is merely harder to debug.

If 4K stacks really are not safe then x86-32 really really needs to
switch to using IRQ stacks in 8K stack mode as well.

Alan
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread Bernd Petrovitsch
On Tue, 2008-08-26 at 22:16 -0400, Parag Warudkar wrote:
[...]
 Well, sure  - but the industry as a whole seems to have gone the other

The industry as a whole doesn't exist on that low level. You can't
compare the laptop and/or desktop computer market (where one may buy
today hardware that runs in 3 years with the next generation/release of
the OS and applications) with the e.g. WLAN router market where - from
the commercial point of view - every Euro counts (and where the
requirements for the lifetime of the device are long frozen before the
thing gets in a shop).

 way - do more with more at the similar or lower price points!
 By that definition of less is better we should try and make the kernel
 memory pageable (or has someone already done that?) - Windows does it,

That doesn't help as in really small devices (like WLAN routers, cable
modems, etc.) you run without any means of paging/swapping. And even
binaries/read-only files are not necessarily executable in place (but
must be loaded into RAM). So you can't flush these pages.

And pageable kernel memory doesn't come for free - even if one only
counts the increased code and it's complexity.

 by default ;)

Which is more a sign that it is probably a very bad idea.

Bernd
-- 
Firmix Software GmbH   http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
  Embedded Linux Development and Services


--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread David Miller
From: Nick Piggin [EMAIL PROTECTED]
Date: Wed, 27 Aug 2008 17:47:14 +1000

 Yeah, I see. That's stupid isn't it? (Well, I guess it was completely
 sane when cpumasks were word sized ;))
 
 Hopefully that accounts for a significant chunk...

There is a lot of indirect costs that are hard to see as well.

Two things a lot of these cross-call dispatch paths do is:

1) Clear self-cpu

2) AND with cpus_online

#1 can normally be a simple bit clear, but some places can also
implement this with something like cpus_andn(X, cpumask_of_cpu(cpu))

It's simply easier to move those two things down to the bottom of
the APIC programming code, they just loop over the cpumask doing
an expensive APIC I/O operation anyways, might as well overlap it
with these skip self-cpu and skip not-online cpus checks.

And oh yeah we get the stack wastage fixed too, isn't what what we
were talking about? :-)

--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread Alan Cox
 What about deep call chains? The problem with the uptake of 4K stacks
 seems to be that is not reliably provable that it will work under all
 circumstances.

On x86-32 with 8K stacks your IRQ paths share them so that is even harder
to prove (not that you can prove any of them) and the bugs are more
obscure and random.
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread Adrian Bunk
On Tue, Aug 26, 2008 at 05:28:37PM -0700, Linus Torvalds wrote:
 
 
 On Wed, 27 Aug 2008, Adrian Bunk wrote:
  
  When did we get callpaths like like nfs+xfs+md+scsi reliably 
  working with 4kB stacks on x86-32?
 
 XFS may never have been usable, but the rest, sure.
 
 And you seem to be making this whole argument an excuse to SUCK, adn an 
 excuse to let gcc crap even more on our stack space.
 
 Why?
 
 Why aren't you saying that we should be able to do better? Instead, you 
 seem to asking us to do even worse than we do now?

My main point is:
- getting 4kB stacks working reliably is a hard task
- having an eye on gcc increasing the stack usage, and fixing it if
  required, is relatively easy

If we should be able to do better at getting (and keeping) 4kB stacks 
working, then coping with possible inlining problems caused by gcc
should not be a big problem for us.

   Linus

cu
Adrian

-- 

   Is there not promise of rain? Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   Only a promise, Lao Er said.
   Pearl S. Buck - Dragon Seed

--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread Parag Warudkar
On Wed, Aug 27, 2008 at 4:25 AM, Alan Cox [EMAIL PROTECTED] wrote:
 You have a good point that aiming at 4kB makes 8kB a very safe choice.

 Not really no - we use separate IRQ stacks in 4K but not 8K mode on
 x86-32. That means you've actually got no more space if you are unlucky
 with the timing of events. The 8K mode is merely harder to debug.


By your logic though, XFS on x86 should work fine with 4K stacks -
many will attest that it does not and blows up due to stack issues.

I have first hand experiences of things blowing up with deep call
chains when using 4K stacks where 8K worked just fine on same
workload.

So there is definitely some other problem with 4K stacks.

Thanks
Parag
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread Parag Warudkar
On Wed, Aug 27, 2008 at 5:00 AM, Bernd Petrovitsch [EMAIL PROTECTED] wrote:


 They probably gave the idea pretty soon because you need to
 rework/improve large parts of the kernel + drivers (and that has two
 major problems - it consumes a lot of man power for no new features and
 everything must be completely tested again[0] and it adds new risks).
 And that is practically impossible if one sells stable driver APIs for
 3rd party (commercial) drivers because these must be changed too.


But not many embedded Linux arches support 4K stacks like Adrian
pointed out earlier.
So the same (lot of man power requirement) would apply to Linux.

Sure it will be good - but how reasonable it is to attempt it and how
reliably it will work under all conceived loads - those are the
questions.

Thanks

Parag
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread Bernd Petrovitsch

On Wed, 2008-08-27 at 08:56 -0400, Parag Warudkar wrote:
 On Wed, Aug 27, 2008 at 5:00 AM, Bernd Petrovitsch [EMAIL PROTECTED] wrote:
  They probably gave the idea pretty soon because you need to
  rework/improve large parts of the kernel + drivers (and that has two
  major problems - it consumes a lot of man power for no new features and
  everything must be completely tested again[0] and it adds new risks).
  And that is practically impossible if one sells stable driver APIs for
  3rd party (commercial) drivers because these must be changed too.
 
 But not many embedded Linux arches support 4K stacks like Adrian

What is an embedded Linux arch?
Personally I encountered i386, ARM, MIPS and PPC in the embedded world.

 pointed out earlier.
 So the same (lot of man power requirement) would apply to Linux.

Of course. Look at the amount of work done by lots of people in that
area (including stack frame size reductions) and on-going discussions.

 Sure it will be good - but how reasonable it is to attempt it and how
 reliably it will work under all conceived loads - those are the
 questions.

If you develop an embedded system (which is partly system integration
of existing apps) to be installed in the field, you don't have that many
conceivable work loads compared to a desktop/server system. And you have
a fixed list of drivers and applications.
A usual approach is to run stress tests on several (or all)
subsystems/services/... in parallel and if the device survives it
functioning correctly, it is at least good enough.

Bernd
-- 
Firmix Software GmbH   http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
  Embedded Linux Development and Services


--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread Alan Cox
 By your logic though, XFS on x86 should work fine with 4K stacks -
 many will attest that it does not and blows up due to stack issues.
 
 I have first hand experiences of things blowing up with deep call
 chains when using 4K stacks where 8K worked just fine on same
 workload.
 
 So there is definitely some other problem with 4K stacks.

Nothing of the sort. If it blows up with a 4K stack it will almost
certainly blow up with an 8K stack *eventually* - when a heavy stack usage
coincides with a heavy stack using IRQ handler.

You won't catch it in simple testing, you won't catch it in trivial
simulation and it'll be incredibly hard to reproduce. Not the kind of bug
you want in a production system really. IRQ stacks make things much more
predictable.

Alan
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread Mike Travis
David Miller wrote:
 From: Nick Piggin [EMAIL PROTECTED]
 Date: Wed, 27 Aug 2008 16:54:32 +1000
 
 5% is a pretty nasty performance hit... what sort of benchmarks are we
 talking about here?

 I just made some pretty crazy changes to the VM to get only around 5
 or so % performance improvement in some workloads.

 What places are making heavy use of cpumasks that causes such a slowdown?
 Hopefully callers can mostly be improved so they don't need to use cpumasks
 for common cases.
 
 It's almost certainly from the cross-call dispatch call chain.
 
 As just one example, just to do a TLB flush mm-cpu_vm_mask probably
 gets passed around as an aggregate two or three times on the way down
 to the APIC programming code on x86.  That's two or three 512 byte
 copies on the stack :)
 
 Look at the sparc64 SMP code for how I solved the problem there.

I will, thanks!

Mike
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread Mike Travis
David Miller wrote:
 From: Nick Piggin [EMAIL PROTECTED]
 Date: Wed, 27 Aug 2008 17:47:14 +1000
 
 Yeah, I see. That's stupid isn't it? (Well, I guess it was completely
 sane when cpumasks were word sized ;))

 Hopefully that accounts for a significant chunk...
 
 There is a lot of indirect costs that are hard to see as well.
 
 Two things a lot of these cross-call dispatch paths do is:
 
 1) Clear self-cpu
 
 2) AND with cpus_online
 
 #1 can normally be a simple bit clear, but some places can also
 implement this with something like cpus_andn(X, cpumask_of_cpu(cpu))
 
 It's simply easier to move those two things down to the bottom of
 the APIC programming code, they just loop over the cpumask doing
 an expensive APIC I/O operation anyways, might as well overlap it
 with these skip self-cpu and skip not-online cpus checks.
 
 And oh yeah we get the stack wastage fixed too, isn't what what we
 were talking about? :-)

Yes, the most time consuming part was determining whether a kmalloc
could safely be used in the context of the function, and what to
do about the out-of-memory problem.  Pushing that down to something
like:  for_each_cpu_thats_online(cpu, *maskptr) would remove the need for
many of the temp masks.  A simple if (cpu != me) would take care of
excluding self.  It might have better interaction with cpu hotplug
as well, since the online map would be checked just before the call
to that cpu is made.

Thanks,
Mike
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread Linus Torvalds


On Wed, 27 Aug 2008, Paul Mackerras wrote:
 
 I think your memory is failing you.  In 2.4 and earlier, the kernel
 stack was 8kB minus the size of the task_struct, which sat at the
 start of the 8kB.

Yup, you're right.

Linus
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread Jamie Lokier
Bernd Petrovitsch wrote:
 If you develop an embedded system (which is partly system integration
 of existing apps) to be installed in the field, you don't have that many
 conceivable work loads compared to a desktop/server system. And you have
 a fixed list of drivers and applications.

Hah!  Not in my line of embedded device.

32MB no-MMU ARM boards which people run new things and attach new
devices to rather often - without making new hardware.  Volume's too
low per individual application to get new hardware designed and made.

I'm seriously thinking of forwarding porting the 4 year old firmware
from 2.4.26 to 2.6.current, just to get new drivers and capabilities.
Backporting is tedious, so's feeling wretchedly far from the mainline
world.

 A usual approach is to run stress tests on several (or all)
 subsystems/services/... in parallel and if the device survives it
 functioning correctly, it is at least good enough.

Per application.

Some little devices run hundreds of different applications and
customers expect to customise, script themselves, and attach different
devices (over USB).  The next customer in the chain expects the bits
you supplied to work in a variety of unexpected situations, even when
you advise that it probably won't do that.

Much like desktop/server Linux, but on a small device where silly
little things like 'create a process' are a stress for the dear little
thing.

(My biggest lesson: insist on an MMU next time!)

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread Paul Mundt
On Wed, Aug 27, 2008 at 02:58:30PM +0300, Adrian Bunk wrote:
 On Tue, Aug 26, 2008 at 05:28:37PM -0700, Linus Torvalds wrote:
  On Wed, 27 Aug 2008, Adrian Bunk wrote:
   
   When did we get callpaths like like nfs+xfs+md+scsi reliably 
   working with 4kB stacks on x86-32?
  
  XFS may never have been usable, but the rest, sure.
  
  And you seem to be making this whole argument an excuse to SUCK, adn an 
  excuse to let gcc crap even more on our stack space.
  
  Why?
  
  Why aren't you saying that we should be able to do better? Instead, you 
  seem to asking us to do even worse than we do now?
 
 My main point is:
 - getting 4kB stacks working reliably is a hard task
 - having an eye on gcc increasing the stack usage, and fixing it if
   required, is relatively easy
 
 If we should be able to do better at getting (and keeping) 4kB stacks 
 working, then coping with possible inlining problems caused by gcc
 should not be a big problem for us.
 
Out of the architectures you've mentioned for 4k stacks, they also tend
to do IRQ stacks, which is something you seem to have overlooked.

In addition to that, debugging the runaway stack users on 4k tends to be
easier anyways since you end up blowing the stack a lot sooner. On sh
we've had pretty good luck with it, though most of our users are using
fairly deterministic workloads and continually profiling the footprint.
Anything that runs away or uses an insane amount of stack space needs to
be fixed well before that anyways, so catching it sooner is always
preferable. I imagine the same case is true for m68knommu (even sans IRQ
stacks).

Things might be more sensitive on x86, but it's certainly not something
that's a huge problem for the various embedded platforms to wire up,
whether they want to go the IRQ stack route or not.

In any event, lack of support for something on embedded architectures in
the kernel is more often due to apathy/utter indifference on the part of
the architecture maintainer rather than being indicative of any intrinsic
difficulty in supporting the thing in question. Most new features on the
lesser maintained architectures tend to end up there either out of peer
pressure or copying-and-pasting accidents rather than any sort of design.
;-)
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread Jamie Lokier
Linus Torvalds wrote:
  Most LOCs of the kernel are not written by people like you or Al Viro or 
  David Miller, and the average kernel developer is unlikely to do it as 
  good as gcc.
 
 Sure. But we do have tools. We do have checkstack.pl, it's just that it 
 hasn't been an issue in a long time, so I suspect many people didn't even 
 _realize_ we have it, and I certainly can attest to the fact that even 
 people who remember it - like me - don't actually tend to run it all that 
 often.

Sounds like what's really desired here isn't more worry and
unpredictability, but for GCC+Binutils to gain the ability to
calculate the stack depth over all callchains (doesn't have to be
exact, just an upper bound; annotate recursions) in a way that's good
enough to do on every compile, complain if a depth is exceeded
statically (or it can't be proven), and to gain the
architecture-independent option optimise to reduce stack usage.

  BTW:
  I just ran checkstack on a (roughly) allyesconfig kernel, and we have a 
  new driver that allocates unsigned char recvbuf[1500]; on the stack...
 
 Yeah, it's _way_ too easy to do bad things.

In my userspace code, I have macros tmp_alloc and tmp_free.  They must
be matched in the same function:

 unsigned char * recvbuf = tmp_alloc(1500);
 
 tmp_free(recvbuf);

When stack is plentiful, it maps to alloca() which is roughly
equivalent to using a stack variable.

When stack is constrained (as it is on my little devices), that maps
to xmalloc/free.  The kernel equivalent would be kmalloc GFP_ATOMIC
(perhaps).

With different macros to mine, it may be possible to map small
fixed-size requests exactly onto local variables, and large ones to
kmalloc().  A stab at it (not tested):

#define LOCAL_ALLOC_THRESHOLD 128

#define LOCAL_ALLOC(type, ptr)\
__typeof__(type) __attribute__((__unused__)) ptr##_local_struct;  \
__typeof__(type) * ptr =  \
  ((__builtin_constant_p(sizeof(type))\
 sizeof(type) = LOCAL_ALLOC_THRESHOLD) \
   ? ptr##_local_struct : kmalloc(sizeof(type), GFP_ATOMIC))

#define LOCAL_FREE(ptr)   \
((__builtin_constant_p(sizeof (*(ptr)))   \
   sizeof(*(ptr)) = LOCAL_ALLOC_THRESHOLD) \
 ? (void) 0 : kfree(ptr))

Would that be useful in the kernel?

I'm thinking if it were a commonly used pattern for temporary buffers,
unknown structures and arrays of macro-determined size, the new
driver author would be less likely to accidentally drop a big object
on the stack.

Obviously it would be nicer for GCC to code such a thing
automatically, but that really is wishful thinking.

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread Parag Warudkar
On Wed, Aug 27, 2008 at 9:21 AM, Alan Cox [EMAIL PROTECTED] wrote:
 By your logic though, XFS on x86 should work fine with 4K stacks -
 many will attest that it does not and blows up due to stack issues.

 I have first hand experiences of things blowing up with deep call
 chains when using 4K stacks where 8K worked just fine on same
 workload.

 So there is definitely some other problem with 4K stacks.

 Nothing of the sort. If it blows up with a 4K stack it will almost
 certainly blow up with an 8K stack *eventually* - when a heavy stack usage
 coincides with a heavy stack using IRQ handler.

 You won't catch it in simple testing, you won't catch it in trivial
 simulation and it'll be incredibly hard to reproduce. Not the kind of bug
 you want in a production system really. IRQ stacks make things much more
 predictable.


I see - so if I end up having a workload on 8k where heavy stack using
IRQs and deep kernel call chains come at the same time - even 8K will
blow up.
So 4K will blow too except that it doesn't require IRQs also to use
heavy stack, just XFS is good enough :)

It then seems like the IRQs using lot of stack is not so much of a
problem in the current kernel as much as deeper call chains and stack
usage of normal non-irq path code is.
So 8k makes it possible for the deeper call chains of non-irq path to
survive since they get better part of the 8K to themselves and IRQs
can do with less almost always.

At least that's what I can derive from the fact that we do not have
lots of reports of 8K stack blowing up.

Thanks

Parag
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread Bernd Petrovitsch
On Wed, 2008-08-27 at 16:48 +0100, Jamie Lokier wrote:
 Bernd Petrovitsch wrote:
  If you develop an embedded system (which is partly system integration
  of existing apps) to be installed in the field, you don't have that many
  conceivable work loads compared to a desktop/server system. And you have
  a fixed list of drivers and applications.
 
 Hah!  Not in my line of embedded device.
 
 32MB no-MMU ARM boards which people run new things and attach new
 devices to rather often - without making new hardware.  Volume's too
 low per individual application to get new hardware designed and made.

Yes, you may have several products on the same hardware with somewhat
differing requirements (or not). But that is much less than a general
purpose system IMHO.

 I'm seriously thinking of forwarding porting the 4 year old firmware
 from 2.4.26 to 2.6.current, just to get new drivers and capabilities.

That sounds reasonable (and I never meant maintaining the old system
infinitely. Actually once the thing is shipped it usually enters deep
maintenance mode and the next is more a fork from the old).

 Backporting is tedious, so's feeling wretchedly far from the mainline
 world.

ACK. But that also depends on amount local changes (and sorry, but not
all locally necessary patches would be accepted in mainline in any way).

  A usual approach is to run stress tests on several (or all)
  subsystems/services/... in parallel and if the device survives it
  functioning correctly, it is at least good enough.
 
 Per application.
 
 Some little devices run hundreds of different applications and
 customers expect to customise, script themselves, and attach different
 devices (over USB).  The next customer in the chain expects the bits
 you supplied to work in a variety of unexpected situations, even when
 you advise that it probably won't do that.

Basically their problem. Yes, they actually think they get a Linux
system where they can do everything and it simply works.

Oh, that's obviously not a usual WLAN-router style of product (where
you are not expected to actually login on a console or per ssh).

 Much like desktop/server Linux, but on a small device where silly
 little things like 'create a process' are a stress for the dear little
 thing.
 
 (My biggest lesson: insist on an MMU next time!)

ACK. We avoid MMU-less hardware too - especially since there is enough
hardware with a MMU around.

Bernd
-- 
Firmix Software GmbH   http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
  Embedded Linux Development and Services


--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread Jamie Lokier
Bernd Petrovitsch wrote:
  32MB no-MMU ARM boards which people run new things and attach new
  devices to rather often - without making new hardware.  Volume's too
  low per individual application to get new hardware designed and made.
 
 Yes, you may have several products on the same hardware with somewhat
 differing requirements (or not). But that is much less than a general
 purpose system IMHO.

It is, but the idea that small embedded systems go through a 'all
components are known, drivers are known, test and if it passes it's
shippable' does not always apply.

  I'm seriously thinking of forwarding porting the 4 year old firmware
  from 2.4.26 to 2.6.current, just to get new drivers and capabilities.
 
 That sounds reasonable (and I never meant maintaining the old system
 infinitely.

Sounds reasonable, but it's vetoed for anticipated time and cost,
compared with backporting on demand.  Fair enough, since 2.6.current
doesn't support ARM no-MMU last I heard ('soon'?).

On the other hand, the 2.6 anti-fragmentation patches, including
latest SLUB stuff, ironically meant to help big machines, sound really
appealing for my current problem and totally unrealistic to
backport...

 ACK. We avoid MMU-less hardware too - especially since there is enough
 hardware with a MMU around.

I can't emphasise enough how much difference MMU makes to Linux userspace.

It's practically: MMU = standard Linux (with less RAM), have everything.
No-MMU = lots of familiar 'Linux' things not available or break.

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread Bernd Petrovitsch
On Mit, 2008-08-27 at 18:51 +0100, Jamie Lokier wrote:
 Bernd Petrovitsch wrote:
[...]
 It is, but the idea that small embedded systems go through a 'all
 components are known, drivers are known, test and if it passes it's
 shippable' does not always apply.

Not always but often enough. And yes, there is ARM-based embedded
hardware with 1GB Flash-RAM and 128MB RAM.

   I'm seriously thinking of forwarding porting the 4 year old firmware
   from 2.4.26 to 2.6.current, just to get new drivers and capabilities.
  
  That sounds reasonable (and I never meant maintaining the old system
  infinitely.
 
 Sounds reasonable, but it's vetoed for anticipated time and cost,

That is to be expected;-)

[]
  ACK. We avoid MMU-less hardware too - especially since there is enough
  hardware with a MMU around.
 
 I can't emphasise enough how much difference MMU makes to Linux userspace.
 
 It's practically: MMU = standard Linux (with less RAM), have everything.
 No-MMU = lots of familiar 'Linux' things not available or break.

ACK. And tell that a customer that everything is more effort and more
risk and not just simply cross-compile it as it runs on my desktop
too.

Bernd
-- 
Firmix Software GmbH   http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
  Embedded Linux Development and Services

--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread Paul Mundt
On Wed, Aug 27, 2008 at 05:46:05PM -0700, David Miller wrote:
 From: Paul Mundt [EMAIL PROTECTED]
 Date: Thu, 28 Aug 2008 09:32:13 +0900
 
  On Wed, Aug 27, 2008 at 08:35:44PM +0300, Adrian Bunk wrote:
   CONFIG_DEBUG_STACKOVERFLOW should give you the same information, and if
   wanted with an arbitrary limit.
 
  In some cases, yes. In the CONFIG_DEBUG_STACKOVERFLOW case the check is
  only performed from do_IRQ(), which is sporadic at best, especially on
  tickless. While it catches some things, it's not a complete solution in
  and of iteslf.
 
 BTW, on sparc64 we have a stack overflow checker that runs via
 the profiling _mcount hook.  So every function call we check
 if the stack is getting overused.
 
 If so, we jump onto a special static debugging stack and print
 the stack overflow message.
 
 And yes it works with IRQ stacks which is all that sparc64 uses
 nowadays.
 
 Perhaps this is useful enough to make generic.

Thanks for the pointer, I'll take a look at it!
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-27 Thread Greg Ungerer


Paul Mundt wrote:

On Wed, Aug 27, 2008 at 02:58:30PM +0300, Adrian Bunk wrote:

On Tue, Aug 26, 2008 at 05:28:37PM -0700, Linus Torvalds wrote:

On Wed, 27 Aug 2008, Adrian Bunk wrote:
When did we get callpaths like like nfs+xfs+md+scsi reliably 
working with 4kB stacks on x86-32?

XFS may never have been usable, but the rest, sure.

And you seem to be making this whole argument an excuse to SUCK, adn an 
excuse to let gcc crap even more on our stack space.


Why?

Why aren't you saying that we should be able to do better? Instead, you 
seem to asking us to do even worse than we do now?

My main point is:
- getting 4kB stacks working reliably is a hard task
- having an eye on gcc increasing the stack usage, and fixing it if
  required, is relatively easy

If we should be able to do better at getting (and keeping) 4kB stacks 
working, then coping with possible inlining problems caused by gcc

should not be a big problem for us.


Out of the architectures you've mentioned for 4k stacks, they also tend
to do IRQ stacks, which is something you seem to have overlooked.

In addition to that, debugging the runaway stack users on 4k tends to be
easier anyways since you end up blowing the stack a lot sooner. On sh
we've had pretty good luck with it, though most of our users are using
fairly deterministic workloads and continually profiling the footprint.
Anything that runs away or uses an insane amount of stack space needs to
be fixed well before that anyways, so catching it sooner is always
preferable. I imagine the same case is true for m68knommu (even sans IRQ
stacks).


Yep, definitely true for m68knommu in my experience. I haven't had
any problems with 4k stacks recently. But yes the workloads do tend
to be constrained - and almost never use any of the more exotic
filesystems or drivers.




Things might be more sensitive on x86, but it's certainly not something
that's a huge problem for the various embedded platforms to wire up,
whether they want to go the IRQ stack route or not.

In any event, lack of support for something on embedded architectures in
the kernel is more often due to apathy/utter indifference on the part of
the architecture maintainer rather than being indicative of any intrinsic
difficulty in supporting the thing in question. Most new features on the
lesser maintained architectures tend to end up there either out of peer
pressure or copying-and-pasting accidents rather than any sort of design.
;-)


Indeed :-)

Regards
Greg



Greg Ungerer  --  Chief Software Dude   EMAIL: [EMAIL PROTECTED]
Secure Computing CorporationPHONE:   +61 7 3435 2888
825 Stanley St, FAX: +61 7 3891 3630
Woolloongabba, QLD, 4102, Australia WEB: http://www.SnapGear.com
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Ingo Molnar

* Linus Torvalds [EMAIL PROTECTED] wrote:

 On Mon, 25 Aug 2008, Linus Torvalds wrote:
  
  checkstack.pl shows these things as the top problems:
  
  0x80266234 smp_call_function_mask [vmlinux]:2736
  0x80234747 __build_sched_domains [vmlinux]: 2232
  0x8023523f __build_sched_domains [vmlinux]: 2232
  
  Anyway, the reason smp_call_function_mask and friends have such _huge_ 
  stack usages for you is that they contain a 'cpumask_t' on the stack.
 
 In fact, they contain multiple CPU-masks, each 4k-bits - 512 bytes - in 
 size. And they tend to call each other.
 
 Quite frankly, I don't think we were really ready for 4k CPU's. I'm 
 going to commit this patch to make sure others don't do that many 
 CPU's by mistake. It marks MAXCPU's as being 'broken' so you cannot 
 select it, and also limits the number of CPU's that you _can_ select 
 to just 512.

yeah, that's OK i guess - distros can still enable 4K support if they 
wish to. Someone interested in improving the stack footprint situation 
should dust off the max-stack-footprint tracer so that we can catch 
these things in a more structured way.

And i guess the next generation of 4K CPUs support should just get away 
from cpumask_t-on-kernel-stack model altogether, as the current model is 
not maintainable. We tried the on-kernel-stack variant, and it really 
does not work reliably. We can fix this in v2.6.28.

Ingo
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread David Miller
From: Ingo Molnar [EMAIL PROTECTED]
Date: Tue, 26 Aug 2008 09:22:20 +0200

 And i guess the next generation of 4K CPUs support should just get away 
 from cpumask_t-on-kernel-stack model altogether, as the current model is 
 not maintainable. We tried the on-kernel-stack variant, and it really 
 does not work reliably. We can fix this in v2.6.28.

I recenetly did some work on sparc64 to use cpumask pointers
as much as possible.

The only case that didn't work was due to a limitation in
arch interfaces for the new generic smp_call_function() code.
It passes a cpumask_t instead of a pointer to one via
arch_send_call_function_ipi().

But other than that, the whole sparc64 SMP stuff uses cpumask_t
pointers only.

What it comes down to is that you have to do the self cpu
and other tests in the cross-call dispatch routines themselves,
instead of at the top-level working on cpumask_t objects.

Otherwise you have to modify cpumask_t objects and thus pluck
them onto the stack where they take up silly amounts of space.
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Yinghai Lu
On Tue, Aug 26, 2008 at 12:53 AM, Ingo Molnar [EMAIL PROTECTED] wrote:

 * David Miller [EMAIL PROTECTED] wrote:

 From: Ingo Molnar [EMAIL PROTECTED]
 Date: Tue, 26 Aug 2008 09:22:20 +0200

  And i guess the next generation of 4K CPUs support should just get away
  from cpumask_t-on-kernel-stack model altogether, as the current model is
  not maintainable. We tried the on-kernel-stack variant, and it really
  does not work reliably. We can fix this in v2.6.28.

 I recenetly did some work on sparc64 to use cpumask pointers as much
 as possible.

 The only case that didn't work was due to a limitation in arch
 interfaces for the new generic smp_call_function() code. It passes a
 cpumask_t instead of a pointer to one via
 arch_send_call_function_ipi().

 But other than that, the whole sparc64 SMP stuff uses cpumask_t
 pointers only.

wonder if could use unsigned long * directly.
so could dyn_array directly like

int cpumask_size;

unsigned long *online_cpu_map;
DEFINE_DYN_ARRAY(online_cpu_map, sizeof(unsigned long), cpumask_size,
PAGE_SIZE, NULL);

and after nr_cpu_ids is assigned, have
cpumask_size = (nr_cpu_ids + sizeof(unsigned long) - 1)/sizeof(unsigned long);

then we could NR_CPUS=4096 kernel to small system. ...

YH
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Linus Torvalds


On Tue, 26 Aug 2008, Yinghai Lu wrote:
 
 wonder if could use unsigned long * directly.

I would actually suggest something like this:

 - we continue to have a magic cpumask_t.

 - we do different cases for big and small NR_CPUS:

#if NR_CPUS = BITS_PER_LONG

/*
 * Make it an array - that way passing it as an argument will
 * always pass it as a pointer!
 */
typedef unsigned long cpumask_t[1];

static inline void create_cpumask(cpumask_t *p)
{
*p = 0;
}
static inline void free_cpumask(cpumask_t *p)
{
}

#else

typedef unsigned long *cpumask_t;

static inline void create_cpumask(cpumask_t *p)
{
*p = kcalloc(..);
}

static inline void free_cpumask(cpumask_t *p)
{
kfree(*p);
}

#endif

and now after you do this, you can just do something like

cpumask_t mycpu;

create_cpumask(mycpu);
..
free_cpumask(mycpu);

and in between, you can use 'cpumask' as a pointer, because even when it 
is an array directly allocated on the stack, the array can always 
degenerate into a pointer by C type rules!

And for the small-NR_CPUS case there is zero overhead.

Linus
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Yinghai Lu
On Tue, Aug 26, 2008 at 9:51 AM, Linus Torvalds
[EMAIL PROTECTED] wrote:


 On Tue, 26 Aug 2008, Yinghai Lu wrote:

 wonder if could use unsigned long * directly.

 I would actually suggest something like this:

  - we continue to have a magic cpumask_t.

  - we do different cases for big and small NR_CPUS:

#if NR_CPUS = BITS_PER_LONG

/*
 * Make it an array - that way passing it as an argument will
 * always pass it as a pointer!
 */
typedef unsigned long cpumask_t[1];

static inline void create_cpumask(cpumask_t *p)
{
*p = 0;
}
static inline void free_cpumask(cpumask_t *p)
{
}

#else

typedef unsigned long *cpumask_t;

static inline void create_cpumask(cpumask_t *p)
{
*p = kcalloc(..);
}

static inline void free_cpumask(cpumask_t *p)
{
kfree(*p);
}

#endif

 and now after you do this, you can just do something like

cpumask_t mycpu;

create_cpumask(mycpu);
..
free_cpumask(mycpu);

 and in between, you can use 'cpumask' as a pointer, because even when it
 is an array directly allocated on the stack, the array can always
 degenerate into a pointer by C type rules!


that is good for local variables.

for global variables, need to allocate them in some point. may need one
int cpumask_size;

cpumask_t online_cpu_map;
DEFINE_DYN_ARRAY(online_cpu_map, sizeof(unsigned long), cpumask_size,
PAGE_SIZE, NULL);

or something like that.

YH
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Linus Torvalds


On Tue, 26 Aug 2008, Rusty Russell wrote:
 
 Your workaround is very random, and that scares me.  I think a huge number of 
 CPUs needs a real solution (an actual cpumask allocator, then do something 
 clever if we come across an actual fastpath).

The thing is, the inlining thing is a separate issue.

Yes, the cpumasks were what made stack pressure so critical to begin with, 
but no, a 400-byte stack frame in a deep callchain isn't acceptable 
_regardless_ of any cpumask_t issues.

Gcc inlining is a total and utter pile of shit. And _that_ is the problem. 
I seriously think we shouldn't allow gcc to inline anything at all unless 
we tell it to. That's how it used to work, and quite frankly, that's how 
it _should_ work.

The downsides of inlining are big enough from both a debugging and a real 
code generation angle (eg stack usage like this), that the upsides 
(_somesimes_ smaller kernel, possibly slightly faster code) simply aren't 
relevant.

So the noinline was random, yes, but this is a real issue. Looking at 
checkstack output for a saner config (NR_CPUS=16), the top entries for me 
are things like

ide_generic_init [vmlinux]: 1384
idefloppy_ioctl [vmlinux]:  1208
e1000_check_options [vmlinux]:  1152
...

which are leaf functions. They are broken as hell (the e1000 is 
apparently because it builds structs on the stack that should all be 
static const, for example), but they are different from something like 
the module init sequence in that they are not going to affect anything 
else.

It would be interesting to see what -fno-default-inline does to the 
kernel. It really would get rid of a _lot_ of gcc version issues too. 
Inlining behavior of gcc has long been a problem for us.

Linus
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Adrian Bunk
On Tue, Aug 26, 2008 at 10:35:05AM -0700, Linus Torvalds wrote:
 
 
 On Tue, 26 Aug 2008, Rusty Russell wrote:
  
  Your workaround is very random, and that scares me.  I think a huge number 
  of 
  CPUs needs a real solution (an actual cpumask allocator, then do something 
  clever if we come across an actual fastpath).
 
 The thing is, the inlining thing is a separate issue.
 
 Yes, the cpumasks were what made stack pressure so critical to begin with, 
 but no, a 400-byte stack frame in a deep callchain isn't acceptable 
 _regardless_ of any cpumask_t issues.
 
 Gcc inlining is a total and utter pile of shit. And _that_ is the problem. 
 I seriously think we shouldn't allow gcc to inline anything at all unless 
 we tell it to. That's how it used to work, and quite frankly, that's how 
 it _should_ work.
 
 The downsides of inlining are big enough from both a debugging and a real 
 code generation angle (eg stack usage like this), that the upsides 
 (_somesimes_ smaller kernel, possibly slightly faster code) simply aren't 
 relevant.
...
 It would be interesting to see what -fno-default-inline does to the 
 kernel. It really would get rid of a _lot_ of gcc version issues too. 
 Inlining behavior of gcc has long been a problem for us.

I added -fno-inline-functions-called-once -fno-early-inlining to 
KBUILD_CFLAGS, and (with gcc 4.3) that increased the size of my kernel 
image by 2%.

And when David's -fwhole-program --combine will become ready the cost
of disallowing gcc to inline functions will most likely increase.

A debugging option (for better traces) to disallow gcc some inlining 
might make sense (and might even make sense for distributions to 
enable in their kernels), but when you go to use cases that require
really small kernels the cost is too high.

But if you don't trust gcc's inlining you should revert
commit 3f9b5cc018566ad9562df0648395649aebdbc5e0 that increases gcc's 
freedom regarding what to inline in 2.6.27 - what gcc 4.2 does in the 
case of the regression tracked as Bugzilla #11276 is really not funny 
(two callers - function not inlined; gcc seems to emit the function 
although both callers later get removed (or at least should be removed) 
by dead code elimination).

   Linus

cu
Adrian

-- 

   Is there not promise of rain? Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   Only a promise, Lao Er said.
   Pearl S. Buck - Dragon Seed

--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Linus Torvalds


On Tue, 26 Aug 2008, Adrian Bunk wrote:
 
 A debugging option (for better traces) to disallow gcc some inlining 
 might make sense (and might even make sense for distributions to 
 enable in their kernels), but when you go to use cases that require
 really small kernels the cost is too high.

You ignore the fact that it's really not just about debugging.

Inlining really isn't the great tool some people think it is. Especially 
not since gcc stack allocation is so horrid that it won't re-use stack 
slots etc (which I don't disagree with per se - it's _hard_ to re-use 
stack slots while still allowing code scheduling).

NOTE! I also would never claim that _our_ choices of inline are all that 
great, and we've often inlined too much or not inlined things that really 
could be inlined. But at least when a developer says inline (or forgets 
to say it), we have somebody to blame. When the compiler does insane 
things that doesn't suit us, we're just screwed.

 But if you don't trust gcc's inlining you should revert
 commit 3f9b5cc018566ad9562df0648395649aebdbc5e0 that increases gcc's 
 freedom regarding what to inline in 2.6.27

Actually, that just allows gcc to _not_ inline. Which is probably ok.

(Well, it would be ok if gcc did it well enough, it obviously has some 
problems at times).

Linus
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Linus Torvalds


On Tue, 26 Aug 2008, Adrian Bunk wrote:
 
 I added -fno-inline-functions-called-once -fno-early-inlining to 
 KBUILD_CFLAGS, and (with gcc 4.3) that increased the size of my kernel 
 image by 2%.

Btw, did you check with just -fno-inline-functions-called-once?

The -fearly-inlining decisions _should_ be mostly right. If gcc sees early 
that a function is so small (even without any constant propagation etc) 
that it can be inlined, it's probably right. 

The inline-functions-called-once thing is what causes even big functions 
to be inlined, and that's where you find the big downsides too (eg the 
stack usage).

Linus
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Mike Travis
Linus Torvalds wrote:
 
 On Mon, 25 Aug 2008, Linus Torvalds wrote:
 checkstack.pl shows these things as the top problems:

  0x80266234 smp_call_function_mask [vmlinux]:2736
  0x80234747 __build_sched_domains [vmlinux]: 2232
  0x8023523f __build_sched_domains [vmlinux]: 2232

 Anyway, the reason smp_call_function_mask and friends have such _huge_ 
 stack usages for you is that they contain a 'cpumask_t' on the stack.
 
 In fact, they contain multiple CPU-masks, each 4k-bits - 512 bytes - in 
 size. And they tend to call each other.
 
 Quite frankly, I don't think we were really ready for 4k CPU's. I'm going 
 to commit this patch to make sure others don't do that many CPU's by 
 mistake. It marks MAXCPU's as being 'broken' so you cannot select it, and 
 also limits the number of CPU's that you _can_ select to just 512.
 
 Right now, 4k cpu's is known broken because of the stack usage. I'm not 
 willing to debug more of these kinds of stack smashers, they're really 
 nasty to work with. I wonder how many other random failures these have 
 been involved with?
 
 This patch also makes the ifdef mess in Kconfig much cleaner and avoids 
 duplicate definitions by just conditionally suppressing the question and 
 giving higher defaults. 
 
 We can enable MAXSMP and raise the CPU limits some time in the future. But 
 that future is not going to be before 2.6.27 - the code simply isn't ready 
 for it.
 
 The reason I picked 512 CPU's as the limit is that we _used_ to limit 
 things to 255. So it's higher than it used to be, but low enough to still 
 feel safe. Considering that a 4k-bit CPU mask (512 bytes) _almost_ worked, 
 the 512-bit (64 bytes) masks are almost certainly fine.
 
 Still, sane people should limit their NR_CPUS to 8 or 16 or something like 
 that. Very very few people really need the pain of big NR_CPUS. Not even 
 just 512 CPU's.
 
 Travis, Ingo and Thomas cc'd, since they were involved in the original 
 commit (1184dc2ffe2c8fb9afb766d870850f2c3165ef25) that raised the limit.
 
   Linus

Hi Linus,

[Sorry for the long winded response, but I felt that sufficient background
is needed to address this issue YOMV :-)]

The need to allow distros to set NR_CPUS=4096 (and NODES_SHIFT=9) is
critical to our upcoming SGI systems using what we have been calling
UV.  This system will be capable of supporting 4096 cpu threads in a
single system image (and 16k cpus/2k nodes is right around the corner).
While obviously I cannot divulge too many details, it's sufficient to
say there are customers who not only require this extended capability,
but are extremely excited about it.

But the nature of some of these system environments is that they will
not accept a specially built kernel, but only a kernel that has been
built and certified (both from the application standpoint as well as
the security standpoint) by standard distributions.  And you probably
know how extensively these distributions test and certify for many known
defects and absolutely require that incoming source changes come from
the community supported source bases, primarily yours.

Due to the lead time required to accomplish these certifications,
the version of the distributions that will be available when this
system releases will be based on 2.6.27.  (They will allow patches
post-2.6.27-rc.final as long as those are committed in the source base.)
The two distributions that SGI supports for our customers is SLES
(SUSE Linux Enterprise Server) and RHEL (Red Hat Enterprise Linux).
[They, of course, are free to run any OS of their choosing, but SGI only
provides front line support for those two.]

I started last August to begin analyzing how to accomplish the above goals
and where exactly are the hot spots in the kernel that would require
attention.  It quickly became clear that cpumask_t and nodemask_t are
two variables that are very casually used (along with NR_CPUS), because
the assumption was that 64 was more than sufficient for an upper limit
and even extending it to 128 or 255 (254 was the maximum IPI broadcast
ID until x2apic), only added a few more bytes here and there.

I chose not to introduce too many dramatic changes and instead analyzed
every instance where cpumask_t and NR_CPUS was being used (along with the
node counterparts.)  An initial proposal was to allow the default stack
size to be increased, but this was met with a lot of objections because
of the extensive work that was done to bring it to it's current size.

So in summary, the goals of the changes that I have been making since
last October are:

1. Allow a typical distro configured kernel with NR_CPUS=4096
   and NODES_SHIFT=9 to be booted on an x86_64 system with 2GB
   of memory.  (Some thought was given to use a 512Mb laptop
   as the base, but because of other memory bloat from using a
   64-bit kernel, that was considered not very useful.)

   [Note I 

Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Jamie Lokier
Linus Torvalds wrote:
 The inline-functions-called-once thing is what causes even big functions 
 to be inlined, and that's where you find the big downsides too (eg the 
 stack usage).

That's a bit bizarre, though, isn't it?

A function which is only called from one place should, if everything
made sense, _never_ use more stack through being inlined.  Inlining
should just increase the opportunities that the called function's
local variables can share the same stack slots are the caller's dead
locals.

Whereas not inlining guarantees they occupy separate, immediately
adjacent regions of the stack, and shouldn't be increasing the total
numbers of local variables.

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Mike Travis
Ingo Molnar wrote:
 * Linus Torvalds [EMAIL PROTECTED] wrote:
 
 On Mon, 25 Aug 2008, Linus Torvalds wrote:
 checkstack.pl shows these things as the top problems:

 0x80266234 smp_call_function_mask [vmlinux]:2736
 0x80234747 __build_sched_domains [vmlinux]: 2232
 0x8023523f __build_sched_domains [vmlinux]: 2232

 Anyway, the reason smp_call_function_mask and friends have such _huge_ 
 stack usages for you is that they contain a 'cpumask_t' on the stack.
 In fact, they contain multiple CPU-masks, each 4k-bits - 512 bytes - in 
 size. And they tend to call each other.

 Quite frankly, I don't think we were really ready for 4k CPU's. I'm 
 going to commit this patch to make sure others don't do that many 
 CPU's by mistake. It marks MAXCPU's as being 'broken' so you cannot 
 select it, and also limits the number of CPU's that you _can_ select 
 to just 512.
 
 yeah, that's OK i guess - distros can still enable 4K support if they 
 wish to. Someone interested in improving the stack footprint situation 
 should dust off the max-stack-footprint tracer so that we can catch 
 these things in a more structured way.
 
 And i guess the next generation of 4K CPUs support should just get away 
 from cpumask_t-on-kernel-stack model altogether, as the current model is 
 not maintainable. We tried the on-kernel-stack variant, and it really 
 does not work reliably. We can fix this in v2.6.28.
 
   Ingo

I would be most interested in any tools to analyze call-trees and
accumulated stack usages.  My current method of using kdb is really
time consuming.

Thanks!
Mike
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Linus Torvalds


On Tue, 26 Aug 2008, Mike Travis wrote:
 
 The need to allow distros to set NR_CPUS=4096 (and NODES_SHIFT=9) is
 critical to our upcoming SGI systems using what we have been calling
 UV.

That's fine. You can do it. The default kernel will not, because it's 
clearly not safe.

I really don't care what you do to _your_ images. But I will not 
distribute a known-broken kernel, and I will not debug random stack 
overflows that happen in it.

If you want the default kernel to support 4k cores, we'll need to fix the 
stack usage.  I don't think that is impossible, but IT IS NOT GOING TO 
HAPPEN for 2.6.27.

And quite frankly, if some vendor like RedHat enables NR_CPUS=4096 by 
default, they are totally and utterly crazy.

But some SGI-specific binary that is meant for SGI machines only, and has 
been extensively tested with the setup used on SGI machines is a different 
thing.

Linus
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Linus Torvalds


On Tue, 26 Aug 2008, Jamie Lokier wrote:
 
 A function which is only called from one place should, if everything
 made sense, _never_ use more stack through being inlined.

But that's simply not true.

See the whole discussion.

The problem is that if you inline that function, the stack usage of the 
newly inlined function is now added to ALL THE OTHER paths too!

So the case we had in module loading was that yes, we had a function with 
a big stack footprint, but it was NOT in the deep path.

But by inlining it, it now moved the stack footprint up one level to 
another function, and now the big stack footprint really _was_ in the deep 
path, because the caller was involved in a much deeper chain. 

So inlining moves the code up the callchain, and that is a problem for the 
backtrace, but that's just a debugging issue. But it also moves the 
stack footprint up the callchain, and that can actually be a correctness 
issue.

Of course, a compiler doesn't _have_ to do that. A compiler _could_ have 
multiple different stack footprints for a single function, and do liveness 
analysis etc. But no sane compiler probably does that, because it's very 
painful indeed, and it's not even an issue if you aren't stack-limited 
(and being stack-limited is really just a kernel thing).

(Yeah, it can be an issue even if you have a big stack, in that you get 
worse cache behaviour, so a dense stack footprint _would_ help. But the 
complexity of stack liveness analysis is almost certainly not worth the 
relatively small gains it would get on some odd cases).

Linus
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Mike Travis
Linus Torvalds wrote:
 
 On Tue, 26 Aug 2008, Mike Travis wrote:
 The need to allow distros to set NR_CPUS=4096 (and NODES_SHIFT=9) is
 critical to our upcoming SGI systems using what we have been calling
 UV.
 
 That's fine. You can do it. The default kernel will not, because it's 
 clearly not safe.
 
 I really don't care what you do to _your_ images. But I will not 
 distribute a known-broken kernel, and I will not debug random stack 
 overflows that happen in it.
 
 If you want the default kernel to support 4k cores, we'll need to fix the 
 stack usage.  I don't think that is impossible, but IT IS NOT GOING TO 
 HAPPEN for 2.6.27.
 
 And quite frankly, if some vendor like RedHat enables NR_CPUS=4096 by 
 default, they are totally and utterly crazy.
 
 But some SGI-specific binary that is meant for SGI machines only, and has 
 been extensively tested with the setup used on SGI machines is a different 
 thing.
 
   Linus

Ok, thanks for the reply, and looking into this issue.  We will strongly
encourage our distros to base the relevant releases on 2.6.28. :-)  

[Supplying an SGI-specific kernel would not be acceptable to many of our
customers because of the certification issues I mentioned.] 

Mike
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Dave Jones
On Tue, Aug 26, 2008 at 12:09:46PM -0700, Linus Torvalds wrote:

  If you want the default kernel to support 4k cores, we'll need to fix the 
  stack usage.  I don't think that is impossible, but IT IS NOT GOING TO 
  HAPPEN for 2.6.27.
  
  And quite frankly, if some vendor like RedHat enables NR_CPUS=4096 by 
  default, they are totally and utterly crazy.

heh.  *picks through Fedora changelog*

* Thu Aug 14 2008 Dave Jones [EMAIL PROTECTED]
- Bump max cpus supported on x86-64 to 4096. Just to see what happens.

I never did get to find out unfortunatly, because of the security fiasco
in Fedora infrastructure the last week or two.
 
  But some SGI-specific binary that is meant for SGI machines only, and has 
  been extensively tested with the setup used on SGI machines is a different 
  thing.

Every extra kernel image a distro vendor ends up shipping has an associated 
cost.

* build time: It currently takes about 2 hours for a set of Fedora RPMs.
  For RHEL it'll be even worse due to the extra archs).
  Killing off -smp specific builds was a big win for us in this regard.
  Adding extra flavours is always painful.

* diskspace (distro kernels aren't small. With the associated debugging symbols,
   they take up a shitload of disk space really fast).

* Having everyone running the same kernel makes it much easier to test/debug.
  Our QA guys hate adding extra columns to their test matrix.

But yes, for this to be even remotely feasible, there has to be a negligable
performance cost associated with it, which right now, we clearly don't have.
Given that the number of people running 4096 CPU boxes even in a few years time
will still be tiny, punishing the common case is obviously absurd.

Dave

-- 
http://www.codemonkey.org.uk
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Linus Torvalds


On Tue, 26 Aug 2008, Mike Travis wrote:
 
 I would be most interested in any tools to analyze call-trees and
 accumulated stack usages.  My current method of using kdb is really
 time consuming.

Well, even just scripts/checkstack.pl is quite relevant.

The fact is, anything with a stack footprint of more than a hundred bytes 
is suspect. We _do_ have a lot of cases of several hundred bytes, and some 
of them are even very intentional.

For an example of _intentional_ and valid large stacks, look at 
do_sys_poll and do_select. They both have a big stack footprint in a 
normal kernel, and that's on purpose - it's not pretty, but they are very 
common and performance-sensitive functions, and using a big stack allows 
some basic allocations to be much cheaper by default.

Same goes for early_printk(), although I don't think the reasons are 
really very strong in that case.

Sadly, while those functions are _fairly_ high up, they aren't at the top, 
and we do have a lot of other functions that have huge stack footprints 
for totally bogus reasons. But the intentional ones are at least in the 
top ten.

But the kernel that Alan had problems with was different. The 
_intentional_ ones were way down in the noise.  do_sys_poll wasn't in the 
top ten, it was barely even in the top 50! (It was in fact #49, to be 
exact).

So look at the top ten in my kernel:

 1  ide_generic_init [vmlinux]: 1384
 2  idefloppy_ioctl [vmlinux]:  1208
 3  e1000_check_options [vmlinux]:  1152
 4  do_sys_poll [vmlinux]:  904
 5  ide_floppy_get_capacity [vmlinux]:  872
 6  do_select [vmlinux]:744
 7  early_printk [vmlinux]: 720
 8  do_task_stat [vmlinux]: 680
 9  mmc_ioctl [vmlinux]:648
10  elf_kcore_store_hdr [vmlinux]:  576

.. and in Alan's kernel:

 1  smp_call_function_mask [vmlinux]:   2736
 2  __build_sched_domains [vmlinux]:2232
 3  setup_IO_APIC_irq [vmlinux]:1616
 4  arch_setup_ht_irq [vmlinux]:1600
 5  arch_setup_msi_irq [vmlinux]:   1600
 6  __assign_irq_vector [vmlinux]:  1592
 7  move_task_off_dead_cpu [vmlinux]:   1592
 8  tick_handle_oneshot_broadcast [vmlinux]:1544
 9  store_scaling_governor [vmlinux]:   1376
10  cpuset_write_resmask [vmlinux]: 1360

That's a big difference. The top #1 in my kernel would just _barely_ be in 
the top 10 in Alan's kernel (he doesn't have it at all, because he didn't 
compile the drives I did into the kernel).

And the top three in my kernel are just because of crap code. That 
e1000_check_options thing is there just because it creates multiple 
struct e1000_option structures. I wrote an ugly but totally trivial 
patch to get it down to ~600 bytes, and it would be less if I had bothered 
to waste any more time on it.

The others are similar issues of people just didn't think.

But look at the top ones in Alan's kernel. Not only are they _much_ bigger 
than the top ones in a sane kernel, they are _all_ due to cpumask_t, I 
think.

Linus
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Jeff Garzik

Linus Torvalds wrote:
The downsides of inlining are big enough from both a debugging and a real 
code generation angle (eg stack usage like this), that the upsides 
(_somesimes_ smaller kernel, possibly slightly faster code) simply aren't 
relevant.


So the noinline was random, yes, but this is a real issue. Looking at 
checkstack output for a saner config (NR_CPUS=16), the top entries for me 
are things like


ide_generic_init [vmlinux]: 1384
idefloppy_ioctl [vmlinux]:  1208
e1000_check_options [vmlinux]:  1152
...

which are leaf functions. They are broken as hell (the e1000 is 
apparently because it builds structs on the stack that should all be 
static const, for example), but they are different from something like 
the module init sequence in that they are not going to affect anything 
else.



e1000_check_options builds a struct (singular) on the stack, really... 
struct e1000_option is reasonably small.


The problem, which has also shown itself in large ioctl-style switch{} 
statements, is that gcc will generate code such that the stack usage 
from independent code branches


if {cond1} {
char buster1[1000];
foo(buster1);
} else if (cond2) {
char buster2[1000];
foo(buster2);
}

are added together, not noticed as mutually exclusive.

Of course, adding 'static const' as you noted is a reasonable 
workaround, but gcc is really annoying WRT stack allocation in this manner.


I had problems in the past, before struct ethtool_ops, with like ethtool 
ioctl switch statements using gobs of stack.  In fact, that was a big 
motivation for struct ethtool_ops.


Jeff


--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Mike Travis
Dave Jones wrote:
...
 
 But yes, for this to be even remotely feasible, there has to be a negligable
 performance cost associated with it, which right now, we clearly don't have.
 Given that the number of people running 4096 CPU boxes even in a few years 
 time
 will still be tiny, punishing the common case is obviously absurd.
 
   Dave
 

I did do some fairly extensive benchmarking between configs of NR_CPUS = 128 and
4096 and most performance hits were in the neighborhood of  5% on systems with
8 cpus and 4GB of memory (our most common test system).  [But changing 
cpumask_t's
to be pointers instead of values will likely increase this.]  I've tried to be
very sensitive to this issue with all my previous changes, so convincing the 
distros
to set NR_CPUS=4096 would be as painless for them as possible. ;-)

Btw, huge count cpu systems I don't think are that far away.  I believe the 
nextgen
Larabbee chips will be geared towards HPC applications [instead of just GFX 
apps],
and putting 4 of these chips on a motherboard would add up to 512 cpu threads 
(1024
if they support hyperthreading.)

Thanks,
Mike
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: e1000 horridness (was Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected)

2008-08-26 Thread Kok, Auke
Linus Torvalds wrote:
 
 On Tue, 26 Aug 2008, Jeff Garzik wrote:
 e1000_check_options builds a struct (singular) on the stack, really... struct
 e1000_option is reasonably small.
 
 No it doesn't.
 
 Look a bit more closely.
 
 It builds a struct (singular) MANY MANY times. It also then builds up a 
 huge e1000_opt_list[] array, even though it is const and should be static 
 (and const).
 
 I know. I wrote a patch to FIX it.

totally cool patch afaics - if I still maintained this driver I'd have this 
tested
and merged right away :)

I suppose Jeff Kirsher is already doing so right now.

I suppose that he'll have to look at the other Intel ethernet drivers as well :)

Jeff, please add my:

Reveiewed-by: Auke Kok [EMAIL PROTECTED]

Cheers,

Auke

 
 Here's the patch. It shrinks the stack from 1152 bytes to 192 bytes (the 
 first version, that only did the e1000_option part, got it down to 600 
 bytes). About half comes from not using multiple e1000_option 
 structures, the other half comes from turning the e1000_opt_list[] 
 arrays into static const instead, so that gcc doesn't copy them onto the 
 stack.
 
 Most of the patch is actually doing things like turning
 
   struct struct e1000_option opt = {
 
 (which declares a _new_ e1000_option variable each time) into
 
   opt = (struct e1000_option) {
 
 which just re-uses the single variable.
 
 It becomes slightly larger than that, because some places the opt = .. 
 had to be moved around, since it's no longer a variable declaration, but a 
 regular assignment.
 
 The rest is just adding const to the right places, and turning
 
   struct e1000_opt_list speed_list[] = ..
 
 into
 
   static const struct e1000_opt_list speed_list[] = ..
 
 instead, and fixing the indentation to be more straightforward.
 
 I have not tested the dang thing, but I think it's correct. And it turns 
 stack usage from totally horrible and broken into pretty reasonable.
 
   Linus
 
 ---
  drivers/net/e1000/e1000_param.c |   81 +-
  1 files changed, 45 insertions(+), 36 deletions(-)
 
 diff --git a/drivers/net/e1000/e1000_param.c b/drivers/net/e1000/e1000_param.c
 index b9f90a5..213437d 100644
 --- a/drivers/net/e1000/e1000_param.c
 +++ b/drivers/net/e1000/e1000_param.c
 @@ -208,7 +208,7 @@ struct e1000_option {
   } r;
   struct { /* list_option info */
   int nr;
 - struct e1000_opt_list { int i; char *str; } *p;
 + const struct e1000_opt_list { int i; char *str; } *p;
   } l;
   } arg;
  };
 @@ -242,7 +242,7 @@ static int __devinit e1000_validate_option(unsigned int 
 *value,
   break;
   case list_option: {
   int i;
 - struct e1000_opt_list *ent;
 + const struct e1000_opt_list *ent;
  
   for (i = 0; i  opt-arg.l.nr; i++) {
   ent = opt-arg.l.p[i];
 @@ -279,7 +279,9 @@ static void e1000_check_copper_options(struct 
 e1000_adapter *adapter);
  
  void __devinit e1000_check_options(struct e1000_adapter *adapter)
  {
 + struct e1000_option opt;
   int bd = adapter-bd_number;
 +
   if (bd = E1000_MAX_NIC) {
   DPRINTK(PROBE, NOTICE,
  Warning: no configuration for board #%i\n, bd);
 @@ -287,19 +289,21 @@ void __devinit e1000_check_options(struct e1000_adapter 
 *adapter)
   }
  
   { /* Transmit Descriptor Count */
 - struct e1000_option opt = {
 + struct e1000_tx_ring *tx_ring = adapter-tx_ring;
 + int i;
 + e1000_mac_type mac_type = adapter-hw.mac_type;
 +
 + opt = (struct e1000_option) {
   .type = range_option,
   .name = Transmit Descriptors,
   .err  = using default of 
   __MODULE_STRING(E1000_DEFAULT_TXD),
   .def  = E1000_DEFAULT_TXD,
 - .arg  = { .r = { .min = E1000_MIN_TXD }}
 + .arg  = { .r = {
 + .min = E1000_MIN_TXD,
 + .max = mac_type  e1000_82544 ? E1000_MAX_TXD : 
 E1000_MAX_82544_TXD
 + }}
   };
 - struct e1000_tx_ring *tx_ring = adapter-tx_ring;
 - int i;
 - e1000_mac_type mac_type = adapter-hw.mac_type;
 - opt.arg.r.max = mac_type  e1000_82544 ?
 - E1000_MAX_TXD : E1000_MAX_82544_TXD;
  
   if (num_TxDescriptors  bd) {
   tx_ring-count = TxDescriptors[bd];
 @@ -313,19 +317,21 @@ void __devinit e1000_check_options(struct e1000_adapter 
 *adapter)
   tx_ring[i].count = tx_ring-count;
   }
   { /* Receive Descriptor Count */
 - struct e1000_option opt = {
 + struct e1000_rx_ring *rx_ring = adapter-rx_ring;
 + 

Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Adrian Bunk
On Tue, Aug 26, 2008 at 11:40:10AM -0700, Linus Torvalds wrote:
 
 
 On Tue, 26 Aug 2008, Adrian Bunk wrote:
  
  A debugging option (for better traces) to disallow gcc some inlining 
  might make sense (and might even make sense for distributions to 
  enable in their kernels), but when you go to use cases that require
  really small kernels the cost is too high.
 
 You ignore the fact that it's really not just about debugging.

I had in mind that we anyway have to support it for tiny kernels.

I simply don't see that we add kconfig options for 5kB of code for
tiny kernels but remove something like this that can cause size
increases  1%.

 Inlining really isn't the great tool some people think it is. Especially 
 not since gcc stack allocation is so horrid that it won't re-use stack 
 slots etc (which I don't disagree with per se - it's _hard_ to re-use 
 stack slots while still allowing code scheduling).

gcc's stack allocation has become better
(that's why we disable unit-at-a-time only for gcc 3.4 on i386).

 NOTE! I also would never claim that _our_ choices of inline are all that 
 great, and we've often inlined too much or not inlined things that really 
 could be inlined. But at least when a developer says inline (or forgets 
 to say it), we have somebody to blame. When the compiler does insane 
 things that doesn't suit us, we're just screwed.

Most LOCs of the kernel are not written by people like you or Al Viro or 
David Miller, and the average kernel developer is unlikely to do it as 
good as gcc.

For the average driver the choice is realistically between
inline's randomly sprinkled across the driver and
no inline's, leave it to gcc.

And code evolves during the years from tiny with 1 caller to huge with 
many callers.

BTW:
I just ran checkstack on a (roughly) allyesconfig kernel, and we have a 
new driver that allocates unsigned char recvbuf[1500]; on the stack...

  But if you don't trust gcc's inlining you should revert
  commit 3f9b5cc018566ad9562df0648395649aebdbc5e0 that increases gcc's 
  freedom regarding what to inline in 2.6.27
 
 Actually, that just allows gcc to _not_ inline. Which is probably ok.
 
 (Well, it would be ok if gcc did it well enough, it obviously has some 
 problems at times).

With the gcc inline's static functions you complain about we have
4-5 years of experience.

Suddenly allowing 4 release series of gcc to ignore any inline's is a 
completely new area for us. I'd generally agree with giving gcc more 
freedom here, but I'd rather do it right by removing tons of wrong 
inline's than doing one global change hoping that it will make things 
better.

And whether the optimized inlining actually makes the kernel bigger or 
smaller depends in my experience on the .config and the gcc version.

   Linus

cu
Adrian

[1] there are some rare exceptions

-- 

   Is there not promise of rain? Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   Only a promise, Lao Er said.
   Pearl S. Buck - Dragon Seed

--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Linus Torvalds


On Tue, 26 Aug 2008, Adrian Bunk wrote:
 
 I had in mind that we anyway have to support it for tiny kernels.

I actually don't think that is true.

If we really were to decide to be stricter about it, and it makes a big 
size difference, we can probably also add a tool to warn about functions 
that really should be inline.

  Inlining really isn't the great tool some people think it is. Especially 
  not since gcc stack allocation is so horrid that it won't re-use stack 
  slots etc (which I don't disagree with per se - it's _hard_ to re-use 
  stack slots while still allowing code scheduling).
 
 gcc's stack allocation has become better
 (that's why we disable unit-at-a-time only for gcc 3.4 on i386).


I agree that it has become better. But it still absolutely *sucks*.

For example, see the patch I just posted about e1000 stack usage. Even 
though the variables were all in completely separate scopes, they all got 
individual space on the stack over the whole lifetime of the function, 
causing an explosion of stack-space. As such, gcc used 500 bytes too much 
of stack, just because it didn't re-use the stackspace.

That was with gcc-4.3.0, and no, there were hardly any inlining issues 
involevd, although it is true that inlining actually did make it slightly 
worse in that case too (but since it was essentially a leaf function, that 
had little real life impact, since there were no deep callchains below it 
to care).

So the fact is, better simply is not good enough. We still need to do 
a lot of optimizations _manually_, because gcc cannot see that it can 
re-use the stack-slots.

And sometimes those optimizations are actually performance 
pessimizations, because in order to make gcc not use all the stack at the 
same time, you simply have to break things out and force-disable inlining.

 Most LOCs of the kernel are not written by people like you or Al Viro or 
 David Miller, and the average kernel developer is unlikely to do it as 
 good as gcc.

Sure. But we do have tools. We do have checkstack.pl, it's just that it 
hasn't been an issue in a long time, so I suspect many people didn't even 
_realize_ we have it, and I certainly can attest to the fact that even 
people who remember it - like me - don't actually tend to run it all that 
often.

 For the average driver the choice is realistically between
 inline's randomly sprinkled across the driver and
 no inline's, leave it to gcc.

And neither is likely to be a big problem.

 BTW:
 I just ran checkstack on a (roughly) allyesconfig kernel, and we have a 
 new driver that allocates unsigned char recvbuf[1500]; on the stack...

Yeah, it's _way_ too easy to do bad things.

 With the gcc inline's static functions you complain about we have
 4-5 years of experience.

Sure. And most of it isn't all that great.

But I do agree that lettign gcc make more decisions is _dangerous_. 
However, in this case, at least, the decisions it makes would at least 
make for less inlining, and thus less stack space explosion.

Linus
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread David Miller
From: Mike Travis [EMAIL PROTECTED]
Date: Tue, 26 Aug 2008 12:06:18 -0700

 David Miller wrote:
  The only case that didn't work was due to a limitation in
  arch interfaces for the new generic smp_call_function() code.
  It passes a cpumask_t instead of a pointer to one via
  arch_send_call_function_ipi().
  
  But other than that, the whole sparc64 SMP stuff uses cpumask_t
  pointers only.
  
  What it comes down to is that you have to do the self cpu
  and other tests in the cross-call dispatch routines themselves,
  instead of at the top-level working on cpumask_t objects.
  
  Otherwise you have to modify cpumask_t objects and thus pluck
  them onto the stack where they take up silly amounts of space.
 
 Yes, I had proposed either modifying, or supplementing a new
 smp_call function to pass the cpumask_t as a pointer (similar
 to set_cpus_allowed_ptr.)  But an ABI change such as this was
 not well received at the time.

What it seems to come down to is that any cpumask_t not inside of
a dynamically allocated object should be marked const.

And that is something we can enforce at compile time.

Linus has just suggested dynamically allocating cpumask_t's
for such cases but I don't see that as the fix either.

Just mark them const and enforce that cpumask_t objects can only
be modified when they appear in dynamically allocated objects.

You really don't need to modify the ones that passed around functions
anyways.  The only code that wants to change bits in these things is
the cpu cross-call dispatch stuff, and that cpu choice logic can just
live where it belongs down in the cross-call dispatch code.

--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Adrian Bunk
On Tue, Aug 26, 2008 at 11:47:01AM -0700, Linus Torvalds wrote:
 
 
 On Tue, 26 Aug 2008, Adrian Bunk wrote:
  
  I added -fno-inline-functions-called-once -fno-early-inlining to 
  KBUILD_CFLAGS, and (with gcc 4.3) that increased the size of my kernel 
  image by 2%.
 
 Btw, did you check with just -fno-inline-functions-called-once?
 
 The -fearly-inlining decisions _should_ be mostly right. If gcc sees early 
 that a function is so small (even without any constant propagation etc) 
 that it can be inlined, it's probably right. 
 
 The inline-functions-called-once thing is what causes even big functions 
 to be inlined, and that's where you find the big downsides too (eg the 
 stack usage).

-fno-inline-functions-called-once alone costs me nearly 1% in code size.

And I'd expect it to become more with -fwhole-program --combine.


If you think we have too many stacksize problems I'd suggest to consider 
removing the choice of 4k stacks on i386, sh and m68knommu instead of 
using -fno-inline-functions-called-once:

Now that 32bit x86 is no longer used for extreme highend configurations 
the only serious usecase for 4k stacks are AFAIK space savings on 
embedded archs.

4k stacks have caused us much pain [1], and the cases where gcc inlined 
too much were the easy ones.

I'm not saying that I'd like removing the choice of 4k stacks, but if we 
want to reduce the number of stack related problems that's IMHO the 
better alternative.


   Linus

cu
Adrian

[1] AFAIR some callpaths in the kernel are still too big

BTW: In case anyone wonders about why I suggest removing 4k stacks:
 My position is that 4k stacks should either be enabled 
 unconditionally or no longer offered at all.
 And if we remove 4k stacks from 32bit x86 it's no longer 
 realistically maintainable for other architectures.

-- 

   Is there not promise of rain? Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   Only a promise, Lao Er said.
   Pearl S. Buck - Dragon Seed

--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Linus Torvalds


On Tue, 26 Aug 2008, Adrian Bunk wrote:
 
 If you think we have too many stacksize problems I'd suggest to consider 
 removing the choice of 4k stacks on i386, sh and m68knommu instead of 
 using -fno-inline-functions-called-once:

Don't be silly. That makes the problem _worse_.

We're much better off with a 1% code-size reduction than forcing big 
stacks on people. The 4kB stack option is also a good way of saying if it 
works with this, then 8kB is certainly safe.

And embedded people (the ones that might care about 1% code size) are the 
ones that would also want smaller stacks even more!

Linus
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: e1000 horridness (was Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected)

2008-08-26 Thread Jeff Kirsher
On Tue, Aug 26, 2008 at 1:14 PM, Kok, Auke [EMAIL PROTECTED] wrote:
 Linus Torvalds wrote:

 On Tue, 26 Aug 2008, Jeff Garzik wrote:
 e1000_check_options builds a struct (singular) on the stack, really... 
 struct
 e1000_option is reasonably small.

 No it doesn't.

 Look a bit more closely.

 It builds a struct (singular) MANY MANY times. It also then builds up a
 huge e1000_opt_list[] array, even though it is const and should be static
 (and const).

 I know. I wrote a patch to FIX it.

 totally cool patch afaics - if I still maintained this driver I'd have this 
 tested
 and merged right away :)

 I suppose Jeff Kirsher is already doing so right now

You suppose correctly.
.

 I suppose that he'll have to look at the other Intel ethernet drivers as well 
 :)

 Jeff, please add my:

 Reveiewed-by: Auke Kok [EMAIL PROTECTED]


Will do.

-- 
Cheers,
Jeff
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Parag Warudkar
On Tue, Aug 26, 2008 at 5:04 PM, Linus Torvalds
[EMAIL PROTECTED] wrote:

 And embedded people (the ones that might care about 1% code size) are the
 ones that would also want smaller stacks even more!

This is something I never understood - embedded devices are not going
to run more than a few processes and 4K*(Few Processes)
 IMHO is not worth a saving now a days even in embedded world given
falling memory prices. Or do I misunderstand?

Parag
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread David VomLehn

Parag Warudkar wrote:

On Tue, Aug 26, 2008 at 5:04 PM, Linus Torvalds
[EMAIL PROTECTED] wrote:


And embedded people (the ones that might care about 1% code size) are the
ones that would also want smaller stacks even more!


This is something I never understood - embedded devices are not going
to run more than a few processes and 4K*(Few Processes)
 IMHO is not worth a saving now a days even in embedded world given
falling memory prices. Or do I misunderstand?


Embedded applications span a huge range of sizes, from the very small devices to 
which you refer, to quite complex devices. The cable settop boxes we develop have 
over a hundred interrupt sources, typically run 250-300 threads, and have 192+ 
MiB of memory. For all that, we are very cost sensitive and are under constant 
pressure to come up with reliable ways to save memory.



Parag

--
David VomLehn


--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Adrian Bunk
On Tue, Aug 26, 2008 at 04:00:33PM -0700, David VomLehn wrote:
 Parag Warudkar wrote:
 On Tue, Aug 26, 2008 at 5:04 PM, Linus Torvalds
 [EMAIL PROTECTED] wrote:

 And embedded people (the ones that might care about 1% code size) are the
 ones that would also want smaller stacks even more!

 This is something I never understood - embedded devices are not going
 to run more than a few processes and 4K*(Few Processes)
  IMHO is not worth a saving now a days even in embedded world given
 falling memory prices. Or do I misunderstand?

 Embedded applications span a huge range of sizes, from the very small 
 devices to which you refer, to quite complex devices. The cable settop 
 boxes we develop have over a hundred interrupt sources, typically run 
 250-300 threads, and have 192+ MiB of memory. For all that, we are very 
 cost sensitive and are under constant pressure to come up with reliable 
 ways to save memory.

As you say correctly the term embedded gets used for many different 
devices.

And if you have 192+ MiB of memory you have so much that all these 
kernel size discussions don't really matter.

 David VomLehn

cu
Adrian

-- 

   Is there not promise of rain? Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   Only a promise, Lao Er said.
   Pearl S. Buck - Dragon Seed

--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Linus Torvalds


On Wed, 27 Aug 2008, Adrian Bunk wrote:
  
  We're much better off with a 1% code-size reduction than forcing big 
  stacks on people. The 4kB stack option is also a good way of saying if it 
  works with this, then 8kB is certainly safe.
 
 You implicitely assume both would solve the same problem.

I'm just saying that your logic doesn't hold water.

If we can save kernel stack usage, then a 1% increase in kernel size is 
more than worth it.

 While 4kB stacks are something we anyway never got 100% working

What? Don't be silly. 

Linux _historically_ always used 4kB stacks.

No, they are likely not usable on x86-64, but dammit, they should be more 
than usable on x86-32 still.

 But I do not think the problem you'd solve with 
 -fno-inline-functions-called-once is big enough to warrant the size 
 increase it causes.

You continually try to see the inlining as a single solution to one 
problem (debuggability, stack, whatever).

The biggest problem with gcc inlining has always been that it has been 
_unpredictable_. It causes problems in many different ways. It has caused 
stability issues due to gcc versions doing random things. It causes the 
stack expansion. It makes stack traces harder for debugging, etc.

If it was any one thing, I wouldn't care. But it's exactly the fact that 
it causes all these problems in different areas.

Linus
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Adrian Bunk
On Tue, Aug 26, 2008 at 04:51:52PM -0700, Linus Torvalds wrote:
 
 
 On Wed, 27 Aug 2008, Adrian Bunk wrote:
   
   We're much better off with a 1% code-size reduction than forcing big 
   stacks on people. The 4kB stack option is also a good way of saying if 
   it 
   works with this, then 8kB is certainly safe.
  
  You implicitely assume both would solve the same problem.
 
 I'm just saying that your logic doesn't hold water.
 
 If we can save kernel stack usage, then a 1% increase in kernel size is 
 more than worth it.

From some tests the size increase seems to become bigger for smaller 
kernels, but I don't have any really good data.


An interesting question is why most of our architectures for embedded 
devices only offer bigger stacks:

The only architectures offering a 4kB stacks option are:
- m68knommu
- sh
- 32bit x86

The following architectures that are used in embedded devices 
always use 8kB stacks (or bigger) in your tree:
- arm
- avr32
- blackfin
- cris
- frv
- h8300
- m32r
- m68k
- mips
- mn10300 (has an #ifdef CONFIG_4KSTACKS but no kconfig option)
- powerpc
- xtensa


  While 4kB stacks are something we anyway never got 100% working
 
 What? Don't be silly. 
 
 Linux _historically_ always used 4kB stacks.
 
 No, they are likely not usable on x86-64, but dammit, they should be more 
 than usable on x86-32 still.


When did we get callpaths like like nfs+xfs+md+scsi reliably 
working with 4kB stacks on x86-32?


...
   Linus

cu
Adrian

-- 

   Is there not promise of rain? Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   Only a promise, Lao Er said.
   Pearl S. Buck - Dragon Seed

--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Linus Torvalds


On Wed, 27 Aug 2008, Adrian Bunk wrote:
 
 When did we get callpaths like like nfs+xfs+md+scsi reliably 
 working with 4kB stacks on x86-32?

XFS may never have been usable, but the rest, sure.

And you seem to be making this whole argument an excuse to SUCK, adn an 
excuse to let gcc crap even more on our stack space.

Why?

Why aren't you saying that we should be able to do better? Instead, you 
seem to asking us to do even worse than we do now?

Linus
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Parag Warudkar
On Tue, Aug 26, 2008 at 8:53 PM, Greg Ungerer [EMAIL PROTECTED] wrote:

 I have some simple devices (network access/routers) with 8MB of RAM,
 at power up not really being configured to do anything running 25
 processes. (Heck there is over 10 kernel processes running!). Configure
 some interfaces and services and that will easily push past 40.
 I'd be happy with a 160k saving :-)


So you really need to run all 25 processes on that 8Mb box?
(For reference even the NGW100 development board comes with 16Mb RAM).

Even if you do need those all 25 processes on the 8Mb box, fixing the
memory usage of those user space hogs is lot better than trying to
save 160Kb in kernel stacks.
Last I looked, user space wasn't particularly frugal with memory usage.

Parag
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Linus Torvalds


On Tue, 26 Aug 2008, Parag Warudkar wrote:

 And although you said in your later reply that Linux x86 with 4K
 stacks should be more than usable - my experiences running a untainted
 desktop/file server with 4K stack have been always disastrous XFS or
 not.  It _might_ work for some well defined workloads but you would
 not want to risk 4K stacks otherwise.

Umm. How long?

4kB used to be the _only_ choice. And no, there weren't even irq stacks. 
So that 4kB was not just the whole kernel call-chain, it was also all the 
irq nesting above it.

And yes, we've gotten much worse over time, and no, I can't really suggest 
going back to that in general. The code bloat has certainly been 
accompanied by a stack bloat too.

But part of it is definitely gcc. Some versions of gcc used to be 
absolutely _horrid_ when it came to stack usage, especially with some 
flags, and especially with the crazy inlining that module-at-a-time 
caused.

But I'd be really happy if some embedded people tried to take some of that 
bloat back, and aim for 4kB stacks. Because it's definitely not 
unrealistic. At least it _shouldn't_ be. And a lot of the cases of us 
having structures on the stack is actually not worth it, and tends to be 
about being lazy rather than anything else.

Linus
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-26 Thread Linus Torvalds


On Tue, 26 Aug 2008, Parag Warudkar wrote:
 
 What about deep call chains? The problem with the uptake of 4K stacks
 seems to be that is not reliably provable that it will work under all
 circumstances.

Umm. Neither is 8k stacks. Nobody proved anything.

But yes, some subsystems have insanely deep call chains. And yes, things 
like the VFS recursion (for symlinks) makes that deeper yet for 
filesystems, although only on the lookup path. And that is exactly the 
kind of thing that can exacerbate the problem of the compiler artificially 
making for a bigger stack footprint of a function (*).

For things like the VFS layer, right now we allow a nesting level of 8, I 
think. If I remember correctly, it was 5 historically. Part of raising 
that depth, though, was that we actually moved the recursive part into 
fs/namei.c, and the nesting stack-depth was something pretty damn small 
when the filesystem used follow_link properly and let the VFS do it for 
it (ie the callchain to actually look up the link could be deep, but it 
would not recurse back, and instead just return a pointer, so that the 
actual _recursive_ part was just __do_follow_link() and is just a few 
words on the stack).

So yes, we do have some deep callchains, but they tend to be pretty well 
managed for _good_ code. The problems tend to be the areas with lots of 
indirection layers, and yeah, XFS, MD and ACPI all have those kinds of 
things.

In an embdedded world, many of those should be a non-issue, though. 

Linus

(*) ie the function that _is_ on the deep chain doesn't actually need much 
of a stack footprint at all itself, but it may call a helper function that 
is _not_ in the deep chain, and if it gets inlined it may give its 
excessive stack footprint to the deep chain - and this is _exactly_ the 
problem that happened with inlining load_module().
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-25 Thread Alan D. Brunelle
Linus Torvalds wrote:
 
 On Sat, 23 Aug 2008, Linus Torvalds wrote:
 This one makes no sense. It's triggering a BUG_ON(in_interrupt()), but 
 then the call chain shows that there is no interrupt going on.
 
 Ahh, later in that thread there's another totally unrelated oops in 
 debug_mutex_add_waiter().
 
 I'd guess that it is really wild pointer corrupting memory, quite possibly 
 due to a double free or something like that. Alan - it would be good to 
 run with DEBUG_PAGE_ALLOC and SLUB debugging etc if you don't already do 
 that?
 
   Linus
 

I'll add those in - as to the repeatability: The bad kernels seem to
repeat quite reliably - not only in terms of counts (5 or 6 times in a
row before trying something else), but also in terms of the what -
either the original issue () or the other kernel with the later issue
(debug_mutex_add_waiter). That's /goodness/ in that it should help
narrow it down.

I'll make sure the kernel is still failing this morning, and then add in
DEBUG_PAGE_ALLOC and if that doesn't help, SLUB debugging...

Alan
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-25 Thread Alan D. Brunelle
Arjan van de Ven wrote:

 
 Wonder what gcc is in use?
 (newer ones tend to be a ton better... but maybe Alex is using a really
 old one)

I'm running Ubuntu 8.04 w/ gcc:

gcc (GCC) 4.2.3 (Ubuntu 4.2.3-2ubuntu7)

Alan
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-25 Thread Alan D. Brunelle
Before adding any more debugging, this is the status of my kernel boots:
3 times in a row w/ this same error. (Primary problem is the same,
secondary stacks differ of course.)

Alan
Loading, please [6.482953] busybox used greatest stack depth: 4840 bytes 
left
wait...
[6.521876] all_generic_ide used greatest stack depth: 4784 bytes left
Begin: Loading essential drivers... ...
[6.625509] fuse init (API version 7.9)
[6.625509] modprobe used greatest stack depth: 1720 bytes left
[6.644854] ACPI: SSDT CFFD0D0A, 08C4 (r1 HPQOEM  CPU_TM21 MSFT  
10E)
[6.651489] BUG: unable to handle kernel NULL pointer dereference at 
0858
[6.655631] IP: [8025e302] debug_mutex_add_waiter+0x32/0x80
[6.655631] PGD 21a0a4067 PUD 21a4bd067 PMD 0
[6.655631] Oops: 0002 [1] SMP
[6.655631] CPU 1
[6.655631] Modules linked in: processor(+) fan thermal_sys fuse
[6.655631] Pid: 1259, comm: modprobe Not tainted 2.6.27-rc3 #29
[6.655631] RIP: 0010:[8025e302]  [8025e302] 
debug_mutex_add_waiter+0x32/0x80
[6.655631] RSP: 0018:88021a4e7998  EFLAGS: 00010002
[6.655631] RAX:  RBX: 88021a4e79d8 RCX: 
[6.655631] RDX: 0001 RSI: 88021a4e79d8 RDI: a0091a60
[6.655631] RBP: 88021a4e79b8 R08: 811deff0 R09: 8800a6fdb000
[6.655631] R10: a008f524 R11:  R12: a0091a60
[6.655631] R13: 88021a4e6000 R14: 88021a9c40a0 R15: a0091a98
[6.655631] FS:  7f233f11d6e0() GS:88022fc02a00() 
knlGS:
[6.655631] CS:  0010 DS:  ES:  CR0: 8005003b
[6.655631] CR2: 0858 CR3: 00021a07e000 CR4: 06e0
[6.655631] DR0:  DR1:  DR2: 
[6.655631] DR3:  DR6: 0ff0 DR7: 0400
[6.655631] Process modprobe (pid: 1259, threadinfo 88021a4e6000, task 
88021a9c40a0)
[6.655631] Stack:   a0091a60 0246 
a008f524
[6.655631]  88021a4e7a38 8049f596 a008f524 
a0091a18
[6.655631]  88021a4e79d8 88021a4e79d8  

[6.655631] Call Trace:
[6.655631]  [a008f524] ? get_idr+0x44/0xa0 [thermal_sys]
[6.655631]  [8049f596] mutex_lock_nested+0xa6/0x250
[6.655631]  [a008f524] ? get_idr+0x44/0xa0 [thermal_sys]
[6.655631]  [803635c4] ? idr_pre_get+0x44/0x90
[6.655631]  [a008f524] get_idr+0x44/0xa0 [thermal_sys]
[6.655631]  [a008fe43] thermal_cooling_device_register+0x83/0x250 
[thermal_sys]
[6.655631]  [a019b2a3] acpi_processor_start+0x64b/0x774 
[processor]
[6.655631]  [8031a94b] ? __sysfs_add_one+0x6b/0xa0
[6.655631]  [8031ba3c] ? sysfs_do_create_link+0xbc/0x150
[6.655631]  [803a7f5e] acpi_start_single_object+0x2d/0x52
[6.655631]  [803a9556] acpi_device_probe+0x7e/0x92
[6.655631]  [803dd3eb] driver_probe_device+0x9b/0x1a0
[6.655631]  [803dd576] __driver_attach+0x86/0x90
[6.655631]  [803dd4f0] ? __driver_attach+0x0/0x90
[6.655631]  [803dc93d] bus_for_each_dev+0x5d/0x90
[6.655631]  [803dd22c] driver_attach+0x1c/0x20
[6.655631]  [803dcf79] bus_add_driver+0x1e9/0x260
[6.655631]  [a0222000] ? acpi_processor_init+0x0/0x107 [processor]
[6.655631]  [803dd74f] driver_register+0x5f/0x140
[6.655631]  [a0222000] ? acpi_processor_init+0x0/0x107 [processor]
[6.655631]  [803a9866] acpi_bus_register_driver+0x3e/0x40
[6.655631]  [a0222094] acpi_processor_init+0x94/0x107 [processor]
[6.655631]  [80209040] _stext+0x40/0x180
[6.655631]  [802a8911] ? __vunmap+0xa1/0x110
[6.655631]  [802676c2] sys_init_module+0x142/0x1dc0
[6.655631]  [80367b16] ? __up_read+0x46/0xb0
[6.655631]  [8048e570] ? cpu_down+0x0/0x70
[6.655631]  [8020c34b] system_call_fastpath+0x16/0x1b
[6.655631]
[6.655631]
[6.655631] Code: 20 48 89 5d e8 4c 89 65 f0 48 89 f3 4c 89 6d f8 8b 47 08 
49 89 d5 49 89 fc 89 c2 25 ff ff 00 00 c1 ea 10 39 c2 74 1d 49 8b 4
[6.655631] RIP  [8025e302] debug_mutex_add_waiter+0x32/0x80
[6.655631]  RSP 88021a4e7998
[6.655631] CR2: 0858
[6.655631] ---[ end trace 8bbd31df1403e48e ]---
[7.024992] modprobe used greatest stack depth: 408 bytes left
[7.030988] BUG: unable to handle kernel NULL pointer dereference at 
0048
[7.031053] IP: [8023f39c] do_exit+0x28c/0xa10
[7.031053] PGD 0
[7.031053] Oops:  [2] SMP
[7.031053] CPU 1
[7.031053] Modules linked in: processor(+) fan thermal_sys fuse
[7.031053] Pid: 1259, comm: modprobe Tainted: G  D   2.6.27-rc3 

Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-25 Thread Alan D. Brunelle
Linus Torvalds wrote:
 
 On Sat, 23 Aug 2008, Linus Torvalds wrote:
 This one makes no sense. It's triggering a BUG_ON(in_interrupt()), but 
 then the call chain shows that there is no interrupt going on.
 
 Ahh, later in that thread there's another totally unrelated oops in 
 debug_mutex_add_waiter().
 
 I'd guess that it is really wild pointer corrupting memory, quite possibly 
 due to a double free or something like that. Alan - it would be good to 
 run with DEBUG_PAGE_ALLOC and SLUB debugging etc if you don't already do 
 that?

With /just/ DEBUG_PAGE_ALLOC defined, I have seen two general panic types:

o  A new double fault w/ SMP_DEBUG_PAGEALLOC problem (prob4.txt)

o  The NULL pointer dereference @ 0x858 (prob4a.txt)

Enabling SLUB debugging to see what that shows

Alan
Begin: Loading essential drivers... ...
[6.680626] fuse init (API version 7.9)
[6.680626] modprobe used greatest stack depth: 1720 bytes left
[6.704224] double fault:  [1] SMP DEBUG_PAGEALLOC
[6.704224] CPU 1
[6.704224] Modules linked in: processor(+) fan thermal_sys fuse
[6.710629] Pid: 1259, comm: modprobe Not tainted 2.6.27-rc3 #30
[6.710629] RIP: 0010:[802214dc]  [802214dc] 
flat_send_IPI_allbutself+0x2c/0x80
[6.710629] RSP: 0018:88021a513ff8  EFLAGS: 00010282
[6.710629] RAX: 805f5520 RBX: 88021a5141f8 RCX: 003f
[6.710629] RDX: 0200 RSI: 80bf7920 RDI: 88021a5141f8
[6.710629] RBP: 88021a514408 R08: 0040 R09: 0040
[6.710629] R10: 1000 R11:  R12: 00fc
[6.710629] R13: 88021a514eb8 R14: 0001 R15: 8021cba0
[6.710629] FS:  7f39acc136e0() GS:88022fc81a00() 
knlGS:
[6.710629] CS:  0010 DS:  ES:  CR0: 8005003b
[6.710629] CR2: 88021a513fe8 CR3: 00021a469000 CR4: 06e0
[6.710629] DR0:  DR1:  DR2: 
[6.710629] DR3:  DR6: 0ff0 DR7: 0400
[6.710629] Process modprobe (pid: 1259, threadinfo 88021a514000, task 
88021a9da050)
[6.710629] Stack: 1BUG: unable to handle kernel paging request at 
88021a513ff8
[6.710629] IP: [8020dd22] show_stack_log_lvl+0x82/0x130
[6.710629] PGD 202063 PUD 10067 PMD 22f176163 PTE 80021a513160
[6.710629] Oops:  [2] SMP DEBUG_PAGEALLOC
[6.710629] CPU 1
[6.710629] Modules linked in: processor(+) fan thermal_sys fuse
[6.710629] Pid: 1259, comm: modprobe Not tainted 2.6.27-rc3 #30
[6.710629] RIP: 0010:[8020dd22]  [8020dd22] 
show_stack_log_lvl+0x82/0x130
[6.710629] RSP: 0018:88022f12ce28  EFLAGS: 00010046
[6.710629] RAX: 88022fc81a00 RBX: 88021a513ff8 RCX: 000c
[6.710629] RDX: 88021a513ff8 RSI: 88022f12cf58 RDI: 
[6.710629] RBP: 88022f12ce78 R08: 8059aea9 R09: 0001
[6.710629] R10:  R11:  R12: 
[6.710629] R13: 88022f127fc0 R14: 88022f12bfc0 R15: 
[6.710629] FS:  7f39acc136e0() GS:88022fc81a00() 
knlGS:
[6.710629] CS:  0010 DS:  ES:  CR0: 8005003b
[6.710629] CR2: 88021a513ff8 CR3: 00021a469000 CR4: 06e0
[6.710629] DR0:  DR1:  DR2: 
[6.710629] DR3:  DR6: 0ff0 DR7: 0400
[6.710629] Process modprobe (pid: 1259, threadinfo 88021a514000, task 
88021a9da050)
[6.710629] Stack:  88022f12ce78 8059aea9 88021a514408 
88022f12cf58
[6.710629]  88021a513ff8 88021a9da050  
88022f12cf58
[6.710629]  0040 002b 88022f12ceb8 
8020dea8
[6.710629] Call Trace:
[6.710629]  #DF  [8020dea8] show_registers+0xd8/0x260
[6.710629]  [8021cba0] ? do_flush_tlb_all+0x0/0x40
[6.710629]  [804a2083] __die+0xa3/0x120
[6.710629]  [8020e073] die+0x43/0x90
[6.710629]  [8020e1e3] do_double_fault+0x63/0x70
[6.710629]  [8020d4fd] double_fault+0x7d/0x90
[6.710629]  [8021cba0] ? do_flush_tlb_all+0x0/0x40
[6.710629]  [802214dc] ? flat_send_IPI_allbutself+0x2c/0x80
[6.710629]  EOE
[6.710629]
[6.710629] Code: 55 d0 85 c9 48 89 d3 7e 5a 45 31 e4 eb 44 4c 39 f3 77 44 
66 0f 1f 44 00 00 74 7e 45 85 e4 74 0b 41 f6 c4 03 0f 1f 44 00 00 7
[6.720629] RIP  [8020dd22] show_stack_log_lvl+0x82/0x130
[6.720629]  RSP 88022f12ce28
[6.720629] CR2: 88021a513ff8
[6.720629] ---[ end trace d4fdf12ff3e07cc3 ]---
[7.052961] modprobe used greatest stack depth: 920 bytes left
Killed
[6.551876] all_generic_ide used greatest stack depth: 4784 

Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-25 Thread Alan D. Brunelle
Adding in SLUB debugging doesn't show anything new (I think). Example
boot log (w/ initcall_debug enabled) is at:

http://free.linux.hp.com/~adb/bug.11342/prob5.txt

This has happened 3 times in a row as well. Whilst this is being looked
at, I'm going to fast-forward ahead to the latest in Linus' tree, and
see if the problem is still occurring (I think Linus' point earlier
about some sort of rogue timing and/or corruption bug is spot on, but
it's probably better to see how close to today's tree I can reproduce
this). I'll also try kernels w/ the problematic merge patch backed out
to see if that still fixes (or more likely(?) just patches over the
real problem).

Alan
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-25 Thread Alan D. Brunelle
I built a kernel @

commit 83097aca8567a0bd593534853b71fe0fa9a75d69
Author: Arjan van de Ven [EMAIL PROTECTED]
Date:   Sat Aug 23 21:45:21 2008 -0700

And it fails like the others do

o  http://free.linux.hp.com/~adb/bug.11342/prob6.txt

SMP_DEBUG_PAGEALLOC

o  http://free.linux.hp.com/~adb/bug.11342/prob6a.txt

[7.591198] BUG: unable to handle kernel NULL pointer dereference at
0858



I then backed out /just/ the merge for

commit 1c89ac55017f982355c7761e1c912c88c941483d
Merge: 88fa08f... b1b135c...
Author: Linus Torvalds [EMAIL PROTECTED]
Date:   Tue Aug 12 08:40:19 2008 -0700

And the machine has booted fine 5 times in a row.



I've put the latest .config up at

http://free.linux.hp.com/~adb/bug.11342/config.txt


Is there /some/ way to break down the patches within the merged patch,
and I could by-hand bisect through those?

Here's what I did to take the latest tree, and back out that merge (to
get booting kernels):

git-diff
88fa08f67bee1a0c765237bdac106a32872f57d2..1c89ac55017f982355c7761e1c912c88c941483d
| patch -p1 -R
patching file Documentation/lguest/lguest.c
patching file arch/powerpc/Kconfig
patching file arch/x86/Kconfig
patching file arch/x86/mm/Makefile
patching file drivers/char/hvc_console.c
patching file drivers/lguest/page_tables.c
patching file include/linux/Kbuild
Hunk #1 succeeded at 358 (offset 2 lines).
patching file include/linux/init.h
patching file include/linux/mm.h
patching file init/main.c
patching file kernel/module.c
patching file kernel/stop_machine.c
patching file mm/Kconfig
patching file mm/util.c

Alan
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-25 Thread Linus Torvalds


On Mon, 25 Aug 2008, Alan D. Brunelle wrote:

 Before adding any more debugging, this is the status of my kernel boots:
 3 times in a row w/ this same error. (Primary problem is the same,
 secondary stacks differ of course.)

Ok, so I took a closer look, and the oops really is suggestive..

 [6.482953] busybox used greatest stack depth: 4840 bytes left

Ok, 4840 bytes left out of 8kB.

 [6.521876] all_generic_ide used greatest stack depth: 4784 bytes left

.. and this one is 4784 bytes left..

 Begin: Loading essential drivers... ...
 [6.625509] fuse init (API version 7.9)
 [6.625509] modprobe used greatest stack depth: 1720 bytes left

Uhhuh! The previous modprobe uses stack like mad.  It could be 
fuse_init() that has done it, but looking at fuse, I seriously doubt it. 
It doesn't seem to do anything particularly bad.

So something has used over 6kB of stack, and it may well be the module 
loading code itself.

The next stage is the actual oops itself:

 [6.644854] ACPI: SSDT CFFD0D0A, 08C4 (r1 HPQOEM  CPU_TM21 MSFT  
 10E)
 [6.651489] BUG: unable to handle kernel NULL pointer dereference at 
 0858

This really looks like

ti-task-blocked_on = waiter;

where ti-task is NULL. You probably have almost everything enabled in 
order to turn struct task_struct that big, but judging by your register 
state it's really an offset off a NULL pointer, not some small integer.

Now, there is no way ti-task can _possibly_ be NULL. No way.

Well, except that ti is just below the stack, and if you had a stack 
overflow that overwrote it.

So I seriously do believe that you have run out of stack. If that is true, 
then it's quite likely that with DEBUG_PAGE_ALLOC you'll actually get a 
double fault, which in turn is fairly hard to debug (you look at it wrong 
and it turns into a triple fault which is going to just reboot your 
machine immediately).

Now, the stack oveflow probably happened a few calls earlier (and just 
left your thread_info corrupted), but there is more reason to believe you 
have stack overflow and thread_info corruption later in your output:

 [7.024992] modprobe used greatest stack depth: 408 bytes left  
 [7.030988] BUG: unable to handle kernel NULL pointer dereference at 
 0048
 [7.031053] IP: [8023f39c] do_exit+0x28c/0xa10

Here there is only 408 bytes left, which is _way_ too little, but it's 
also an optimistic measure. What the stack code usage code does is to just 
see how many zeroes it can find on the stack. If you have a big stack 
frame somewhere, it's quite possible that it actually used all your stack 
and then some, but left a bunch of zeroes around.

And the do_exit() oops is simply because once the thread_info is 
corrupted, all the basic thread data structures are crap, and yes, you're 
almost guaranteed to oops at that point.

Could you make your kernel image available somewhere, and we can take a 
look at it? Some versions of gcc are total pigs when it comes to stack 
usage, and your exact configuration matters too.  But yes, module loading 
is a bad case, for me sys_init_module() contains

subq$392, %rsp  #,

which is probably mostly because of the insane inlining gcc does (ie it 
will likely have inlined every single function in that file that is only 
called once, and then it will make all local variables of all those 
functions alive over the whole function and allocate stack-space for them 
ALL AT THE SAME TIME).

Gcc sometimes drives me mad. It's inlining decisions are almost always 
pure and utter sh*t. But clearly something changed for you to start 
triggering this, and I think that also explains why you bisected things to 
the merge commit rather than to any individual change - because it was 
probably not any individual change that pushed it over the limit, but two 
different changes that made for bigger stack pressure, and _together_ they 
pushed you over the limit.

So it also explains why the merge you found had no possible merge errors 
on a source level - there were no actual clashes anywhere. Just a slow 
growth of stack that combined to something that overflowed.

And yes, I bet the change by Arjan to use do_one_initcall() was _part_ of 
it. It adds roughly 112 bytes of stack pressure to that module loading 
path, because of the 64-byte array and the extra function call (8 bytes 
for return address) with at least 5 quad-words saved (40 bytes) for 
register spills.

But there were probably other things happening too that made things worse.

So if there is some place where you can upload your 'vmlinux' binary, it 
would be good.

Linus
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-25 Thread Linus Torvalds


On Mon, 25 Aug 2008, Alan D. Brunelle wrote:
 
 With /just/ DEBUG_PAGE_ALLOC defined, I have seen two general panic types:
 
 o  A new double fault w/ SMP_DEBUG_PAGEALLOC problem (prob4.txt)

Yeah, that's a stack overflow.

Confirmed.

Linus
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-25 Thread Linus Torvalds


On Mon, 25 Aug 2008, Linus Torvalds wrote:
 
 Could you make your kernel image available somewhere, and we can take a 
 look at it? Some versions of gcc are total pigs when it comes to stack 
 usage, and your exact configuration matters too.  But yes, module loading 
 is a bad case, for me sys_init_module() contains
 
   subq$392, %rsp  #,
 
 which is probably mostly because of the insane inlining gcc does (ie it 
 will likely have inlined every single function in that file that is only 
 called once, and then it will make all local variables of all those 
 functions alive over the whole function and allocate stack-space for them 
 ALL AT THE SAME TIME).

I bet this one-liner will probably make your kernel work. It's not a full 
solution, but it will make the module-loading path lose _all_ of the above 
stack slots by just not inlining load_module() - the stack slots will 
still be used when the module is _loaded_, but by the time we actually 
callt he -init function they will have been released since it's not all 
in the same crazy function any more.

I _seriously_ believe that we were better off back when gcc only inlined 
what we told it to inline, and never inlined on its own. The gcc inlining 
logic is pure and utter sh*t in an environment like the kernel where stack 
space is a valuable resource.

Anyway, Alan, even if this solves your particular problem, I'd still like 
to see your kernel image, so that I can hunt for other problems like 
this..

Linus

---
 kernel/module.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/module.c b/kernel/module.c
index 08864d2..9db1191 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -1799,7 +1799,7 @@ static void *module_alloc_update_bounds(unsigned long 
size)
 
 /* Allocate and load the module: note that size of section 0 is always
zero, and we rely on this for optional sections. */
-static struct module *load_module(void __user *umod,
+static noinline struct module *load_module(void __user *umod,
  unsigned long len,
  const char __user *uargs)
 {
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-25 Thread Alan D. Brunelle
Linus Torvalds wrote:
 
 On Mon, 25 Aug 2008, Linus Torvalds wrote:
 Could you make your kernel image available somewhere, and we can take a 
 look at it? Some versions of gcc are total pigs when it comes to stack 
 usage, and your exact configuration matters too.  But yes, module loading 
 is a bad case, for me sys_init_module() contains

  subq$392, %rsp  #,

 which is probably mostly because of the insane inlining gcc does (ie it 
 will likely have inlined every single function in that file that is only 
 called once, and then it will make all local variables of all those 
 functions alive over the whole function and allocate stack-space for them 
 ALL AT THE SAME TIME).

Mine has:

Dump of assembler code for function sys_init_module:
0x802688c0 sys_init_module+0: push   %rbp
0x802688c1 sys_init_module+1: mov%rsp,%rbp
0x802688c4 sys_init_module+4: sub$0x1c0,%rsp
0x802688cb sys_init_module+11:mov%r12,-0x20(%rbp)
0x802688cf sys_init_module+15:mov%rdi,%r12

so 448 bytes.

The kernel is up at: http://free.linux.hp.com/~adb/bug.11342/vmlinux (if
you would let me know when you are through with it so I can free up some
space there I'd appreciate it...)

By doing the patch you provided, sys_init_module now looks like:

Dump of assembler code for function sys_init_module:
0x8026aa20 sys_init_module+0: push   %rbp
0x8026aa21 sys_init_module+1: mov%rsp,%rbp
0x8026aa24 sys_init_module+4: sub$0x20,%rsp
0x8026aa28 sys_init_module+8: mov%r14,0x18(%rsp)
0x8026aa2d sys_init_module+13:mov%rdi,%r14


So only 32 bytes. (But of course, load_module() exists, and now has
0x1d0 (464) bytes...)

With the patch you provide, I /was/ able to repeatedly boot OK (latest
tree, and I also ran the patch against the 26.27.rc3-based kernel I was
having problems with initially, and that booted OK as well).

Alan

 
 I bet this one-liner will probably make your kernel work. It's not a full 
 solution, but it will make the module-loading path lose _all_ of the above 
 stack slots by just not inlining load_module() - the stack slots will 
 still be used when the module is _loaded_, but by the time we actually 
 callt he -init function they will have been released since it's not all 
 in the same crazy function any more.
 
 I _seriously_ believe that we were better off back when gcc only inlined 
 what we told it to inline, and never inlined on its own. The gcc inlining 
 logic is pure and utter sh*t in an environment like the kernel where stack 
 space is a valuable resource.
 
 Anyway, Alan, even if this solves your particular problem, I'd still like 
 to see your kernel image, so that I can hunt for other problems like 
 this..
 
   Linus
 
 ---
  kernel/module.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)
 
 diff --git a/kernel/module.c b/kernel/module.c
 index 08864d2..9db1191 100644
 --- a/kernel/module.c
 +++ b/kernel/module.c
 @@ -1799,7 +1799,7 @@ static void *module_alloc_update_bounds(unsigned long 
 size)
  
  /* Allocate and load the module: note that size of section 0 is always
 zero, and we rely on this for optional sections. */
 -static struct module *load_module(void __user *umod,
 +static noinline struct module *load_module(void __user *umod,
 unsigned long len,
 const char __user *uargs)
  {
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/

--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-25 Thread Linus Torvalds

On Mon, 25 Aug 2008, Alan D. Brunelle wrote:
 
 Mine has:
 
 Dump of assembler code for function sys_init_module:
 0x802688c4 sys_init_module+4:   sub$0x1c0,%rsp
 
 so 448 bytes.

Yeah, your build seems to have consistently bigger stack usage, and that 
may be due to some config option, but most likely it's a compiler version 
issue.

But I think part of the reason is that you have frame pointers enabled: 
that makes the stack frames bigger not only because of the frame pointer 
save/restore, but also because you have more register pressure and thus 
spills.

 The kernel is up at: http://free.linux.hp.com/~adb/bug.11342/vmlinux (if
 you would let me know when you are through with it so I can free up some
 space there I'd appreciate it...)

I'm downloading it now, I'll probably be done by the time you get this 
email.

[ Update. Done. You can remove it ]

 By doing the patch you provided, sys_init_module now looks like:
 
 Dump of assembler code for function sys_init_module:
 0x8026aa24 sys_init_module+4:   sub$0x20,%rsp
 
 So only 32 bytes. (But of course, load_module() exists, and now has
 0x1d0 (464) bytes...)

Right - the stack usage didn't go away, but the _lifetimes_ changed.

So now load_module() will still use almost 500 bytes of stack, and it will 
call other routines that use stack too, but the lifetime of that stack 
usage is no longer over the whole module loading and initialization part, 
it's purely over just the loading thing.

And since the deep callchain came much later (in the actual -init 
routines), by the time we do that, we no longer now have the load_module 
stack usage active any more.

 With the patch you provide, I /was/ able to repeatedly boot OK (latest
 tree, and I also ran the patch against the 26.27.rc3-based kernel I was
 having problems with initially, and that booted OK as well).

I had actually already committed it, because it was correct regardless 
(and gcc really is a total ass for doing that inlining to begin with), but 
it's good to have verification that the behaviour you saw was literally 
about this thing.

I'll look at your vmlinux binary to see what else sucks from a stack depth 
standpoint, but one of the problems in this whole thing is that the 
stack usage is obviously both a static thing (with some functions using 
_way_ too much stack!) _and_ a dynamic thing (with the total stack use 
being not about any individual function, but the whole chain).

My patch obviously doesn't change the static stack usage, it just moves it 
around a bit so that it's no longer on that same deep path, so the dynamic 
stack usage is much less.

But I'll look at your vmlinux, see what stands out.

Linus
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-25 Thread Arjan van de Ven

Linus Torvalds wrote:

On Mon, 25 Aug 2008, Alan D. Brunelle wrote:

Mine has:

Dump of assembler code for function sys_init_module:
0x802688c4 sys_init_module+4:   sub$0x1c0,%rsp

so 448 bytes.


Yeah, your build seems to have consistently bigger stack usage, and that 
may be due to some config option, but most likely it's a compiler version 
issue.




I wonder if we ought to have a light version of make checkstack always run,
but in such a way that we make a file with limits on the stack usage for key
functions (and we can grow this list over time when we learn about critical 
ones)..
and either warn very loudly or even fail the build if we're way over what could 
work.

--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-25 Thread Linus Torvalds


On Mon, 25 Aug 2008, Linus Torvalds wrote:
 
 But I'll look at your vmlinux, see what stands out.

Oops. I already see the problem.

Your .config has soem _huge_ CPU count, doesn't it?

checkstack.pl shows these things as the top problems:

0x80266234 smp_call_function_mask [vmlinux]:2736
0x80234747 __build_sched_domains [vmlinux]: 2232
0x8023523f __build_sched_domains [vmlinux]: 2232
0x8021e884 setup_IO_APIC_irq [vmlinux]: 1616
0x8021ee24 arch_setup_ht_irq [vmlinux]: 1600
0x8021f144 arch_setup_msi_irq [vmlinux]:1600
0x8021e3b0 __assign_irq_vector [vmlinux]:   1592
0x8021e626 __assign_irq_vector [vmlinux]:   1592
0x8023257e move_task_off_dead_cpu [vmlinux]:1592
0x802326e8 move_task_off_dead_cpu [vmlinux]:1592
0x8025dbc5 tick_handle_oneshot_broadcast [vmlinux]:1544
0x8025dcb4 tick_handle_oneshot_broadcast [vmlinux]:1544
0x803f3dc4 store_scaling_governor [vmlinux]:1376
0x80279ef4 cpuset_write_resmask [vmlinux]:  1360
0x803f465d cpufreq_add_dev [vmlinux]:   1352
0x803f495b cpufreq_add_dev [vmlinux]:   1352
0x803f3fc4 store_scaling_max_freq [vmlinux]:1328
0x803f4064 store_scaling_min_freq [vmlinux]:1328
0x803f44c4 cpufreq_update_policy [vmlinux]: 1328
..

and sys_init_module is actually way way down the list. I bet the only 
reason it showed up at all was because dynamically it was such a deep 
callchain, and part of that callchain probably called some of those really 
nasty things.

Anyway, the reason smp_call_function_mask and friends have such _huge_ 
stack usages for you is that they contain a 'cpumask_t' on the stack.

For example, for me, usign a sane NR_CPU, the size of the stack frame for 
smp_call_function_mask is under 200 bytes.  For you, it's 2736 bytes.

How about you make CONFIG_NR_CPU's something _sane_? Like 16? Or do you 
really have four thousand CPU's in that system?

Oh, I guess you have the MAXSMP config enabled? I really think that was a 
bit too aggressive.

Linus
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-25 Thread Linus Torvalds


On Mon, 25 Aug 2008, Linus Torvalds wrote:
 
 checkstack.pl shows these things as the top problems:
 
   0x80266234 smp_call_function_mask [vmlinux]:2736
   0x80234747 __build_sched_domains [vmlinux]: 2232
   0x8023523f __build_sched_domains [vmlinux]: 2232
 
 Anyway, the reason smp_call_function_mask and friends have such _huge_ 
 stack usages for you is that they contain a 'cpumask_t' on the stack.

In fact, they contain multiple CPU-masks, each 4k-bits - 512 bytes - in 
size. And they tend to call each other.

Quite frankly, I don't think we were really ready for 4k CPU's. I'm going 
to commit this patch to make sure others don't do that many CPU's by 
mistake. It marks MAXCPU's as being 'broken' so you cannot select it, and 
also limits the number of CPU's that you _can_ select to just 512.

Right now, 4k cpu's is known broken because of the stack usage. I'm not 
willing to debug more of these kinds of stack smashers, they're really 
nasty to work with. I wonder how many other random failures these have 
been involved with?

This patch also makes the ifdef mess in Kconfig much cleaner and avoids 
duplicate definitions by just conditionally suppressing the question and 
giving higher defaults. 

We can enable MAXSMP and raise the CPU limits some time in the future. But 
that future is not going to be before 2.6.27 - the code simply isn't ready 
for it.

The reason I picked 512 CPU's as the limit is that we _used_ to limit 
things to 255. So it's higher than it used to be, but low enough to still 
feel safe. Considering that a 4k-bit CPU mask (512 bytes) _almost_ worked, 
the 512-bit (64 bytes) masks are almost certainly fine.

Still, sane people should limit their NR_CPUS to 8 or 16 or something like 
that. Very very few people really need the pain of big NR_CPUS. Not even 
just 512 CPU's.

Travis, Ingo and Thomas cc'd, since they were involved in the original 
commit (1184dc2ffe2c8fb9afb766d870850f2c3165ef25) that raised the limit.

Linus

---
 arch/x86/Kconfig |   30 --
 1 files changed, 8 insertions(+), 22 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 68d91c8..ed92864 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -577,35 +577,29 @@ config SWIOTLB
 
 config IOMMU_HELPER
def_bool (CALGARY_IOMMU || GART_IOMMU || SWIOTLB || AMD_IOMMU)
+
 config MAXSMP
bool Configure Maximum number of SMP Processors and NUMA Nodes
-   depends on X86_64  SMP
+   depends on X86_64  SMP  BROKEN
default n
help
  Configure maximum number of CPUS and NUMA Nodes for this architecture.
  If unsure, say N.
 
-if MAXSMP
-config NR_CPUS
-   int
-   default 4096
-endif
-
-if !MAXSMP
 config NR_CPUS
-   int Maximum number of CPUs (2-4096)
-   range 2 4096
+   int Maximum number of CPUs (2-512) if !MAXSMP
+   range 2 512
depends on SMP
+   default 4096 if MAXSMP
default 32 if X86_NUMAQ || X86_SUMMIT || X86_BIGSMP || X86_ES7000
default 8
help
  This allows you to specify the maximum number of CPUs which this
- kernel will support.  The maximum supported value is 4096 and the
+ kernel will support.  The maximum supported value is 512 and the
  minimum value which makes sense is 2.
 
  This is purely to save memory - each supported CPU adds
  approximately eight kilobytes to the kernel image.
-endif
 
 config SCHED_SMT
bool SMT (Hyperthreading) scheduler support
@@ -996,17 +990,10 @@ config NUMA_EMU
  into virtual nodes when booted with numa=fake=N, where N is the
  number of nodes. This is only useful for debugging.
 
-if MAXSMP
-
 config NODES_SHIFT
-   int
-   default 9
-endif
-
-if !MAXSMP
-config NODES_SHIFT
-   int Maximum NUMA Nodes (as a power of 2)
+   int Maximum NUMA Nodes (as a power of 2) if !MAXSMP
range 1 9   if X86_64
+   default 9 if MAXSMP
default 6 if X86_64
default 4 if X86_NUMAQ
default 3
@@ -1014,7 +1001,6 @@ config NODES_SHIFT
help
  Specify the maximum number of NUMA Nodes available on the target
  system.  Increases memory reserved to accomodate various tables.
-endif
 
 config HAVE_ARCH_BOOTMEM_NODE
def_bool y
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-25 Thread Christoph Lameter
Alan D. Brunelle wrote:

 I think you're right: the kernel as a whole may not be ready for 4,096
 CPUs apparently...

Mike has been working diligently on getting all these cpumasks off the stack
for the last months and has created an infrastructure to do this. So I think
we are close. It might just be a matter of merging some more patches that are
still left in Ingo's tree.

--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-23 Thread Linus Torvalds


On Sat, 23 Aug 2008, Rafael J. Wysocki wrote:
 
 The following bug entry is on the current list of known regressions
 from 2.6.26.  Please verify if it still should be listed and let me know
 (either way).
 
 
 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11342
 Subject   : Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - 
 bisected
 Submitter : Alan D. Brunelle [EMAIL PROTECTED]
 Date  : 2008-08-13 23:03 (11 days old)
 References: http://marc.info/?l=linux-kernelm=121866876027629w=4
 Handled-By: Andrew Morton [EMAIL PROTECTED]

This one makes no sense. It's triggering a BUG_ON(in_interrupt()), but 
then the call chain shows that there is no interrupt going on.

Also, the bisection is senseless - there's a trivial change wrt 
do_one_initcall() that got merged, but everything else is trivial about 
lguest and has nothing to do with the whole CPU-init thing. But if it was 
that initcall one, then git bisect woul have pointed to it, not the 
merge. And the merge itself had no conflicts or anything else going on..

The fact that it came and went later also implies that it's probably just 
some timing-dependent thing or some subtle memory corruption, making the 
bisection result even less likely to be exact.

But I'm adding Arjan and Rusty to the Cc, because that merge was takign 
Rusty's branch, and the do_one_initcall() is Arjan's commit. Since 
undoing that merge apparently does fix it, I'm wondering if something 
there just does end up triggering the problem.

The do_one_commit() thing _is_ in the path of sys_init_module(), so it 
_is_ at least somewhat relevant from an oops standpoint. 

One thing the do_one_commit() thing does is to put more pressure on the 
stack due to that whole buffer for the printk's going on.

Alan, can you try 
 - seeing how consistent it is with one kernel (ie boot a known-bad kernel 
   a few times just to see if it really is 100% consistent)
 - try enabling 'initcall_debug' on the kernel command line, to (a) see 
   the new code actually do something and (b) see what it is actually 
   calling just before.

Hmm..

Linus
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug #11342] Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected

2008-08-23 Thread Linus Torvalds


On Sat, 23 Aug 2008, Linus Torvalds wrote:
 
 This one makes no sense. It's triggering a BUG_ON(in_interrupt()), but 
 then the call chain shows that there is no interrupt going on.

Ahh, later in that thread there's another totally unrelated oops in 
debug_mutex_add_waiter().

I'd guess that it is really wild pointer corrupting memory, quite possibly 
due to a double free or something like that. Alan - it would be good to 
run with DEBUG_PAGE_ALLOC and SLUB debugging etc if you don't already do 
that?

Linus
--
To unsubscribe from this list: send the line unsubscribe kernel-testers in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html