Bug#603229: Further information

2010-11-27 Thread Ben Hutchings
On Tue, 2010-11-23 at 13:17 +0100, Frede Feuerstein wrote:
> Hi !
> 
> > This shows something about what's going wrong.  Could you please try
> > adding 'debug' to the kernel parameters?  That will show some more
> > context for these errors.
> 
> I booted 2.6.32-5 with the debug option on, and for comparison did the
> same with 2.6.30-2.
> 
> The errors concerning the power management itself are also showing up in
> 2.6.30-2.

The error message about 'domain-cpu_power' does not refer to power
management, but to the scheduler's estimation of the processing power of
each group of processor threads.

The scheduler is trying to group the processor threads by:

- NUMA node (NODE; sharing a connection to RAM)
- Package (CPU; sharing some caches)
- Core (MC; sharing execution units)

so that it can make good decisions about where a task should run when it
is ready to do so.

> But whereas 2.6.32-5 afterwards crashes with a divide error,
> 2.6.30-2 starts up normally:
[...]
> I suppose that it is the divide error in [0.852154], we have to deal
> with.
[...]

The division by zero appears to be a result of getting bad information
from the firmware about the groups of processors.  I realise that this
same bad information did not previously result in a crash, but I (and
the upstream developers) need to know what that information is before we
can understand how this can be avoided.
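
As an illustration only, here is a minimal Python sketch of how an unset
group power could end in a divide error. It assumes, and this thread does
not confirm it, that a group's load is normalised by dividing by that
group's cpu_power. The names Group, avg_load, safe_avg_load and the value
of SCHED_LOAD_SCALE are chosen for the example and are not quoted from the
kernel source.

```python
from dataclasses import dataclass

SCHED_LOAD_SCALE = 1024  # illustrative scale factor (assumption)


@dataclass
class Group:
    cpus: frozenset
    load: int
    cpu_power: int  # 0 models the "domain-cpu_power not set" case


def avg_load(group):
    """Load normalised by the group's estimated processing power."""
    # With cpu_power == 0 this raises ZeroDivisionError, the Python
    # counterpart of a hardware divide error.
    return group.load * SCHED_LOAD_SCALE // group.cpu_power


def safe_avg_load(group):
    """Same calculation, but report an unset cpu_power instead of dividing."""
    if group.cpu_power == 0:
        return None
    return avg_load(group)
```

A group with a plausible power value divides cleanly, while a group whose
power was never set cannot be normalised at all, which is why the grouping
information matters more than the division itself.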

Ben.

-- 
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.




Bug#603229: Further information

2010-11-27 Thread Frede Feuerstein
Hi !

> The error message about 'domain-cpu_power' does not refer to power
> management, but to the scheduler's estimation of the processing power of
> each group of processor threads.
> 
> The scheduler is trying to group the processor threads by:
> 
> - NUMA node (NODE; sharing a connection to RAM)
> - Package (CPU; sharing some caches)
> - Core (MC; sharing execution units)

So let's start here: On this machine NUMA node and package are identical:
CPU0 / CPU1 are one group and CPU2 / CPU3 are the other.
As with all Socket 940 Opterons, the cores are logically complete CPUs,
i.e. they do not share execution units.
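
To make the expected grouping concrete, here is a small Python sketch of
the layout described above. It is a simplified two-level model (package,
then whole system), not the kernel's actual domain construction, and the
names build_domains, check, "package" and "system" are made up for the
example. The two checks mirror the wording of the boot messages: the groups
of a domain should cover its span, and each parent span should contain the
child span.

```python
def build_domains(packages):
    """packages: list of sets of CPU ids, one set per package (= NUMA node here)."""
    all_cpus = frozenset().union(*packages)
    domains = {}
    for pkg in packages:
        for cpu in pkg:
            domains[cpu] = [
                # level name, span, groups
                ("package", frozenset(pkg), [frozenset({c}) for c in sorted(pkg)]),
                ("system", all_cpus, [frozenset(p) for p in packages]),
            ]
    return domains


def check(levels):
    """Return a list of consistency errors for one CPU's domain levels."""
    errors = []
    prev_span = None
    for name, span, groups in levels:
        covered = frozenset().union(*groups)
        if covered != span:
            errors.append(f"{name}: groups don't span domain")
        if prev_span is not None and not prev_span <= span:
            errors.append(f"{name}: span is not a superset of child span")
        prev_span = span
    return errors


# Layout as described: CPU0/CPU1 in one package, CPU2/CPU3 in the other.
domains = build_domains([{0, 1}, {2, 3}])
print(check(domains[3]))
```

For this layout the check reports no errors for any CPU, so a boot log that
complains about spans suggests the kernel was working from a different
picture of the machine.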

> so that it can make good decisions about where a task should run when it
> is ready to do so.
> 
> > But whereas 2.6.32-5 afterwards crashes with a divide error,
> > 2.6.30-2 starts up normally:
> [...]
> > I suppose that it is the divide error in [0.852154], we have to deal
> > with.
> [...]
> 
> The division by zero appears to be a result of getting bad information
> from the firmware about the groups of processors.

Well, technically a division error is always the result of bad data fed to
that division. What I meant was that this is the point from which to trace
the error back.
Though the BIOS of the w2100z is known to have some problems, it reports
the CPUs correctly and it is the latest version (R01-B5-S1).

>   I realise that this
> same bad information did not previously result in a crash, but I (and
> the upstream developers) need to know what that information is before we
> can understand how this can be avoided.

Is there any way to gather more information? Tell me and I shall do
it.

Tilo









Bug#603229: Further information

2010-11-23 Thread Frede Feuerstein
Hi !

> This shows something about what's going wrong.  Could you please try
> adding 'debug' to the kernel parameters?  That will show some more
> context for these errors.

I booted 2.6.32-5 with the debug option on, and for comparison did the
same with 2.6.30-2.

The errors concerning the power management itself are also showing up in
2.6.30-2. But whereas 2.6.32-5 afterwards crashes with a divide error,
2.6.30-2 starts up normally:

Here is the corresponding section of the 2.6.30-2 boot log (if you would
like the complete log, just tell me):

[0.692003] ERROR: groups don't span domain-span
[0.696004] CPU3 attaching sched-domain:
[0.73]  domain 0: span 2-3 level MC
[0.708003]   groups: 3 2
[0.720003]   domain 1: span 1-3 level CPU
[0.728003]    groups: 2-3 1
[0.744003]    domain 2: span 0-3 level NODE
[0.752003]     groups: 1-3 (__cpu_power = 2048)
[0.766180] ERROR: domain-cpu_power not set
[0.768003] 
[0.772003] ERROR: groups don't span domain-span
[0.776172] net_namespace: 1936 bytes
[0.780049] Booting paravirtualized kernel on bare hardware
[0.784149] regulator: core version 0.5
[0.788114] NET: Registered protocol family 16
[0.792037] node 0 link 2: io port [0, 2fff]
[0.796004] node 0 link 0: io port [3000, 3fff]
[0.84] TOM: 8000 aka 2048M
[0.804004] node 0 link 2: mmio [d000, d04f]
=
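
The excerpt above can be checked mechanically. The following Python sketch
parses the "domain N: span ... level ..." and "groups: ..." lines and
reports every domain whose groups do not cover its span. The function names
parse_cpulist and find_uncovered are made up for this example, the regular
expressions are only fitted to the lines shown here, and SAMPLE is copied
from the excerpt as it appears in this message.

```python
import re

DOMAIN_RE = re.compile(r"domain (\d+): span (\S+) level (\w+)")
GROUPS_RE = re.compile(r"groups:\s*(.*)")


def parse_cpulist(text):
    """Turn a CPU list such as '1-3' or '2' into a set of CPU numbers."""
    cpus = set()
    for part in text.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def find_uncovered(log_text):
    """Return (level, span, missing CPUs) for domains whose groups don't cover the span."""
    problems = []
    current = None
    for line in log_text.splitlines():
        m = DOMAIN_RE.search(line)
        if m:
            current = (m.group(3), parse_cpulist(m.group(2)))
            continue
        m = GROUPS_RE.search(line)
        if m and current:
            # Drop annotations such as "(__cpu_power = 2048)" before splitting.
            tokens = re.sub(r"\([^)]*\)", " ", m.group(1)).split()
            covered = set()
            for tok in tokens:
                covered |= parse_cpulist(tok)
            level, span = current
            if covered != span:
                problems.append((level, sorted(span), sorted(span - covered)))
            current = None
    return problems


SAMPLE = """\
[0.696004] CPU3 attaching sched-domain:
[0.73]  domain 0: span 2-3 level MC
[0.708003]   groups: 3 2
[0.720003]   domain 1: span 1-3 level CPU
[0.728003]groups: 2-3 1
[0.744003]domain 2: span 0-3 level NODE
[0.752003] groups: 1-3 (__cpu_power = 2048)
"""

print(find_uncovered(SAMPLE))
```

On this excerpt only the NODE level is reported: its span is 0-3 but its
single group 1-3 leaves CPU0 uncovered, which matches the "groups don't
span domain-span" message.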

And here is the complete log of 2.6.32-5 crashing:
=
bash-3.00$ tip hardwire
connected
[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Linux version 2.6.32-5-amd64 (Debian 2.6.32-27) (m...@debian.org) (gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Sat Oct 30 14:18:21 UTC 2010
[0.00] Command line: root=LABEL=S_rt ro debug console=ttyS0
[0.00] KERNEL supported cpus:
[0.00]   Intel GenuineIntel
[0.00]   AMD AuthenticAMD
[0.00]   Centaur CentaurHauls
[0.00] BIOS-provided physical RAM map:
[0.00]  BIOS-e820:  - 0009d400 (usable)
[0.00]  BIOS-e820: 0009d400 - 000a (reserved)
[0.00]  BIOS-e820: 000ce000 - 0010 (reserved)
[0.00]  BIOS-e820: 0010 - 7ff6 (usable)
[0.00]  BIOS-e820: 7ff6 - 7ff72000 (ACPI data)
[0.00]  BIOS-e820: 7ff72000 - 7ff8 (ACPI NVS)
[0.00]  BIOS-e820: 7ff8 - 8000 (reserved)
[0.00]  BIOS-e820: fec0 - fec00400 (reserved)
[0.00]  BIOS-e820: fee0 - fee01000 (reserved)
[0.00]  BIOS-e820: fff8 - 0001 (reserved)
[0.00] DMI present.
[0.00] last_pfn = 0x7ff60 max_arch_pfn = 0x4
[0.00] MTRR default type: uncachable
[0.00] MTRR fixed ranges enabled:
[0.00]   0-9 write-back
[0.00]   A-B uncachable
[0.00]   C-D1FFF write-protect
[0.00]   D2000-E7FFF uncachable
[0.00]   E8000-F write-protect
[0.00] MTRR variable ranges enabled:
[0.00]   0 base 00 mask FF8000 write-back
[0.00]   1 disabled
[0.00]   2 disabled
[0.00]   3 disabled
[0.00]   4 disabled
[0.00]   5 disabled
[0.00]   6 disabled
[0.00]   7 disabled
[0.00] x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
[0.00] initial memory mapped : 0 - 2000
[0.00] init_memory_mapping: -7ff6
[0.00]  00 - 007fe0 page 2M
[0.00]  007fe0 - 007ff6 page 4k
[0.00] kernel direct mapping tables up to 7ff6 @ 8000-c000
[0.00] RAMDISK: 375e7000 - 37fef333
[0.00] ACPI: RSDP 000f7200 00024 (v02 PTLTD )
[0.00] ACPI: XSDT 7ff6d424 00044 (v01 PTLTD  ? XSDT 0604  LTP )
[0.00] ACPI: FACP 7ff71a2a 000F4 (v03 SUNSUNmetro 0604 PTEC 000F4240)
[0.00] ACPI: DSDT 7ff6d468 0454E (v01SUNK85AE 0604 MSFT 010D)
[0.00] ACPI: FACS 7ff72fc0 00040
[0.00] ACPI: SRAT 7ff71b1e 000C8 (v01 AMDHAMMER 0604 AMD  0001)
[0.00] ACPI: APIC 7ff71be6 000AA (v01 PTLTD  ? APIC 0604  LTP )
[0.00] ACPI: SSDT 7ff71c90 00370 (v01 PTLTD  POWERNOW 0604  LTP 0001)
[0.00] ACPI: Local APIC address 0xfee0
[0.00] SRAT: PXM 0 - APIC 0 - Node 0
[0.00] SRAT: PXM 1 - APIC 1 - Node 1
[0.00] SRAT: Node 0 PXM 0 0-a
[0.00] SRAT: Node 0 PXM 0 10-4000
[0.00] SRAT: Node 1 PXM 1 4000-8000
[

Bug#603229: Further information

2010-11-22 Thread Tilo Hacke
Hi !


I have just tried the last 2.6.31-2 and it is working flawlessly.

Furthermore, I have set up a serial connection to my SB1500 and so got a
log of the boot process and the crash:

===

bash-3.00$ tip hardwire
connected
[0.00] Initializing cgroup subsys cpuset
[0.00] Initializing cgroup subsys cpu
[0.00] Linux version 2.6.32-5-amd64 (Debian 2.6.32-27) (m...@debian.org) (gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Sat Oct 30 14:18:21 UTC 2010
[0.00] Command line: root=LABEL=S_rt ro console=ttyS0
[0.00] KERNEL supported cpus:
[0.00]   Intel GenuineIntel
[0.00]   AMD AuthenticAMD
[0.00]   Centaur CentaurHauls
[0.00] BIOS-provided physical RAM map:
[0.00]  BIOS-e820:  - 0009d400 (usable)
[0.00]  BIOS-e820: 0009d400 - 000a (reserved)
[0.00]  BIOS-e820: 000ce000 - 0010 (reserved)
[0.00]  BIOS-e820: 0010 - 7ff6 (usable)
[0.00]  BIOS-e820: 7ff6 - 7ff72000 (ACPI data)
[0.00]  BIOS-e820: 7ff72000 - 7ff8 (ACPI NVS)
[0.00]  BIOS-e820: 7ff8 - 8000 (reserved)
[0.00]  BIOS-e820: fec0 - fec00400 (reserved)
[0.00]  BIOS-e820: fee0 - fee01000 (reserved)
[0.00]  BIOS-e820: fff8 - 0001 (reserved)
[0.00] DMI present.
[0.00] last_pfn = 0x7ff60 max_arch_pfn = 0x4
[0.00] x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
[0.00] init_memory_mapping: -7ff6
[0.00] RAMDISK: 375e7000 - 37fef333
[0.00] ACPI: RSDP 000f7200 00024 (v02 PTLTD )
[0.00] ACPI: XSDT 7ff6d424 00044 (v01 PTLTD  ? XSDT 0604  LTP )
[0.00] ACPI: FACP 7ff71a2a 000F4 (v03 SUNSUNmetro 0604 PTEC 000F4240)
[0.00] ACPI: DSDT 7ff6d468 0454E (v01SUNK85AE 0604 MSFT 010D)
[0.00] ACPI: FACS 7ff72fc0 00040
[0.00] ACPI: SRAT 7ff71b1e 000C8 (v01 AMDHAMMER 0604 AMD  0001)
[0.00] ACPI: APIC 7ff71be6 000AA (v01 PTLTD  ? APIC 0604  LTP )
[0.00] ACPI: SSDT 7ff71c90 00370 (v01 PTLTD  POWERNOW 0604  LTP 0001)
[0.00] SRAT: PXM 0 - APIC 0 - Node 0
[0.00] SRAT: PXM 1 - APIC 1 - Node 1
[0.00] SRAT: Node 0 PXM 0 0-a
[0.00] SRAT: Node 0 PXM 0 10-4000
[0.00] SRAT: Node 1 PXM 1 4000-8000
[0.00] Bootmem setup node 0 -4000
[0.00]   NODE_DATA [b040 - 0001303f]
[0.00]   bootmap [00014000 -  0001bfff] pages 8
[0.00] (8 early reservations) == bootmem [00 - 004000]
[0.00]   #0 [00 - 001000]   BIOS data page == [00 - 001000]
[0.00]   #1 [006000 - 008000]   TRAMPOLINE == [006000 - 008000]
[0.00]   #2 [000100 - 0001688414] TEXT DATA BSS == [000100 - 0001688414]
[0.00]   #3 [00375e7000 - 0037fef333]  RAMDISK == [00375e7000 - 0037fef333]
[0.00]   #4 [09d400 - 10] BIOS reserved == [09d400 - 10]
[0.00]   #5 [0001689000 - 00016890c8]  BRK == [0001689000 - 00016890c8]
[0.00]   #6 [008000 - 00a000]  PGTABLE == [008000 - 00a000]
[0.00]   #7 [00a000 - 00b040]   MEMNODEMAP == [00a000 - 00b040]
[0.00]   Bootmem setup node 1 4000-7ff6
[0.00]   NODE_DATA [4000 - 40007fff]
[0.00]   bootmap [40008000 -  4000ffef] pages 8
[0.00] (8 early reservations) == bootmem [004000 - 007ff6]
[0.00]   #0 [00 - 001000]   BIOS data page
[0.00]   #1 [006000 - 008000]   TRAMPOLINE
[0.00]   #2 [000100 - 0001688414] TEXT DATA BSS
[0.00]   #3 [00375e7000 - 0037fef333]  RAMDISK
[0.00]   #4 [09d400 - 10] BIOS reserved
[0.00]   #5 [0001689000 - 00016890c8]  BRK
[0.00]   #6 [008000 - 00a000]  PGTABLE
[0.00]   #7 [00a000 - 00b040]   MEMNODEMAP
[0.00] found SMP MP-table at [880f7250] f7250
[0.00] Zone PFN ranges:
[0.00]   DMA  0x - 0x1000
[0.00]   DMA32    0x1000 - 0x0010
[0.00]   Normal   0x0010 - 0x0010
[0.00] Movable zone start PFN for each node
[0.00] early_node_map[3] active PFN ranges
[0.00] 0: 0x - 0x009d
[0.00] 0: 0x0100 - 0x0004
[0.00] 1: 0x0004 - 

Bug#603229: Further information

2010-11-22 Thread Ben Hutchings
On Mon, 2010-11-22 at 19:08 +0100, Tilo Hacke wrote:
> Hi !
> 
> 
> I have just tried the last 2.6.31-2 and it is working flawlessly.
> 
> Furthermore, I have set up a serial connection to my SB1500 and so got a
> log of the boot process and the crash:
[...]
> [0.536565] ERROR: domain-cpu_power not set
> [0.540002]
> [0.544002] ERROR: groups don't span domain-span
> [0.548011] ERROR: parent span is not a superset of domain-span
> [0.552016] ERROR: domain-cpu_power not set
> [0.556002]
> [0.560002] ERROR: groups don't span domain-span
> [0.564024] ERROR: domain-cpu_power not set
> [0.568005]
> [0.572002] ERROR: groups don't span domain-span
> [0.576023] ERROR: domain-cpu_power not set
> [0.580002]
> [0.584002] ERROR: groups don't span domain-span
[...]

This shows something about what's going wrong.  Could you please try
adding 'debug' to the kernel parameters?  That will show some more
context for these errors.

Ben.

-- 
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.

