Bug#667858: linux-image-2.6.32-5-amd64: Crash in ext3 mark_inode_dirty

2012-04-07 Thread Steven Ihde

On Apr 6, 2012, at 10:51 PM, Ben Hutchings wrote:
> More importantly, a machine check exception (MCE) indicates faulty
> hardware - this could be the processor, motherboard, memory (if it has
> ECC) or even an expansion card.  Whatever it is, that is quite likely to
> be the cause of the problem and must be fixed before we do any
> investigation of software.  You should find some record of the MCE in
> the kernel log which may provide a hint as to what is faulty.

Thanks for responding.  /var/log/mcelog contains a few cycles of thermal events 
(like below), but nothing since Feb. 3.  The most recent crash was yesterday 
morning.  Is the mention of MCE generated by reportbug talking about this 
thermal event, or does that mean there is something more recent that's not 
showing up in /var/log/mcelog?

$ ls -l /var/log/mcelog
-rw-r--r-- 1 root root 23569 Feb  3 08:15 /var/log/mcelog

$ tail /var/log/mcelog
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 THERMAL EVENT TSC 11634544f0
Processor core is above trip temperature. Throttling enabled.
STATUS 3 MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 THERMAL EVENT TSC 11636e2d01
Processor core below trip temperature. Throttling disabled
STATUS 2 MCGSTATUS 0
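
To see at a glance which records in a log like this are thermal events, the output above can be parsed with a few lines of Python. This is only a rough sketch: it assumes the record layout shown in this message (a `CPU n THERMAL EVENT TSC x` line followed by a description line), and the function name `parse_thermal_events` is made up for illustration. Other mcelog versions may format records differently.

```python
import re

# Matches lines such as "CPU 1 THERMAL EVENT TSC 11634544f0"
# (layout assumed from the sample in this thread).
EVENT_RE = re.compile(r"^CPU (\d+) THERMAL EVENT TSC ([0-9a-fA-F]+)$")


def parse_thermal_events(text):
    """Return a list of (cpu, tsc, description) tuples, one per thermal event."""
    events = []
    lines = text.splitlines()
    for i, line in enumerate(lines):
        match = EVENT_RE.match(line.strip())
        if match:
            # The line following the CPU line holds the readable description.
            desc = lines[i + 1].strip() if i + 1 < len(lines) else ""
            events.append((int(match.group(1)), int(match.group(2), 16), desc))
    return events


# Sample copied from the tail of /var/log/mcelog quoted above.
SAMPLE = """\
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 THERMAL EVENT TSC 11634544f0
Processor core is above trip temperature. Throttling enabled.
STATUS 3 MCGSTATUS 0
MCE 1
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 1 THERMAL EVENT TSC 11636e2d01
Processor core below trip temperature. Throttling disabled
STATUS 2 MCGSTATUS 0
"""

if __name__ == "__main__":
    for cpu, tsc, desc in parse_thermal_events(SAMPLE):
        print("CPU %d, TSC 0x%x: %s" % (cpu, tsc, desc))
```

Records that are not thermal events are skipped by this sketch, so any MCE it does not list would still need to be read by hand.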








Bug#667858: linux-image-2.6.32-5-amd64: Crash in ext3 mark_inode_dirty

2012-04-07 Thread Ben Hutchings
On Fri, 2012-04-06 at 23:07 -0700, Steven Ihde wrote:
> On Apr 6, 2012, at 10:51 PM, Ben Hutchings wrote:
> > More importantly, a machine check exception (MCE) indicates faulty
> > hardware - this could be the processor, motherboard, memory (if it has
> > ECC) or even an expansion card.  Whatever it is, that is quite likely to
> > be the cause of the problem and must be fixed before we do any
> > investigation of software.  You should find some record of the MCE in
> > the kernel log which may provide a hint as to what is faulty.
>
> Thanks for responding.  /var/log/mcelog contains a few cycles of
> thermal events (like below), but nothing since Feb. 3.  The most
> recent crash was yesterday morning.  Is the mention of MCE generated
> by reportbug talking about this thermal event,

reportbug just tells us whether the kernel detected an MCE (or any of
several other problems).

> or does that mean there is something more recent that's not showing up
> in /var/log/mcelog?
[...]

I don't know.  Fix the cooling first; your system is not going to be
stable until you do that.
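
One way to keep an eye on temperatures while working on the cooling is to read the kernel's thermal zones from sysfs. The sketch below assumes the machine exposes `/sys/class/thermal/thermal_zone*/temp` files reporting millidegrees Celsius; not every system or kernel provides them, and the helper name `read_zone_temps` is hypothetical.

```python
import glob
import os


def read_zone_temps(base="/sys/class/thermal"):
    """Return {zone_name: degrees_celsius} for every readable thermal zone.

    Assumes each zone directory holds a 'temp' file in millidegrees Celsius.
    """
    temps = {}
    for path in sorted(glob.glob(os.path.join(base, "thermal_zone*", "temp"))):
        zone = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as handle:
                temps[zone] = int(handle.read().strip()) / 1000.0
        except (OSError, ValueError):
            # Skip zones that cannot be read or do not report a number.
            continue
    return temps


if __name__ == "__main__":
    for zone, celsius in read_zone_temps().items():
        print("%s: %.1f C" % (zone, celsius))
```

If the directory is missing or empty the function simply returns an empty dictionary, in which case a tool such as lm-sensors may be needed instead.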

Ben.

-- 
Ben Hutchings
Larkinson's Law: All laws are basically false.




Bug#667858: linux-image-2.6.32-5-amd64: Crash in ext3 mark_inode_dirty

2012-04-06 Thread Steven Ihde
Package: linux-2.6
Version: 2.6.32-41squeeze2
Severity: important


The kernel sometimes crashes, rendering the system unresponsive.  The stack
trace is visible on the TV monitor connected via HDMI.  The stack trace is 
always the same, beginning at a write system call and ending at 
__mark_inode_dirty.  

The system is used as a MythTV frontend and backend.
It spends a lot of time suspended so it could be related to suspend/resume.

This had gone on for quite a while with the standard Lenny kernel.  After
upgrading to Squeeze a few weeks ago, the problem continues to occur so
I'm reporting the bug now.

I will upload a photo of the stack trace; here are the function names:

__mark_inode_dirty
__block_commit_write
ext3_ordered_write_end
ext3_xattr_get
generic_file_buffered_write
__generic_file_aio_write
__switch_to
cpumask_any_but
generic_file_aio_write
do_sync_write
autoremove_wake_function
handle_mm_fault
do_fork
vfs_write
sys_write
system_call_fastpath

Thanks,

Steve

-- Package-specific info:
** Version:
Linux version 2.6.32-5-amd64 (Debian 2.6.32-41squeeze2) (da...@debian.org) (gcc version 4.3.5 (Debian 4.3.5-4) ) #1 SMP Thu Mar 22 17:26:33 UTC 2012

** Command line:
BOOT_IMAGE=/boot/vmlinuz-2.6.32-5-amd64 root=UUID=3c9beddf-b59c-4c7a-8022-14f268cec832 ro quiet

** Tainted: PM (17)
 * Proprietary module has been loaded.
 * System experienced a machine check exception.
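
For reference, the number in parentheses is a bitmask: 17 = 1 + 16, that is bit 0 (P) plus bit 4 (M). The following sketch decodes such a value. The flag table reflects my understanding of the taint bits in kernels of the 2.6.32 era and should be checked against the kernel documentation for the version in use; the function name `decode_taint` is made up for illustration.

```python
# Bit position, letter and meaning of each taint flag.
# Assumption: this matches the 2.6.32-era kernel; later kernels add more bits.
TAINT_FLAGS = [
    (0, "P", "Proprietary module has been loaded"),
    (1, "F", "Module was force-loaded"),
    (2, "S", "SMP kernel on hardware not rated for SMP"),
    (3, "R", "Module was force-unloaded"),
    (4, "M", "System experienced a machine check exception"),
    (5, "B", "Bad page reference or unexpected page flags"),
    (6, "U", "Taint requested by userspace"),
    (7, "D", "Kernel has oopsed"),
    (8, "A", "ACPI table overridden"),
    (9, "W", "Kernel issued a warning"),
    (10, "C", "Staging driver loaded"),
]


def decode_taint(value):
    """Return a list of (letter, meaning) pairs for each bit set in value."""
    return [(letter, meaning)
            for bit, letter, meaning in TAINT_FLAGS
            if value & (1 << bit)]


if __name__ == "__main__":
    # The running kernel exposes its current taint value in this file.
    with open("/proc/sys/kernel/tainted") as handle:
        current = int(handle.read().strip())
    for letter, meaning in decode_taint(current):
        print("%s: %s" % (letter, meaning))
```

Calling `decode_taint(17)` yields the P and M entries, matching the two lines reportbug printed above. The flag is set at the moment the kernel sees the event and stays set until reboot, so it says nothing about when the event happened.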

** Kernel log:
[18035.230578] e100 :07:08.0: restoring config space at offset 0xf (was 0x38080100, writing 0x3808010b)
[18035.230593] e100 :07:08.0: restoring config space at offset 0x5 (was 0x1, writing 0x1101)
[18035.230598] e100 :07:08.0: restoring config space at offset 0x4 (was 0x0, writing 0x59004000)
[18035.230603] e100 :07:08.0: restoring config space at offset 0x3 (was 0x0, writing 0x2010)
[18035.230609] e100 :07:08.0: restoring config space at offset 0x1 (was 0x290, writing 0x2900017)
[18035.230766] HDA Intel :00:1b.0: PCI INT A -> GSI 22 (level, low) -> IRQ 22
[18035.230772] HDA Intel :00:1b.0: setting latency timer to 64
[18035.230807] HDA Intel :00:1b.0: irq 31 for MSI/MSI-X
[18035.230843] uhci_hcd :00:1d.0: PCI INT A -> GSI 23 (level, low) -> IRQ 23
[18035.230849] uhci_hcd :00:1d.0: setting latency timer to 64
[18035.230870] usb usb2: root hub lost power or was reset
[18035.230889] uhci_hcd :00:1d.1: PCI INT B -> GSI 19 (level, low) -> IRQ 19
[18035.230895] uhci_hcd :00:1d.1: setting latency timer to 64
[18035.230915] usb usb3: root hub lost power or was reset
[18035.230932] uhci_hcd :00:1d.2: PCI INT C -> GSI 18 (level, low) -> IRQ 18
[18035.230937] uhci_hcd :00:1d.2: setting latency timer to 64
[18035.230957] usb usb4: root hub lost power or was reset
[18035.230973] uhci_hcd :00:1d.3: PCI INT D -> GSI 16 (level, low) -> IRQ 16
[18035.230978] uhci_hcd :00:1d.3: setting latency timer to 64
[18035.230998] usb usb5: root hub lost power or was reset
[18035.231037] ehci_hcd :00:1d.7: PCI INT A -> GSI 23 (level, low) -> IRQ 23
[18035.231044] ehci_hcd :00:1d.7: setting latency timer to 64
[18035.231054] pci :00:1e.0: setting latency timer to 64
[18035.231063] ata_piix :00:1f.1: PCI INT A -> GSI 18 (level, low) -> IRQ 18
[18035.231068] ata_piix :00:1f.1: setting latency timer to 64
[18035.232098] ata6: port disabled. ignoring.
[18035.232117] ahci :00:1f.2: setting latency timer to 64
[18035.866191] ata2: SATA link down (SStatus 0 SControl 300)
[18035.866221] ata4: SATA link down (SStatus 0 SControl 300)
[18035.866464] C-Media PCI :07:01.0: PCI INT A -> GSI 22 (level, low) -> IRQ 22
[18035.872361] ata5.00: ACPI cmd ef/03:0c:00:00:00:a0 (SET FEATURES) filtered out
[18035.872365] ata5.00: ACPI cmd ef/03:42:00:00:00:a0 (SET FEATURES) filtered out
[18035.872369] ata5.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
[18036.000264] ata5.00: configured for UDMA/33
[18036.176039] firewire_core: skipped bus generations, destroying all nodes
[18036.176068] pci :00:1e.0: wake-up capability disabled by ACPI
[18036.176076] e100 :07:08.0: PME# disabled
[18036.193816] parport_pc 00:07: activated
[18036.194328] serial 00:09: activated
[18036.196134] e100: eth0 NIC Link is Up 100 Mbps Full Duplex
[18036.492023] sd 0:0:0:0: [sda] Starting disk
[18036.676037] firewire_core: rediscovered device fw0
[18040.904014] ata3: link is slow to respond, please be patient (ready=0)
[18040.904021] ata1: link is slow to respond, please be patient (ready=0)
[18041.240026] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[18041.344287] ata1.00: configured for UDMA/133
[18041.352025] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[18041.385162] sd 2:0:0:0: [sdb] Starting disk
[18041.451601] ata3.00: configured for UDMA/133
[18041.748017] usb 5-2: reset low speed USB device using uhci_hcd and address 2
[18042.059284] PM: Finishing wakeup.
[18042.059287] Restarting tasks ... done.
[18042.429462] CPU0 attaching NULL sched-domain.

Bug#667858: linux-image-2.6.32-5-amd64: Crash in ext3 mark_inode_dirty

2012-04-06 Thread Ben Hutchings
On Fri, 2012-04-06 at 22:27 -0700, Steven Ihde wrote:
[...]
> ** Tainted: PM (17)
>  * Proprietary module has been loaded.
>  * System experienced a machine check exception.
[...]

The nvidia driver probably isn't responsible, but I suggest you disable
it so we can rule it out.

More importantly, a machine check exception (MCE) indicates faulty
hardware - this could be the processor, motherboard, memory (if it has
ECC) or even an expansion card.  Whatever it is, that is quite likely to
be the cause of the problem and must be fixed before we do any
investigation of software.  You should find some record of the MCE in
the kernel log which may provide a hint as to what is faulty.

Ben.

-- 
Ben Hutchings
Larkinson's Law: All laws are basically false.

