Bug#519158: Still problems with already changed hardware

2009-06-18 Thread Michael Rumpler
Here is some update on the problem...

I have changed the system in the following ways, as there were still
problems with the drives:

- I replaced the motherboard (Intel D945GCLF2), as the errors only
  occurred on drive sda (connected to one of the two SATA ports on the
  board itself)
- I replaced the SSD with 2 drives of the same model as the 3 drives on
  the SiI 3124 (none of those three has had similar problems since the
  machine has been running), so there are now two WDC WD5000ABPS-0
  Rev: 02.0 connected directly to the motherboard

I had hoped this would solve the problem...

But last night the SATA subsystem had trouble again:

-- snip 
Jun 18 06:46:37 atom kernel: [697701.292480] ata1.00: exception Emask
0x0 SAct 0x0 SErr 0x0 action 0x0
Jun 18 06:46:37 atom kernel: [697701.292520] ata1.00: cmd
ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Jun 18 06:46:37 atom kernel: [697701.292523]  res
51/04:00:0a:24:f9/00:00:00:00:00/a9 Emask 0x1 (device error)
Jun 18 06:46:37 atom kernel: [697701.292583] ata1.00: status: { DRDY
ERR }
Jun 18 06:46:37 atom kernel: [697701.292604] ata1.00: error: { ABRT }
Jun 18 06:46:38 atom kernel: [697701.316560] ata1.00: failed to read
native max address (err_mask=0x1)
Jun 18 06:46:38 atom kernel: [697701.316589] ata1.00: HPA support seems
broken, skipping HPA handling
Jun 18 06:46:38 atom kernel: [697701.828285] ata1.00: configured for
UDMA/133 (device error ignored)
Jun 18 06:46:38 atom kernel: [697701.828327] end_request: I/O error, dev
sda, sector 59922239
Jun 18 06:46:38 atom kernel: [697701.828357] md: super_written gets
error=-5, uptodate=0
Jun 18 06:46:38 atom kernel: [697701.828384] raid1: Disk failure on
sda1, disabling device.
Jun 18 06:46:38 atom kernel: [697701.828386] raid1: Operation continuing
on 1 devices.
Jun 18 06:46:38 atom kernel: [697701.828462] ata1: EH complete
Jun 18 06:46:38 atom kernel: [697701.828655] sd 0:0:0:0: [sda] 976773168
512-byte hardware sectors: (500 GB/465 GiB)
Jun 18 06:46:38 atom kernel: [697701.828769] sd 0:0:0:0: [sda] Write
Protect is off
Jun 18 06:46:38 atom kernel: [697701.828795] sd 0:0:0:0: [sda] Mode
Sense: 00 3a 00 00
Jun 18 06:46:38 atom kernel: [697701.828876] sd 0:0:0:0: [sda] Write
cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jun 18 06:46:38 atom kernel: [697702.255140] RAID1 conf printout:
Jun 18 06:46:38 atom kernel: [697702.255169]  --- wd:1 rd:2
Jun 18 06:46:38 atom kernel: [697702.255191]  disk 0, wo:0, o:1,
dev:sdb1
Jun 18 06:46:38 atom kernel: [697702.255213]  disk 1, wo:1, o:0,
dev:sda1
Jun 18 06:46:38 atom kernel: [697702.260017] RAID1 conf printout:
Jun 18 06:46:38 atom kernel: [697702.260040]  --- wd:1 rd:2
Jun 18 06:46:38 atom kernel: [697702.260060]  disk 0, wo:0, o:1,
dev:sdb1
Jun 18 06:50:14 atom kernel: [697918.50] ata1.00: exception Emask
0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jun 18 06:50:14 atom kernel: [697918.92] ata1.00: cmd
b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0
Jun 18 06:50:14 atom kernel: [697918.95]  res
40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 18 06:50:14 atom kernel: [697918.000155] ata1.00: status: { DRDY }
Jun 18 06:50:19 atom kernel: [697923.040022] ata1: link is slow to
respond, please be patient (ready=0)
Jun 18 06:50:24 atom kernel: [697928.024026] ata1: device not ready
(errno=-16), forcing hardreset
Jun 18 06:50:24 atom kernel: [697928.024061] ata1: soft resetting link
Jun 18 06:50:29 atom kernel: [697933.220020] ata1: link is slow to
respond, please be patient (ready=0)
Jun 18 06:50:39 atom kernel: [697942.536144] ata1.00: qc timeout (cmd
0xec)
Jun 18 06:50:39 atom kernel: [697942.536177] ata1.00: failed to IDENTIFY
(I/O error, err_mask=0x4)
Jun 18 06:50:39 atom kernel: [697942.536204] ata1.00: revalidation
failed (errno=-5)
Jun 18 06:50:44 atom kernel: [697947.576020] ata1: link is slow to
respond, please be patient (ready=0)
Jun 18 06:50:49 atom kernel: [697952.560023] ata1: device not ready
(errno=-16), forcing hardreset
Jun 18 06:50:49 atom kernel: [697952.560061] ata1: soft resetting link
Jun 18 06:50:58 atom kernel: [697961.476299] ata1.00: configured for
UDMA/133 (device error ignored)
Jun 18 06:50:58 atom kernel: [697961.476371] ata1: EH complete
Jun 18 06:50:58 atom kernel: [697961.478921] sd 0:0:0:0: [sda] 976773168
512-byte hardware sectors: (500 GB/465 GiB)
Jun 18 06:50:58 atom kernel: [697961.479078] sd 0:0:0:0: [sda] Write
Protect is off
Jun 18 06:50:58 atom kernel: [697961.479110] sd 0:0:0:0: [sda] Mode
Sense: 00 3a 00 00
Jun 18 06:50:58 atom kernel: [697961.479212] sd 0:0:0:0: [sda] Write
cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jun 18 06:55:11 atom kernel: [698215.63] ata1.00: exception Emask
0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jun 18 06:55:11 atom kernel: [698215.000101] ata1.00: cmd
b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0
Jun 18 06:55:11 atom kernel: [698215.000104]  res
40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 18 06:55:11 atom kernel: 
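
For anyone reading the cmd/res lines in the log above: the following is
a minimal sketch (not part of the original report) of how the leading
bytes can be decoded. It assumes the usual libata layout, where the
first byte of "cmd" is the ATA command opcode and the first two bytes of
"res" are the status and error registers. The function names are made up
for illustration, and the tables only cover the flags and opcodes that
appear in this thread.

```python
# Flags that libata prints in its "status: { ... }" and "error: { ... }" lines.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}

# ATA command opcodes seen in this thread (not a complete table).
OPCODES = {
    0xCA: "WRITE DMA",
    0xEA: "FLUSH CACHE EXT",
    0xB0: "SMART",
    0xEC: "IDENTIFY DEVICE",
    0xEF: "SET FEATURES",
}


def flags(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for bit, name in sorted(table.items(), reverse=True) if value & bit]


def decode_res(field):
    """Decode the status/error bytes at the start of a libata 'res' field."""
    status_hex, error_hex = field.split(":", 1)[0].split("/")
    return flags(int(status_hex, 16), STATUS_BITS), flags(int(error_hex, 16), ERROR_BITS)


def decode_cmd(field):
    """Name the opcode at the start of a libata 'cmd' field, if known."""
    return OPCODES.get(int(field.split("/", 1)[0], 16), "unknown")


if __name__ == "__main__":
    print(decode_cmd("ea/00:00:00:00:00/00:00:00:00:00/a0"))
    print(decode_res("51/04:00:0a:24:f9/00:00:00:00:00/a9"))
```

The decoded flags should match the "status:" and "error:" lines the
kernel logs right after each "res" line, which is a quick way to check
the sketch against the log itself.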

Bug#519158:

2009-03-18 Thread Michael Rumpler
The ATA subsystem has been working so far, but with the new kernel a
different problem showed up today:

[608566.964023] [ cut here ]
[608566.964054] WARNING: at net/sched/sch_generic.c:226 dev_watchdog
+0xf6/0x17c()
[608566.964092] NETDEV WATCHDOG: eth0 (r8169): transmit timed out
[608566.964117] Modules linked in: tun nfsd auth_rpcgss exportfs nfs
lockd nfs_acl sunrpc ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4
nf_conntrack nf_defrag_ipv4 ip_tables x_tables ipv6 smsc47m192 hwmon_vid
loop joydev evdev psmouse snd_pcm snd_timer snd soundcore snd_page_alloc
serio_raw pcspkr rng_core i2c_i801 i2c_core iTCO_wdt asix usbnet button
intel_agp agpgart ext3 jbd mbcache dm_mirror dm_region_hash dm_log
dm_snapshot dm_mod raid456 md_mod async_xor async_memcpy async_tx xor
usbhid hid usb_storage sd_mod crc_t10dif ata_piix ata_generic sata_sil24
ehci_hcd libata ide_pci_generic scsi_mod ide_core uhci_hcd usbcore r8169
mii thermal processor fan thermal_sys
[608566.964488] Pid: 0, comm: swapper Not tainted 2.6.28-1-686 #1
[608566.964512] Call Trace:
[608566.964538]  [c0126d36] warn_slowpath+0x5a/0x79
[608566.964566]  [c01f85d4] __next_cpu+0x12/0x21
[608566.964591]  [c011f888] find_busiest_group+0x307/0x78f
[608566.964620]  [c013ca9b] getnstimeofday+0x4f/0xd1
[608566.964646]  [c01fc1fc] strlcpy+0x11/0x3d
[608566.964669]  [c028dac2] dev_watchdog+0xf6/0x17c
[608566.964695]  [c013aeb3] sched_clock_tick+0x95/0x9e
[608566.964721]  [c0138da6] hrtimer_forward+0x10c/0x124
[608566.964747]  [c013ca9b] getnstimeofday+0x4f/0xd1
[608566.964773]  [c011025d] lapic_next_event+0x10/0x13
[608566.964798]  [c028d9cc] dev_watchdog+0x0/0x17c
[608566.964823]  [c012e4b4] run_timer_softirq+0x14a/0x1b4
[608566.964849]  [c028d9cc] dev_watchdog+0x0/0x17c
[608566.964874]  [c012b28f] __do_softirq+0x8c/0x130
[608566.965679]  [c012b378] do_softirq+0x45/0x53
[608566.965703]  [c012b480] irq_exit+0x35/0x69
[608566.965728]  [c01109c5] smp_apic_timer_interrupt+0x6e/0x78
[608566.965754]  [c0104620] apic_timer_interrupt+0x28/0x30
[608566.965780]  [c010924a] mwait_idle+0x2f/0x3b
[608566.965804]  [c0102a37] cpu_idle+0x71/0x8a
[608566.965827] ---[ end trace 23769fc216abaa67 ]---
[608566.982729] r8169: eth0: link up

Somehow the onboard LAN hiccuped. The disk subsystem is still fine,
but the symptoms look similar to the problems above.
The LAN came back up by itself without a reboot of the machine.

Any suggestions?




-- 
To UNSUBSCRIBE, email to debian-kernel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#519158: linux-image-2.6.26-1-686: ATA subsystem crashes and renders single drive unusable

2009-03-11 Thread Michael Rumpler
On Tue, 2009-03-10 at 19:19 +0100, maximilian attems wrote:
> can you try 2.6.28 sid snapshot, see sid aptline
> - http://wiki.debian.org/DebianKernel

Just installed and booted it.

[m...@atom ~]$ uname -a
Linux atom 2.6.28-1-686 #1 SMP Wed Mar 11 04:36:21 UTC 2009 i686
GNU/Linux

The longest delay before the problem showed up has been 18 days so far,
so we have to wait...
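
As a side note for judging such delays from the logs: the bracketed
number at the start of each kernel message is, assuming the usual printk
time stamps, the number of seconds since boot. A minimal sketch for
converting it to an uptime follows; the function name is made up for
illustration. Note that this gives the uptime at the moment of the
message, not the time since a previous failure, and that the kernel
clock can drift against the syslog wall-clock time.

```python
import re
from datetime import timedelta

# Matches the first printk time stamp in a line, e.g. "[1953069.141042]".
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]")


def uptime_of(line):
    """Return the uptime encoded in a kernel log line, or None if absent."""
    match = STAMP.search(line)
    if match is None:
        return None
    return timedelta(seconds=float(match.group(1)))


if __name__ == "__main__":
    line = "Mar  8 21:19:08 atom kernel: [1953069.141042] ata5.00: BMDMA stat 0x26"
    print(uptime_of(line))
```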







Bug#519158: linux-image-2.6.26-1-686: ATA subsystem crashes and renders single drive unusable

2009-03-10 Thread Michael Rumpler
Package: linux-image-2.6.26-1-686
Version: 2.6.26-13
Severity: important


I am using the following hardware configuration:
Mainboard: Intel D945GCLF2
Additional SATA Controller: Silicon Image, Inc. SiI 3124

Connected directly to the board: 1 Transcend TS32GSSD25S-M SATA SSD Rev: V082
Connected to the SiI controller: 3 WDC WD5000ABPS-0 Rev: 02.0

From time to time (at intervals of 1 to 4 weeks) the system gets into a
state where the following kernel messages are logged:

Mar  8 21:19:08 atom kernel: [1953069.141042] [ cut here 
]
Mar  8 21:19:08 atom kernel: [1953069.141042] WARNING: at 
drivers/ata/libata-sff.c:1321 ata_sff_hsm_move+0x5ff/0x674 [libata]()
Mar  8 21:19:08 atom kernel: [1953069.141042] Modules linked in: tun nfsd 
auth_rpcgss exportfs nfs lockd nfs_acl sunrpc ipt_MASQUERADE iptable_nat nf_nat 
nf_conntrack_ipv4 nf_conntrack ip_tables x_tables ipv6 usb_storage smsc47m192 
hwmon_vid loop serio_raw rng_core button psmouse asix usbnet snd_pcm snd_timer 
snd iTCO_wdt i2c_i801 soundcore mii snd_page_alloc i2c_core pcspkr intel_agp 
agpgart joydev evdev ext3 jbd mbcache dm_mirror dm_log dm_snapshot dm_mod 
raid456 md_mod async_xor async_memcpy async_tx xor sd_mod ata_piix usbhid hid 
ff_memless ata_generic sata_sil24 libata scsi_mod piix dock ide_pci_generic 
ide_core ehci_hcd uhci_hcd usbcore r8169 thermal processor fan thermal_sys
Mar  8 21:19:08 atom kernel: [1953069.141042] Pid: 0, comm: swapper Tainted: G  
  W 2.6.26-1-686 #1
Mar  8 21:19:08 atom kernel: [1953069.141042]  [c012256f] 
warn_on_slowpath+0x40/0x66
Mar  8 21:19:08 atom kernel: [1953069.141042]  [c01318e9] 
autoremove_wake_function+0xd/0x2d
Mar  8 21:19:08 atom kernel: [1953069.141042]  [c011845d] 
__wake_up_common+0x2e/0x58
Mar  8 21:19:08 atom kernel: [1953069.141042]  [c011a641] __wake_up+0x29/0x39
Mar  8 21:19:08 atom kernel: [1953069.141042]  [f8940b46] 
md_wakeup_thread+0x1e/0x20 [md_mod]
Mar  8 21:19:08 atom kernel: [1953069.141042]  [f8974906] 
release_stripe+0x21/0x2e [raid456]
Mar  8 21:19:08 atom kernel: [1953069.141042]  [f897837a] 
raid5_end_write_request+0x0/0x99 [raid456]
Mar  8 21:19:08 atom kernel: [1953069.141042]  [f88e73d0] 
scsi_run_queue+0x200/0x219 [scsi_mod]
Mar  8 21:19:08 atom kernel: [1953069.141042]  [c01d0175] 
elv_queue_empty+0x1d/0x1e
Mar  8 21:19:08 atom kernel: [1953069.141042]  [f89111d5] 
ata_sff_hsm_move+0x5ff/0x674 [libata]
Mar  8 21:19:08 atom kernel: [1953069.141042]  [f88e7b3b] 
scsi_end_request+0x62/0x6b [scsi_mod]
Mar  8 21:19:08 atom kernel: [1953069.141042]  [f88e863b] 
scsi_io_completion+0x1a6/0x363 [scsi_mod]
Mar  8 21:19:08 atom kernel: [1953069.141042]  [f8911e88] 
ata_sff_interrupt+0x127/0x19f [libata]
Mar  8 21:19:08 atom kernel: [1953069.141042]  [c0151fd2] 
handle_IRQ_event+0x23/0x51
Mar  8 21:19:08 atom kernel: [1953069.141042]  [c01530d1] 
handle_fasteoi_irq+0x71/0xa4
Mar  8 21:19:08 atom kernel: [1953069.141042]  [c0105f3a] do_IRQ+0x4d/0x63
Mar  8 21:19:08 atom kernel: [1953069.141043]  [c0108bbf] mwait_idle+0x0/0x3d
Mar  8 21:19:08 atom kernel: [1953069.141043]  [c01042a7] 
common_interrupt+0x23/0x28
Mar  8 21:19:08 atom kernel: [1953069.141043]  [c0108bbf] mwait_idle+0x0/0x3d
Mar  8 21:19:08 atom kernel: [1953069.141043]  [c0108bee] mwait_idle+0x2f/0x3d
Mar  8 21:19:08 atom kernel: [1953069.141043]  [c01025ce] cpu_idle+0xab/0xcb
Mar  8 21:19:08 atom kernel: [1953069.141043]  ===

After that, the SSD connected to the board itself starts to cause
trouble:

Mar  8 21:19:08 atom kernel: [1953069.141042] ata5.00: exception Emask 0x0 SAct 
0x0 SErr 0x0 action 0x0
Mar  8 21:19:08 atom kernel: [1953069.141042] ata5.00: BMDMA stat 0x26
Mar  8 21:19:08 atom kernel: [1953069.141042] ata5.00: cmd 
ca/00:08:a7:20:14/00:00:00:00:00/e0 tag 0 dma 4096 out
Mar  8 21:19:08 atom kernel: [1953069.141042]  res 
51/40:08:a7:20:14/00:00:00:00:00/e0 Emask 0x29 (host bus error)
Mar  8 21:19:08 atom kernel: [1953069.141042] ata5.00: status: { DRDY ERR }
Mar  8 21:19:08 atom kernel: [1953069.141042] ata5.00: error: { UNC }
Mar  8 21:19:38 atom kernel: [1953106.145781] ata5.00: qc timeout (cmd 0xef)
Mar  8 21:19:38 atom kernel: [1953106.145818] ata5.00: failed to set xfermode 
(err_mask=0x4)
Mar  8 21:19:38 atom kernel: [1953106.145845] ata5: failed to recover some 
devices, retrying in 5 secs
Mar  8 21:19:48 atom kernel: [1953119.090028] ata5: link is slow to respond, 
please be patient (ready=0)
Mar  8 21:19:53 atom kernel: [1953126.002047] ata5: device not ready 
(errno=-16), forcing hardreset
Mar  8 21:19:53 atom kernel: [1953126.002091] ata5: soft resetting link
Mar  8 21:19:58 atom kernel: [1953132.995359] ata5: link is slow to respond, 
please be patient (ready=0)
Mar  8 21:20:03 atom kernel: [1953139.929986] ata5: SRST failed (errno=-16)
Mar  8 21:20:03 atom kernel: [1953139.930027] ata5: soft resetting link
Mar  8 21:20:08 atom kernel: [1953146.559721] ata5: link is slow to respond, 
please be patient (ready=0)
Mar  8 21:20:13