Hello, list. NOTE: I am not subscribed, so please CC me in any of your replies.
I recently set up a machine featuring an Intel ICH6 desktop board with two 200GB Seagates (ST3200822AS) in a mirrored volume, AHCI mode, with Linux kernel 2.6.10 and ext3. After a relatively small amount of I/O (for example, compiling ISC bind), I get lock-ups (keyboard LEDs do not respond, SysRq works :)).

Timed vmstat snapshots taken during a couple of bonnie++ tests are available upon request. It's evident from those that I/O was rather bursty with regard to interactivity, and almost all lock-ups happened during the rewrite phase of the second consecutive test run, when cached memory was usually at around 1.75GB. If I only let bonnie++ do one test and then exit() before re-running it, I was able to perform numerous tests (provided the thing survived the first one, of course), as cache size went down to ~12MB after exit(). Setting chunk size and/or buffered I/O had marginal, if any, effect on the success of the test.

The usual stack trace after a lock-up looks like this:

    Pid: 1399, comm: kmirrord/0
    EIP: 0060:[<c0725421>] CPU: 0
    EIP is at __read_lock_failed+0x5/0x14
     EFLAGS: 00000297    Not tainted  (2.6.10-sk98-4k-p0f-smp)
    EAX: c23a959c EBX: c23a959c ECX: c23a9580 EDX: c0a21000
    ESI: 000134b3 EDI: c23a958c EBP: 00000000 DS: 007b ES: 007b
    CR0: 8005003b CR2: 0809b890 CR3: 32159000 CR4: 000006d0
     [<c0726cf5>] _read_lock+0x15/0x20
     [<c05d65c7>] rh_dec+0x27/0xe0
     [<c05d7832>] mirror_end_io+0x32/0x40
     [<c05c997c>] clone_endio+0x7c/0x130
     [<c05d6c70>] write_callback+0x0/0x80
     [<c01692d5>] bio_endio+0x55/0x80
     [<c05d6cd7>] write_callback+0x67/0x80
     [<c05d0057>] dec_count+0x57/0x80
     [<c05d019d>] endio+0x5d/0x80
     [<c01692d5>] bio_endio+0x55/0x80
     [<c0478367>] __end_that_request_first+0x1c7/0x240
     [<c04ee14b>] scsi_end_request+0x3b/0x100
     [<c04ee4de>] scsi_io_completion+0x10e/0x4e0
     [<c0503841>] sd_rw_intr+0x81/0x280
     [<c04e921f>] scsi_finish_command+0x7f/0xe0
     [<c04e9128>] scsi_softirq+0xc8/0xf0
     [<c012752d>] __do_softirq+0xbd/0xd0
     [<c010591a>] do_softirq+0x4a/0x60
     =======================
     [<c0141719>] irq_exit+0x39/0x40
     [<c01057e3>] do_IRQ+0x63/0xb0
     [<c0103c36>] common_interrupt+0x1a/0x20
     [<c014007b>] audit_log+0x3b/0x41
     [<c0726eb9>] _spin_unlock_irq+0x9/0x20
     [<c05d61aa>] __rh_alloc+0xfa/0x100
     [<c05d655c>] rh_inc+0xac/0xb0
     [<c05d6593>] rh_inc_pending+0x33/0x40
     [<c05d6eb6>] do_writes+0xe6/0x1d0
     [<c05d7025>] do_mirror+0x85/0x90
     [<c05d7068>] do_work+0x38/0x70
     [<c0132d10>] worker_thread+0x1d0/0x270
     [<c05d7030>] do_work+0x0/0x70
     [<c011dbd0>] default_wake_function+0x0/0x20
     [<c011dbd0>] default_wake_function+0x0/0x20
     [<c0132b40>] worker_thread+0x0/0x270
     [<c013721a>] kthread+0xba/0xc0
     [<c0137160>] kthread+0x0/0xc0
     [<c0101305>] kernel_thread_helper+0x5/0x10

I was able to obtain the trace above directly, with no previous action, whereas a couple of other stack traces were only possible after a SAK. With the exceptions noted below, they were always identical. Sometimes a __wake_up_common and a __mod_timer would appear just below scsi_io_completion, and an audit_log was sometimes missing below common_interrupt, but I suppose that's really far too far down the stack to be of significance.

Board data:

    Product number: D925XCV Desktop Board
    BIOS Version:   CV92510A.86A.0249.2004.0819.1259
    Board version:  AAC57587-404

I had to apply a patch from a beta sk98 driver to get support for the Marvell Gbit card on the board. No other patches were added to the kernel.

The ICH6 controller is in AHCI mode and the AHCI driver is the only low-level SCSI driver included in the kernel (NOTE: I have no idea how the QLogic driver came in :)):

    CONFIG_SCSI=y
    CONFIG_SCSI_PROC_FS=y
    CONFIG_SCSI_MULTI_LUN=y
    CONFIG_SCSI_CONSTANTS=y
    CONFIG_SCSI_LOGGING=y
    CONFIG_SCSI_SPI_ATTRS=y
    CONFIG_SCSI_FC_ATTRS=y
    CONFIG_SCSI_SATA=y
    CONFIG_SCSI_SATA_AHCI=y
    CONFIG_SCSI_ATA_PIIX=m
    CONFIG_SCSI_QLA2XXX=y

Another interesting thing - when I start the dmraid up, I usually (in 999 out of 1000 cases) get the following error messages from the kernel:

    attempt to access beyond end of device
    sda: rw=0, want=390725760, limit=390721968
    attempt to access beyond end of device
    sda: rw=0, want=390725888, limit=390721968
    attempt to access beyond end of device
    sda: rw=0, want=390726016, limit=390721968
    attempt to access beyond end of device
    sda: rw=0, want=390726144, limit=390721968
    attempt to access beyond end of device
    sdb: rw=1, want=390725760, limit=390721968
    attempt to access beyond end of device
    sdb: rw=1, want=390725888, limit=390721968
    attempt to access beyond end of device
    sdb: rw=1, want=390726016, limit=390721968
    attempt to access beyond end of device
    sdb: rw=1, want=390726144, limit=390721968
    attempt to access beyond end of device
    ...

Nothing b0rks though. Is this something to worry about?

Let me know if I can provide any additional information, or if anybody is interested in those vmstat dumps. I'll retry with whatever state the 2.6.11 AHCI driver is in in the mean time.

With kind regards,
-- 
Grega Bremec
gregab at p0f dot net
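
P.S. A quick sanity check on the "beyond end of device" numbers, in case it helps whoever looks at this. The sketch below parses a few of the quoted messages and computes how far past the device limit each request reaches. It assumes that want and limit are counted in 512-byte sectors (SECTOR_BYTES is my assumption, as are the helper name overshoots and the sample LOG string, which simply copies the sda lines from this mail). Under that assumption the first request ends 3792 sectors past the end of the disk, and consecutive requests step by 128 sectors.

```python
import re

SECTOR_BYTES = 512  # assumption: want/limit are expressed in 512-byte sectors

# Sample copied from the kernel messages quoted in this mail (sda only).
LOG = """\
sda: rw=0, want=390725760, limit=390721968
sda: rw=0, want=390725888, limit=390721968
sda: rw=0, want=390726016, limit=390721968
sda: rw=0, want=390726144, limit=390721968
"""

PATTERN = re.compile(r"(\w+): rw=(\d+), want=(\d+), limit=(\d+)")


def overshoots(log):
    """Return (device, sectors past the device limit) for each message."""
    result = []
    for dev, _rw, want, limit in PATTERN.findall(log):
        result.append((dev, int(want) - int(limit)))
    return result


rows = overshoots(LOG)
for dev, extra in rows:
    print(f"{dev}: {extra} sectors ({extra * SECTOR_BYTES} bytes) past the limit")
```

If the overshoot turns out to be the same on every boot, that would suggest the mapping set up by dmraid is slightly larger than what the kernel thinks the disks are, rather than random corruption, but that is only a guess on my part.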