Finally set up network logging: kernel -> syslog-ng -> TCP (crossover) -> syslog-ng (other box) -> log file. This time, I actually caught something from the crash. It may be hardware-related, but I thought I'd report it here this time because the crash was definitely in Reiser4 code. This may or may not be relevant:
Oct 6 02:15:27 elite irq event 217: bogus return value 8027d751 Oct 6 02:15:27 elite Oct 6 02:15:27 elite Call Trace: <IRQ> <ffffffff8028f765>{__report_bad_irq+53} Oct 6 02:15:27 elite <ffffffff8028f812>{note_interrupt+82} <ffffffff8028f1e9>{_ _do_IRQ+169} Oct 6 02:15:27 elite <ffffffff802611e2>{do_IRQ+66} <ffffffff8025f290>{default_i dle+0} Oct 6 02:15:27 elite <ffffffff802582ac>{ret_from_intr+0} <EOI> <ffffffff8025a76 f>{thread_return+0} Oct 6 02:15:27 elite <ffffffff8025f2ba>{default_idle+42} <ffffffff8024592d>{cpu _idle+61} Oct 6 02:15:27 elite <ffffffff8048384f>{start_kernel+495} <ffffffff80483255>{_s inittext+597} Oct 6 02:15:27 elite handlers: Oct 6 02:15:27 elite [<ffffffff8033cea0>] (usb_hcd_irq+0x0/0x60) Oct 6 02:15:27 elite [<ffffffff880b55a0>] (nv_nic_irq+0x0/0x180 [forcedeth]) This message repeated a couple of times beforehand. The only log entries between that and the crash are my ntpd, apparently trying to set the clock, apparently not bothering to actually do it: Oct 6 02:16:03 elite ntpd[4996]: adjusting local clock by -2012.380177s Oct 6 02:19:55 elite ntpd[4996]: adjusting local clock by -2012.350292s Oct 6 02:20:25 elite ntpd[4996]: adjusting local clock by -2012.088300s Oct 6 02:24:42 elite ntpd[4996]: adjusting local clock by -2011.895028s Oct 6 02:27:58 elite ntpd[4996]: adjusting local clock by -2011.637006s Oct 6 02:32:12 elite ntpd[4996]: adjusting local clock by -2011.575243s Oct 6 02:33:15 elite ntpd[4996]: adjusting local clock by -2011.448711s Oct 6 02:35:23 elite ntpd[4996]: adjusting local clock by -2011.293673s I can't believe my clock is skewed that badly, especially when I manually restart it, I get log entries like this: Oct 5 04:05:50 grunt ntpd[6601]: adjusting local clock by -0.146991s I believe this is the crash, though: Oct 6 02:38:36 elite Unable to handle kernel NULL pointer dereference at 000000 0000000038 RIP: Oct 6 02:38:36 elite <ffffffff880808e7>{:reiser4:search_one_bitmap_forward+135} Oct 6 02:38:36 elite PGD 7552e067 PUD 7de47067 PMD 0 Oct 6 02:38:36 elite Oops: 0000 [1] Oct 6 02:38:36 elite CPU 0 Oct 6 02:38:36 elite Modules linked in: xt_tcpudp xt_state ip_conntrack iptable _filter ip_tables nfnetlink_queue nfnetlink xt_NFQUEUE x_tables nfs nfsd exportf s lockd sunrpc snd_seq_midi snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd _seq_midi_emul snd_pcm_oss snd_mixer_oss snd_seq_dummy snd_seq_oss snd_seq_midi_ event snd_seq snd_emu10k1 snd_rawmidi snd_ac97_codec snd_ac97_bus snd_pcm snd_se q_device snd_timer snd_page_alloc snd_util_mem snd_hwdep snd soundcore nvidia af _packet usb_storage joydev ide_cd cdrom amd74xx ide_core sk98lin forcedeth unix reiser4 dm_mod sd_mod sata_nv libata scsi_mod Oct 6 02:38:36 elite Pid: 428, comm: ktxnmgrd:dm-4:r Tainted: P 2.6.17.13 #2 Oct 6 02:38:36 elite RIP: 0010:[<ffffffff880808e7>] <ffffffff880808e7>{:reiser4 :search_one_bitmap_forward+135} Oct 6 02:38:36 elite RSP: 0018:ffff81007f1f5888 EFLAGS: 00010246 Oct 6 02:38:36 elite RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff81007 f1f591c Oct 6 02:38:36 elite RDX: ffffc2000004d660 RSI: ffff810001d8fc10 RDI: ffff81000 1d8fc10 Oct 6 02:38:36 elite RBP: ffff81007f1f591c R08: ffff81007f1f5778 R09: 000000000 0000000 Oct 6 02:38:36 elite R10: 0000000000000010 R11: ffff81007f1f57d8 R12: 000000000 0000001 Oct 6 02:38:36 elite R13: 0000000000007fe0 R14: 0000000000007fe0 R15: ffff81007 f1f5970 Oct 6 02:38:36 elite FS: 00002af33ea0f1b0(0000) GS:ffffffff8047a000(0000) knlG S:0000000000000000 Oct 6 02:38:36 elite CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b Oct 6 02:38:36 elite CR2: 0000000000000038 CR3: 000000007d4f1000 CR4: 000000000 00006e0 Oct 6 02:38:36 elite Process ktxnmgrd:dm-4:r (pid: 428, threadinfo ffff81007f1f 4000, task ffff81007e64a200) Oct 6 02:38:36 elite Stack: 0000000000008000 0000000100000001 ffff81007f1f591c ffffc2000004d660 Oct 6 02:38:36 elite ffff81003df44004 000000007f1f5ae8 0000002000000020 ffff810 07f1f5968 Oct 6 02:38:36 elite ffff81007f1f591c 0000000000000001 Oct 6 02:38:36 elite Call Trace: <ffffffff88080b11>{:reiser4:bitmap_alloc_forwa rd+145} Oct 6 02:38:36 elite <ffffffff88080d2f>{:reiser4:alloc_blocks_bitmap+447} Oct 6 02:38:36 elite <ffffffff88060e53>{:reiser4:plugin_by_unsafe_id+35} Oct 6 02:38:36 elite <ffffffff8807f6a1>{:reiser4:item_length_by_coord+17} Oct 6 02:38:36 elite <ffffffff8804fa65>{:reiser4:reiser4_alloc_blocks+165} Oct 6 02:38:36 elite <ffffffff88053db9>{:reiser4:allocate_znode_update+265} Oct 6 02:38:36 elite <ffffffff8807f9d1>{:reiser4:item_body_by_coord_hard+17} Oct 6 02:38:36 elite <ffffffff880776ef>{:reiser4:internal_at+15} <ffffffff88077 709>{:reiser4:pointer_at+9} Oct 6 02:38:36 elite <ffffffff88077749>{:reiser4:down_link_internal+9} <fffffff f8804c115>{:reiser4:check_tree_pointer+117} Oct 6 02:38:36 elite <ffffffff8805404e>{:reiser4:allocate_znode+334} <ffffffff8 8054157>{:reiser4:lock_parent_and_allocate_znode+167} Oct 6 02:38:36 elite <ffffffff88054554>{:reiser4:handle_pos_on_formatted+468} Oct 6 02:38:36 elite <ffffffff880518d9>{:reiser4:jnode_make_wander+9} <ffffffff 880545f9>{:reiser4:handle_pos_on_leaf+9} Oct 6 02:38:36 elite <ffffffff880553fd>{:reiser4:flush_current_atom+1981} Oct 6 02:38:36 elite <ffffffff8805133d>{:reiser4:txn_end+525} <ffffffff88051639 >{:reiser4:txn_restart+9} Oct 6 02:38:36 elite <ffffffff8805187c>{:reiser4:commit_some_atoms+140} <ffffff ff8805a2bd>{:reiser4:ktxnmgrd+349} Oct 6 02:38:36 elite <ffffffff8805a160>{:reiser4:ktxnmgrd+0} <ffffffff80230c2b> {kthread+219} Oct 6 02:38:36 elite <ffffffff802587ee>{child_rip+8} <ffffffff88021210>{:libata :ata_exec_command+0} Oct 6 02:38:36 elite <ffffffff80285e10>{keventd_create_kthread+0} <ffffffff8023 0b50>{kthread+0} Oct 6 02:38:36 elite <ffffffff802587e6>{child_rip+0} Oct 6 02:38:36 elite Oct 6 02:38:36 elite Code: 48 8b 50 38 8b 01 48 8b 4c 24 18 44 8b 41 28 44 39 c 0 76 2e Oct 6 02:38:36 elite RIP <ffffffff880808e7>{:reiser4:search_one_bitmap_forward+ 135} RSP <ffff81007f1f5888> Oct 6 02:38:36 elite CR2: 0000000000000038 At this point, the filesystem was basically dead. I could basically get things to work, so long as they were still cached. Any attempt to write data would hang, and I don't believe I could read anything, either, but I didn't try. Without being able to sync, I had to hard-boot it. Things seem to be working alright, but this does happen to my desktop occasionally. I can't reproduce it reliably. I used to be able to reliably reproduce a hard lockup -- no ssh, no keyboard, no mouse -- by simply playing certain games for a few hours, but that seems to have been fixed with a recently updated nVidia driver. There is a nearly identical box, acting as a server, running Reiser4, which simply never has problems. Currently has an uptime of 25 days, last rebooted for a kernel update. They are still different enough that it could be a hardware problem, or some system outside Reiser4 -- the desktop has Reiser4 compiled as a module, and uses it on a device-mapper RAID root.
signature.asc
Description: OpenPGP digital signature