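Configure a netdump or netconsole server to capture the node's kernel messages over the network. When a node resets abruptly, its last kernel messages usually never reach /var/log/messages on the local disk, which would explain why nothing unusual appears in your logs before the reboot. A remote receiver will catch them.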
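For the receiving side, here is a minimal sketch in Python, assuming the node sends its console output as plain UDP datagrams to port 6666 (the default netconsole target port according to the kernel documentation). The function names open_listener and capture are my own illustrative choices and are not part of any OCFS2, netdump or netconsole tooling. A netcat or syslogd listener is the more common choice; this only shows how little the receiver has to do.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket for incoming console messages.

    Port 6666 is assumed here as the netconsole default target port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def capture(sock, logfile, max_datagrams=None):
    """Append each received datagram to logfile, prefixed with the sender's IP.

    Runs until max_datagrams have been received (forever if None) and
    returns the number of datagrams written.
    """
    count = 0
    while max_datagrams is None or count < max_datagrams:
        data, (sender_ip, _sender_port) = sock.recvfrom(65535)
        text = data.decode("utf-8", errors="replace").rstrip("\n")
        logfile.write(f"{sender_ip} {text}\n")
        # Flush after every message so the file is current if the
        # receiver itself is interrupted.
        logfile.flush()
        count += 1
    return count
```

To use it, run something like capture(open_listener(), open("/var/log/netconsole.log", "a")) on a machine that stays up when the node reboots; the log path is hypothetical. The sending side is configured separately on the node, and how that is done depends on the distribution, so check its netdump or netconsole documentation.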
Derek Hazell wrote:
>
> Dear OCFS2 forum
>
> We run ocfs2 version 1.2.9-1 as an ocfs2 cluster on four Linux servers
> running RHEL 4 (kernel: 2.6.9-42.0.2.ELs)
>
> We are getting unexpected reboots of one of the Linux servers and are
> wondering if the reboots are related to ocfs2 or not.
> We enable tracing of ocfs2 on the node we suspected would reboot
> # debugfs.ocfs2 -l SUPER allow
> # debugfs.ocfs2 -l HEARTBEAT ENTRY EXIT allow
> and then waited for the reboot to occur. A sample of log messages
> around the time of the reboot is included below. There are no strange
> ocfs2 messages in the /var/log/messages log file but I thought I would
> just check with your forum if you see anything strange.
>
> Can you confirm that ocfs2 version 1.2.9-1 is compatible with the
> Linux kernel : 2.6.9-42.0.2.ELs thanks. Also if ocfs2 fences a node
> can you confirm that a message is written to the /var/log/messages
> logfile noting that such fencing has occurred. Your responses may help
> us narrow down the cause
> Can you let us know if there are any particular logfiles we should
> check, or if there is anything we can do to confirm that ocfs2 is, or
> is not, the cause of these reboots.
>
> Appreciate any responses
>
> regards
> Derek Hazell | System Administrator
> #####################################################################
> APPENDIX 1 : REBOOT on Friday night (ocfs2 tracing running)
> Aug 15 21:00:52 Sysname kernel: (6885,0):dlm_mle_release:535 ENTRY:
> Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M000000000000000c5b1914dc72d356
> Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148 ENTRY:M000000000000000c5b1914dc72d356
> Aug 15 21:00:52 Sysname kernel: (6885,0):dlm_mle_release:535 ENTRY:
> Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M0000000000000009f1bbc95e1dad74
> Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148 ENTRY:M0000000000000009f1bbc95e1dad74
> Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M0000000000000009f1bbc95e1dad74
> Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148 ENTRY:M0000000000000009f1bbc95e1dad74
> Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M0000000000000009f1bbc95e1dad74
> Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148 ENTRY:M0000000000000009f1bbc95e1dad74
> Aug 15 21:00:52 Sysname kernel: (6885,0):dlm_mle_release:535 ENTRY:
> Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M000000000000000c5bc95ddc72d357
> Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148 ENTRY:M000000000000000c5bc95ddc72d357
> Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M000000000000000c5bc95ddc72d357
> Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148 ENTRY:M000000000000000c5bc95ddc72d357
> Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M000000000000000c5bc95ddc72d357
> Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148 ENTRY:M000000000000000c5bc95ddc72d357
> Aug 15 21:00:52 Sysname kernel: (6885,0):dlm_mle_release:535 ENTRY:
> Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M00000000000000049c73bf5e1d8e29
> Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres_full:148 ENTRY:M00000000000000049c73bf5e1d8e29
> Aug 15 21:00:52 Sysname kernel: (6885,0):__dlm_lookup_lockres:182 ENTRY:M00000000000000049c73bf5e1d8e29
> [UNEXPECTED REBOOT]
> Aug 15 21:05:09 Sysname syslogd 1.4.1: restart.
> Aug 15 21:05:09 Sysname syslog: syslogd startup succeeded
> Aug 15 21:05:09 Sysname kernel: klogd 1.4.1, log source = /proc/kmsg started.
> Aug 15 21:05:09 Sysname kernel: Bootdata ok (command line is ro root=/dev/VolGroup_ID_12182/LogVol1 rhgb quiet)
> Aug 15 21:05:09 Sysname kernel: Linux version 2.6.9-42.0.2.ELsmp ([EMAIL PROTECTED]) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Thu Aug 17 17:57:31 EDT 2006
> Aug 15 21:05:09 Sysname kernel: BIOS-provided physical RAM map:
> ######################################################################
> APPENDIX 2 : REBOOT on Saturday night (ocfs2 tracing NOT running)
> Aug 15 21:08:12 Sysname kernel: o2net: connected to node Othersystem2.x.y (num 1) at 172.16.172.172:7777
> Aug 15 21:08:13 Sysname kernel: o2net: accepted connection from node Othersystem1.x.y (num 3) at 172.16.172.171:7777
> Aug 15 21:08:16 Sysname kernel: OCFS2 1.2.9 Mon May 19 13:00:33 PDT 2008 (build a693806cb619dd7f225004092b675ede)
> Aug 15 21:08:16 Sysname kernel: ocfs2_dlm: Nodes in domain ("46C5D4A751514E55B04786DFEC7B2175"): 1 2 3
> Aug 15 21:08:17 Sysname kernel: kjournald starting. Commit interval 5 seconds
> Aug 15 21:08:17 Sysname kernel: ocfs2: Mounting device (120,1) on (node 2, slot 2)
> Aug 15 21:08:21 Sysname kernel: ocfs2_dlm: Nodes in domain ("0D29B3C9792B46E1BD0DFF0A97E03534"): 1 2 3
> Aug 15 21:08:21 Sysname kernel: kjournald starting. Commit interval 5 seconds
> Aug 15 21:08:21 Sysname kernel: ocfs2: Mounting device (120,17) on (node 2, slot 2)
> Aug 15 21:08:31 Sysname ntpd[7076]: synchronized to 172.16.32.254, stratum 2
> Aug 15 21:08:31 Sysname ntpd[7076]: kernel time sync disabled 0041
> Aug 15 21:08:38 Sysname su(pam_unix)[9656]: session opened for user digicol by root(uid=0)
> Aug 15 21:08:41 Sysname su(pam_unix)[9656]: session closed for user digicol
> Aug 15 21:13:52 Sysname ntpd[7076]: kernel time sync enabled 0001
> Aug 15 21:41:46 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
> Aug 15 21:41:46 Sysname kernel: end_request: I/O error, dev sdc, sector 1291272320
> Aug 15 21:41:46 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
> Aug 15 21:41:46 Sysname kernel: end_request: I/O error, dev sdc, sector 1487646848
> Aug 15 21:41:47 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
> Aug 15 21:41:47 Sysname kernel: end_request: I/O error, dev sdc, sector 1301852288
> Aug 15 21:41:48 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
> Aug 15 21:41:48 Sysname kernel: end_request: I/O error, dev sdc, sector 1498484864
> Aug 15 21:45:09 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
> Aug 15 21:45:09 Sysname kernel: end_request: I/O error, dev sdc, sector 1611251840
> Aug 15 21:45:09 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
> Aug 15 21:45:09 Sysname kernel: end_request: I/O error, dev sdc, sector 1045610624
> Aug 15 21:45:09 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
> Aug 15 21:45:09 Sysname kernel: end_request: I/O error, dev sdc, sector 1234243712
> Aug 15 21:45:09 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
> Aug 15 21:45:09 Sysname kernel: end_request: I/O error, dev sdc, sector 989614208
> Aug 15 21:45:09 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
> Aug 15 21:45:09 Sysname kernel: end_request: I/O error, dev sdc, sector 1115283584
> Aug 15 21:45:09 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
> Aug 15 21:45:09 Sysname kernel: end_request: I/O error, dev sdc, sector 1240952960
> Aug 15 21:45:14 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
> Aug 15 21:45:14 Sysname kernel: end_request: I/O error, dev sdc, sector 995807360
> Aug 15 21:45:14 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
> Aug 15 21:45:14 Sysname kernel: end_request: I/O error, dev sdc, sector 1104961664
> Aug 15 21:45:14 Sysname kernel: SCSI error : <1 0 2 1> return code = 0x20000
> Aug 15 21:45:14 Sysname kernel: end_request: I/O error, dev sdc, sector 1008507952
> Aug 16 03:00:26 Sysname Server Administrator: Storage Service EventID: 2242 The Patrol Read has started.: Controller 0 (PERC 5/i Integrated)
> Aug 16 03:00:27 Sysname snmpd[7589]: Got trap from peer on fd 13
> Aug 16 03:52:02 Sysname Server Administrator: Storage Service EventID: 2243 The Patrol Read has stopped.: Controller 0 (PERC 5/i Integrated)
> Aug 16 03:52:02 Sysname snmpd[7589]: Got trap from peer on fd 13
> Aug 16 16:38:33 Sysname sshd(pam_unix)[31901]: session opened for user root by root(uid=0)
> Aug 16 16:55:55 Sysname sshd(pam_unix)[32254]: session opened for user root by root(uid=0)
> Aug 16 17:27:06 Sysname sshd(pam_unix)[966]: session opened for user root by root(uid=0)
> [UNEXPECTED REBOOT]
> Aug 16 23:18:31 Sysname syslogd 1.4.1: restart.
> Aug 16 23:18:31 Sysname syslog: syslogd startup succeeded
> Aug 16 23:18:31 Sysname kernel: klogd 1.4.1, log source = /proc/kmsg started.
> Aug 16 23:18:31 Sysname kernel: Bootdata ok (command line is ro root=/dev/VolGroup_ID_12182/LogVol1 rhgb quiet)
> Aug 16 23:18:31 Sysname kernel: Linux version 2.6.9-42.0.2.ELsmp ([EMAIL PROTECTED]) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Thu Aug 17 17:57:31 EDT 2006
> Aug 16 23:18:31 Sysname kernel: BIOS-provided physical RAM map:
> #####################################################################
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Ocfs2-users mailing list
> [email protected]
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
