On Tue, 2010-12-21 at 12:43 +0530, Daniel Raj wrote: > Dec 19 11:42:26 cluster kernel: LustreError: > 23330:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error > (-19) r...@ffff810049702400 x1353488904620861/t0 o8-><?>@<?>:0/0 lens 368/0 > e 0 to 0 dl 1292739246 ref 1 fl Interpret:/0/0 rc -19/0 > Dec 19 11:42:26 cluster kernel: LustreError: > 23330:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 3 previous similar > messages > Dec 19 11:42:26 cluster kernel: LustreError: Skipped 3 previous similar > messages > Dec 19 11:44:05 cluster kernel: LustreError: 137-5: UUID 'cluster-ost8_UUID' > is not available for connect (no target) > Dec 19 11:44:05 cluster kernel: LustreError: > 23292:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error > (-19) r...@ffff810283dee000 x1353488904620989/t0 o8-><?>@<?>:0/0 lens 368/0 > e 0 to 0 dl 1292739345 ref 1 fl Interpret:/0/0 rc -19/0 > Dec 19 11:44:05 cluster kernel: LustreError: > 23292:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 3 previous similar > messages > Dec 19 11:44:05 cluster kernel: LustreError: Skipped 3 previous similar > messages
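(For reference, the `rc -19` in the messages above is a negative errno value, and decoding it shows why the connects are being refused. A quick check in Python, assuming a Linux host:)

```python
import errno
import os

# Lustre prints rc as a negative errno; decode it to see what the
# kernel is complaining about.  -19 is ENODEV ("No such device" on
# Linux), which matches "is not available for connect (no target)".
rc = -19
print(errno.errorcode[-rc], os.strerror(-rc))
```

That is, the clients are asking this server for OSTs it has not (yet) set up, which is harmless noise during startup/recovery.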
Below is a reboot. If this was a "spontaneous" (i.e. not induced by a system administrator) reboot, it was not caused by Lustre. Do you have some sort of heartbeat/failover software installed and running? Could that be what is causing your reboots? > Dec 19 11:48:11 cluster syslogd 1.4.1: restart. > Dec 19 11:48:11 cluster kernel: klogd 1.4.1, log source = /proc/kmsg > started. > Dec 19 11:48:11 cluster kernel: Linux version > 2.6.18-164.11.1.el5_lustre.1.8.3 (lbu...@x86-build-1) (gcc version 4.1.2 > 20080704 (Red Hat 4.1.2-46)) #1 SMP Fri Apr 9 18:00:39 MDT 2010 > Dec 19 11:48:11 cluster kernel: Command line: ro root=LABEL=/ rhgb quiet > irqpoll maxcpus=1 reset_devices memmap=exactmap > memmap=6...@0kmemmap=5276k@16384kmemmap=1251...@22300kelfcorehdr=147440k > memmap=32K#3144016K > Dec 19 11:48:11 cluster kernel: BIOS-provided physical RAM map: > Dec 19 11:48:11 cluster kernel: BIOS-e820: 0000000000010000 - > 000000000009ec00 (usable) > Dec 19 11:48:11 cluster kernel: BIOS-e820: 000000000009ec00 - > 00000000000a0000 (reserved) > Dec 19 11:48:11 cluster kernel: BIOS-e820: 0000000000100000 - > 00000000bfe54000 (usable) > Dec 19 11:48:11 cluster kernel: BIOS-e820: 00000000bfe54000 - > 00000000bfe5c000 (ACPI data) > Dec 19 11:48:11 cluster kernel: BIOS-e820: 00000000bfe5c000 - > 00000000bfe5d000 (usable) > Dec 19 11:48:11 cluster kernel: BIOS-e820: 00000000bfe5d000 - > 00000000c0000000 (reserved) > Dec 19 11:48:11 cluster kernel: BIOS-e820: 00000000e0000000 - > 00000000f0000000 (reserved) > Dec 19 11:48:11 cluster kernel: BIOS-e820: 00000000fec00000 - > 00000000fed00000 (reserved) > Dec 19 11:48:11 cluster kernel: BIOS-e820: 00000000fee00000 - > 00000000fee10000 (reserved) > Dec 19 11:48:11 cluster kernel: BIOS-e820: 00000000ffc00000 - > 0000000100000000 (reserved) > Dec 19 11:48:11 cluster kernel: BIOS-e820: 0000000100000000 - > 000000043ffff000 (usable) > Dec 19 11:48:11 cluster kernel: user-defined physical RAM map: > Dec 19 11:48:11 cluster kernel: user: 
0000000000000000 - 00000000000a0000 > (usable) > Dec 19 11:48:12 cluster kernel: user: 0000000001000000 - 0000000001527000 > (usable) > Dec 19 11:48:12 cluster kernel: user: 00000000015c7000 - 0000000008ffc000 > (usable) > Dec 19 11:48:12 cluster kernel: user: 00000000bfe54000 - 00000000bfe5c000 > (ACPI data) > ... [reboot continues] > Dec 19 11:51:25 cluster init: Switching to runlevel: 6 The above is a manual shutdown. Somebody told the system to shut down at this point. > Dec 19 11:51:27 cluster rpc.statd[5886]: Caught signal 15, un-registering > and exiting. > Dec 19 11:51:27 cluster multipathd: --------shut down------- > Dec 19 11:51:27 cluster auditd[5740]: The audit daemon is exiting. > Dec 19 11:51:27 cluster kernel: audit(1292739687.831:17): audit_pid=0 > old=5740 by auid=4294967295 > Dec 19 11:51:27 cluster kernel: Kernel logging (proc) stopped. > Dec 19 11:51:27 cluster kernel: Kernel log daemon terminating. > Dec 19 11:51:29 cluster exiting on signal 15 Below is a new system restart as a result of the above manual shutdown. > Dec 19 11:55:48 cluster syslogd 1.4.1: restart. > Dec 19 11:55:48 cluster kernel: klogd 1.4.1, log source = /proc/kmsg > started. 
> Dec 19 11:55:48 cluster kernel: Linux version > 2.6.18-164.11.1.el5_lustre.1.8.3 (lbu...@x86-build-1) (gcc version 4.1.2 > 20080704 (Red Hat 4.1.2-46)) #1 SMP Fri Apr 9 18:00:39 MDT 2010 > Dec 19 11:55:48 cluster kernel: Command line: ro root=LABEL=/ rhgb quiet > crashkernel=1...@16m > Dec 19 11:55:48 cluster kernel: BIOS-provided physical RAM map: > Dec 19 11:55:48 cluster kernel: BIOS-e820: 0000000000010000 - > 000000000009ec00 (usable) > Dec 19 11:55:48 cluster kernel: BIOS-e820: 000000000009ec00 - > 00000000000a0000 (reserved) > Dec 19 11:55:48 cluster kernel: BIOS-e820: 00000000000f0000 - > 0000000000100000 (reserved) > Dec 19 11:55:48 cluster kernel: BIOS-e820: 0000000000100000 - > 00000000bfe54000 (usable) > Dec 19 11:55:48 cluster kernel: BIOS-e820: 00000000bfe54000 - > 00000000bfe5c000 (ACPI data) > Dec 19 11:55:48 cluster kernel: BIOS-e820: 00000000bfe5c000 - > 00000000bfe5d000 (usable) > Dec 19 11:55:48 cluster kernel: BIOS-e820: 00000000bfe5d000 - > 00000000c0000000 (reserved) > Dec 19 11:55:48 cluster kernel: BIOS-e820: 00000000e0000000 - > 00000000f0000000 (reserved) > Dec 19 11:55:48 cluster kernel: BIOS-e820: 00000000fec00000 - > 00000000fed00000 (reserved) > Dec 19 11:55:49 cluster kernel: BIOS-e820: 00000000fee00000 - > 00000000fee10000 (reserved) > Dec 19 11:55:49 cluster kernel: BIOS-e820: 00000000ffc00000 - > 0000000100000000 (reserved) > Dec 19 11:55:49 cluster kernel: BIOS-e820: 0000000100000000 - > 000000043ffff000 (usable) > ... [continued system boot] Below, Lustre starts up... > Dec 19 19:49:37 cluster kernel: ldiskfs created from ext3-2.6-rhel5 > Dec 19 19:49:39 cluster kernel: kjournald starting. Commit interval 5 > seconds > Dec 19 19:49:39 cluster kernel: LDISKFS-fs warning: maximal mount count > reached, running e2fsck is recommended > Dec 19 19:49:39 cluster kernel: LDISKFS FS on cciss/c3d1, internal journal > Dec 19 19:49:39 cluster kernel: LDISKFS-fs: recovery complete. 
> Dec 19 19:49:39 cluster kernel: LDISKFS-fs: mounted filesystem with ordered > data mode. > Dec 19 19:49:39 cluster kernel: kjournald starting. Commit interval 5 > seconds > Dec 19 19:49:39 cluster kernel: LDISKFS-fs warning: maximal mount count > reached, running e2fsck is recommended > Dec 19 19:49:39 cluster kernel: LDISKFS FS on cciss/c3d1, internal journal > Dec 19 19:49:39 cluster kernel: LDISKFS-fs: mounted filesystem with ordered > data mode. > Dec 19 19:49:39 cluster kernel: LDISKFS-fs: file extents enabled > Dec 19 19:49:39 cluster kernel: LDISKFS-fs: mballoc enabled > Dec 19 19:49:39 cluster kernel: Lustre: mgc172.22.0....@o2ib: Reactivating > import > Dec 19 19:49:39 cluster kernel: LustreError: 137-5: UUID > 'cluster-ost27_UUID' is not available for connect (no target) > Dec 19 19:49:39 cluster kernel: LustreError: > 30071:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error > (-19) r...@ffff81041c57bc00 x1355357749683344/t0 o8-><?>@<?>:0/0 lens 368/0 > e 0 to 0 dl 1292768479 ref 1 fl Interpret:/0/0 rc -19/0 > Dec 19 19:49:39 cluster kernel: Lustre: Filtering OBD driver; > http://www.lustre.org/ > Dec 19 19:49:40 cluster kernel: Lustre: > 30338:0:(filter.c:990:filter_init_server_data()) RECOVERY: service > dan3-OST0006, 267 recoverable clients, 0 delayed clients, last_rcvd > 120259807539 > Dec 19 19:49:40 cluster kernel: LustreError: 137-5: UUID > 'cluster-ost27_UUID' is not available for connect (not set up) > Dec 19 19:49:40 cluster kernel: LustreError: Skipped 3 previous similar > messages > Dec 19 19:49:40 cluster kernel: LustreError: > 30072:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error > (-19) r...@ffff81041dab7400 x1355357749628836/t0 o8-><?>@<?>:0/0 lens 368/0 > e 0 to 0 dl 1292768480 ref 1 fl Interpret:/0/0 rc -19/0 > Dec 19 19:49:40 cluster kernel: LustreError: > 30072:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 3 previous similar > messages > Dec 19 19:49:40 cluster kernel: Lustre: dan3-OST0006: Now 
serving > dan3-OST0006 on /etc/dan/luns/lun72 with recovery enabled > Dec 19 19:49:40 cluster kernel: Lustre: dan3-OST0006: Will be in recovery > for at least 5:00, or until 267 clients reconnect > Dec 19 19:49:40 cluster kernel: Lustre: dan3-OST0006.ost: set parameter > quota_type=ug2 > Dec 19 19:49:40 cluster kernel: Lustre: > 30094:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) dan3-OST0006: 266 > recoverable clients remain > Dec 19 19:49:40 cluster kernel: Lustre: > 30095:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) dan3-OST0006: 265 > recoverable clients remain > Dec 19 19:49:40 cluster kernel: LustreError: 137-5: UUID > 'cluster-ost28_UUID' is not available for connect (no target) > Dec 19 19:49:40 cluster kernel: LustreError: > 30133:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error > (-19) r...@ffff81041b42e000 x1353539538627603/t0 o8-><?>@<?>:0/0 lens 368/0 > e 0 to 0 dl 1292768480 ref 1 fl Interpret:/0/0 rc -19/0 > Dec 19 19:49:40 cluster kernel: LustreError: > 30133:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 25 previous > similar messages > Dec 19 19:49:40 cluster kernel: Lustre: > 30127:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) dan3-OST0006: 261 > recoverable clients remain > Dec 19 19:49:40 cluster kernel: Lustre: > 30127:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) Skipped 3 > previous similar messages > Dec 19 19:49:40 cluster kernel: LustreError: Skipped 28 previous similar > messages > Dec 19 19:49:41 cluster kernel: LustreError: 137-5: UUID > 'cluster-ost28_UUID' is not available for connect (no target) > Dec 19 19:49:41 cluster kernel: LustreError: > 30165:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error > (-19) r...@ffff81041e575050 x1353498440841407/t0 o8-><?>@<?>:0/0 lens 368/0 > e 0 to 0 dl 1292768481 ref 1 fl Interpret:/0/0 rc -19/0 > Dec 19 19:49:41 cluster kernel: LustreError: > 30165:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 50 previous > similar messages > 
Dec 19 19:49:41 cluster kernel: Lustre: > 30142:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) dan3-OST0006: 244 > recoverable clients remain > Dec 19 19:49:41 cluster kernel: Lustre: > 30142:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) Skipped 16 > previous similar messages > Dec 19 19:49:41 cluster kernel: LustreError: Skipped 50 previous similar > messages > Dec 19 19:49:43 cluster kernel: LustreError: 137-5: UUID > 'cluster-ost28_UUID' is not available for connect (no target) > Dec 19 19:49:43 cluster kernel: LustreError: > 30171:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error > (-19) r...@ffff81041fd6e050 x1355357754857631/t0 o8-><?>@<?>:0/0 lens 368/0 > e 0 to 0 dl 1292768483 ref 1 fl Interpret:/0/0 rc -19/0 > Dec 19 19:49:43 cluster kernel: LustreError: > 30171:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 41 previous > similar messages > Dec 19 19:49:43 cluster kernel: Lustre: > 30071:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) dan3-OST0006: 230 > recoverable clients remain > Dec 19 19:49:43 cluster kernel: Lustre: > 30071:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) Skipped 13 > previous similar messages > Dec 19 19:49:43 cluster kernel: LustreError: Skipped 41 previous similar > messages > Dec 19 19:49:48 cluster kernel: LustreError: 137-5: UUID > 'cluster-ost28_UUID' is not available for connect (no target) > Dec 19 19:49:48 cluster kernel: LustreError: > 30192:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error > (-19) r...@ffff81041b293800 x1353488992619011/t0 o8-><?>@<?>:0/0 lens 368/0 > e 0 to 0 dl 1292768488 ref 1 fl Interpret:/0/0 rc -19/0 > Dec 19 19:49:48 cluster kernel: LustreError: > 30192:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 45 previous > similar messages > Dec 19 19:49:48 cluster kernel: Lustre: > 30077:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) dan3-OST0006: 214 > recoverable clients remain > Dec 19 19:49:48 cluster kernel: Lustre: > 
30077:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) Skipped 15 > previous similar messages > Dec 19 19:49:48 cluster kernel: LustreError: Skipped 45 previous similar > messages > Dec 19 19:49:52 cluster kernel: kjournald starting. Commit interval 5 > seconds > Dec 19 19:49:52 cluster kernel: LDISKFS-fs warning: maximal mount count > reached, running e2fsck is recommended > Dec 19 19:49:52 cluster kernel: LDISKFS FS on cciss/c3d3, internal journal > Dec 19 19:49:52 cluster kernel: LDISKFS-fs: recovery complete. > Dec 19 19:49:52 cluster kernel: LDISKFS-fs: mounted filesystem with ordered > data mode. > Dec 19 19:49:52 cluster kernel: kjournald starting. Commit interval 5 > seconds > Dec 19 19:49:52 cluster kernel: LDISKFS-fs warning: maximal mount count > reached, running e2fsck is recommended > Dec 19 19:49:52 cluster kernel: LDISKFS FS on cciss/c3d3, internal journal > Dec 19 19:49:52 cluster kernel: LDISKFS-fs: mounted filesystem with ordered > data mode. > Dec 19 19:49:52 cluster kernel: LDISKFS-fs: file extents enabled > Dec 19 19:49:52 cluster kernel: LDISKFS-fs: mballoc enabled > Dec 19 19:49:52 cluster kernel: Lustre: > 30375:0:(filter.c:990:filter_init_server_data()) RECOVERY: service > dan3-OST0007, 267 recoverable clients, 0 delayed clients, last_rcvd > 98786412422 > Dec 19 19:49:52 cluster kernel: Lustre: dan3-OST0007: Now serving > dan3-OST0007 on /etc/dan/luns/lun74 with recovery enabled > Dec 19 19:49:52 cluster kernel: Lustre: dan3-OST0007: Will be in recovery > for at least 5:00, or until 267 clients reconnect > Dec 19 19:49:52 cluster kernel: Lustre: dan3-OST0007.ost: set parameter > quota_type=ug2 > Dec 19 19:49:56 cluster kernel: LustreError: 137-5: UUID > 'cluster-ost37_UUID' is not available for connect (no target) > Dec 19 19:49:56 cluster kernel: LustreError: > 30090:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error > (-19) r...@ffff810417b49400 x1355357743323540/t0 o8-><?>@<?>:0/0 lens 368/0 > e 0 to 0 dl 1292768496 ref 
1 fl Interpret:/0/0 rc -19/0 > Dec 19 19:49:56 cluster kernel: LustreError: > 30090:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 141 previous > similar messages > Dec 19 19:49:56 cluster kernel: Lustre: > 30160:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) dan3-OST0007: 233 > recoverable clients remain > Dec 19 19:49:56 cluster kernel: Lustre: > 30160:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) Skipped 91 > previous similar messages > Dec 19 19:49:56 cluster kernel: LustreError: Skipped 140 previous similar > messages > Dec 19 19:49:59 cluster kernel: kjournald starting. Commit interval 5 > seconds > Dec 19 19:49:59 cluster kernel: LDISKFS-fs warning: maximal mount count > reached, running e2fsck is recommended > Dec 19 19:49:59 cluster kernel: LDISKFS FS on cciss/c4d1, internal journal > Dec 19 19:49:59 cluster kernel: LDISKFS-fs: recovery complete. > Dec 19 19:49:59 cluster kernel: LDISKFS-fs: mounted filesystem with ordered > data mode. > Dec 19 19:49:59 cluster kernel: LustreError: > 30046:0:(filter_log.c:135:filter_cancel_cookies_cb()) error cancelling log > cookies: rc = -19 > Dec 19 19:49:59 cluster kernel: kjournald starting. Commit interval 5 > seconds > Dec 19 19:49:59 cluster kernel: LDISKFS-fs warning: maximal mount count > reached, running e2fsck is recommended > Dec 19 19:49:59 cluster kernel: LDISKFS FS on cciss/c4d1, internal journal > Dec 19 19:49:59 cluster kernel: LDISKFS-fs: mounted filesystem with ordered > data mode. 
> Dec 19 19:49:59 cluster kernel: LDISKFS-fs: file extents enabled > Dec 19 19:49:59 cluster kernel: LDISKFS-fs: mballoc enabled > Dec 19 19:49:59 cluster kernel: Lustre: > 30549:0:(filter.c:990:filter_init_server_data()) RECOVERY: service > dan4-OST0006, 260 recoverable clients, 0 delayed clients, last_rcvd > 120259121660 > Dec 19 19:49:59 cluster kernel: Lustre: dan4-OST0006: Now serving > dan4-OST0006 on /etc/dan/luns/lun76 with recovery enabled > Dec 19 19:49:59 cluster kernel: Lustre: dan4-OST0006: Will be in recovery > for at least 5:00, or until 260 clients reconnect > Dec 19 19:49:59 cluster kernel: Lustre: dan4-OST0006.ost: set parameter > quota_type=ug2 > Dec 19 19:50:11 cluster kernel: kjournald starting. Commit interval 5 > seconds > Dec 19 19:50:11 cluster kernel: LDISKFS-fs warning: maximal mount count > reached, running e2fsck is recommended > Dec 19 19:50:11 cluster kernel: LDISKFS FS on cciss/c4d3, internal journal > Dec 19 19:50:11 cluster kernel: LDISKFS-fs: recovery complete. > Dec 19 19:50:11 cluster kernel: LDISKFS-fs: mounted filesystem with ordered > data mode. 
> Dec 19 19:50:12 cluster kernel: LustreError: 137-5: UUID > 'cluster-ost38_UUID' is not available for connect (no target) > Dec 19 19:50:12 cluster kernel: LustreError: Skipped 163 previous similar > messages > Dec 19 19:50:12 cluster kernel: Lustre: > 30438:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) dan4-OST0006: 184 > recoverable clients remain > Dec 19 19:50:12 cluster kernel: Lustre: > 30438:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) Skipped 266 > previous similar messages > Dec 19 19:50:12 cluster kernel: LustreError: > 30192:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error > (-19) r...@ffff81041999b400 x1353744256940738/t0 o8-><?>@<?>:0/0 lens 368/0 > e 0 to 0 dl 1292768512 ref 1 fl Interpret:/0/0 rc -19/0 > Dec 19 19:50:12 cluster kernel: LustreError: > 30192:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 164 previous > similar messages > Dec 19 19:50:12 cluster kernel: kjournald starting. Commit interval 5 > seconds > Dec 19 19:50:12 cluster kernel: LDISKFS-fs warning: maximal mount count > reached, running e2fsck is recommended > Dec 19 19:50:12 cluster kernel: LDISKFS FS on cciss/c4d3, internal journal > Dec 19 19:50:12 cluster kernel: LDISKFS-fs: mounted filesystem with ordered > data mode. 
> Dec 19 19:50:12 cluster kernel: LDISKFS-fs: file extents enabled > Dec 19 19:50:12 cluster kernel: LDISKFS-fs: mballoc enabled > Dec 19 19:50:12 cluster kernel: Lustre: > 30887:0:(filter.c:990:filter_init_server_data()) RECOVERY: service > dan4-OST0007, 260 recoverable clients, 0 delayed clients, last_rcvd > 98784319061 > Dec 19 19:50:12 cluster kernel: Lustre: dan4-OST0007: Now serving > dan4-OST0007 on /etc/dan/luns/lun78 with recovery enabled > Dec 19 19:50:12 cluster kernel: Lustre: dan4-OST0007: Will be in recovery > for at least 5:00, or until 260 clients reconnect > Dec 19 19:50:12 cluster kernel: Lustre: dan4-OST0007.ost: set parameter > quota_type=ug2 > Dec 19 19:50:45 cluster kernel: Lustre: > 30157:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) dan3-OST0006: 35 > recoverable clients remain > Dec 19 19:50:45 cluster kernel: Lustre: > 30157:0:(ldlm_lib.c:1788:target_queue_last_replay_reply()) Skipped 507 > previous similar messages > Dec 19 19:54:23 cluster kernel: Lustre: > 30391:0:(ldlm_lib.c:575:target_handle_reconnect()) dan3-OST0006: > 50459cf1-b8b1-640b-55b7-33ab3dd1aef9 reconnecting > Dec 19 19:54:23 cluster kernel: Lustre: > 30391:0:(ldlm_lib.c:875:target_handle_connect()) dan3-OST0006: refuse > reconnection from [email protected]@o2ib to > 0xffff81041d3c4c00; still busy with 1 active RPCs > Dec 19 19:54:23 cluster kernel: LustreError: > 30391:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@ processing error > (-16) r...@ffff810408887400 x1355357749617338/t0 > o8->50459cf1-b8b1-640b-55b7-33ab3dd1a...@net_0x50000ac16003a_uuid:0/0 lens > 368/264 e 0 to 0 dl 1292768763 ref 1 fl Interpret:/0/0 rc -16/0 > Dec 19 19:54:23 cluster kernel: LustreError: > 30391:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 1 previous similar > message > Dec 19 19:54:23 cluster kernel: Lustre: > 30924:0:(ldlm_lib.c:575:target_handle_reconnect()) dan3-OST0006: > 3db07881-c6e4-0cc9-a297-6b9e7eb56ce3 reconnecting > Dec 19 19:54:23 cluster kernel: Lustre: > 
30924:0:(ldlm_lib.c:875:target_handle_connect()) dan3-OST0006: refuse > reconnection from [email protected]@o2ib to > 0xffff81041b911400; still busy with 1 active RPCs > Dec 19 19:54:23 cluster kernel: Lustre: > 30162:0:(ldlm_lib.c:575:target_handle_reconnect()) dan3-OST0006: > 1ce3ef8f-6a6d-192a-969e-f72160e11710 reconnecting > Dec 19 19:54:23 cluster kernel: Lustre: > 30162:0:(ldlm_lib.c:575:target_handle_reconnect()) Skipped 2 previous > similar messages > Dec 19 19:54:23 cluster kernel: Lustre: > 30162:0:(ldlm_lib.c:875:target_handle_connect()) dan3-OST0006: refuse > reconnection from [email protected]@o2ib to > 0xffff81041d3c4200; still busy with 1 active RPCs > Dec 19 19:54:23 cluster kernel: Lustre: > 30162:0:(ldlm_lib.c:875:target_handle_connect()) Skipped 2 previous similar > messages > Dec 19 19:54:40 cluster kernel: Lustre: dan3-OST0006: Recovery period over > after 5:00, of 267 clients 266 recovered and 1 was evicted. > Dec 19 19:54:40 cluster kernel: Lustre: dan3-OST0006: sending delayed > replies to recovered clients > Dec 19 19:54:40 cluster kernel: Lustre: dan3-OST0006: received MDS > connection from 172.22.0....@o2ib > Dec 19 19:54:48 cluster kernel: Lustre: > 30161:0:(ldlm_lib.c:575:target_handle_reconnect()) dan3-OST0006: > 50459cf1-b8b1-640b-55b7-33ab3dd1aef9 reconnecting > Dec 19 19:54:48 cluster kernel: Lustre: > 30161:0:(ldlm_lib.c:575:target_handle_reconnect()) Skipped 2 previous > similar messages > Dec 19 19:54:52 cluster kernel: Lustre: dan3-OST0007: Recovery period over > after 5:00, of 267 clients 266 recovered and 1 was evicted. > Dec 19 19:54:52 cluster kernel: Lustre: dan3-OST0007: sending delayed > replies to recovered clients > Dec 19 19:54:52 cluster kernel: Lustre: dan3-OST0007: received MDS > connection from 172.22.0....@o2ib > Dec 19 19:55:00 cluster kernel: Lustre: dan4-OST0006: Recovery period over > after 5:00, of 260 clients 259 recovered and 1 was evicted. 
> Dec 19 19:55:00 cluster kernel: Lustre: dan4-OST0006: sending delayed > replies to recovered clients > Dec 19 19:55:00 cluster kernel: Lustre: dan4-OST0006: received MDS > connection from 172.22.0....@o2ib > Dec 19 19:55:12 cluster kernel: Lustre: dan4-OST0007: Recovery period over > after 5:00, of 260 clients 259 recovered and 1 was evicted. > Dec 19 19:55:12 cluster kernel: Lustre: dan4-OST0007: sending delayed > replies to recovered clients > Dec 19 19:55:12 cluster kernel: Lustre: dan4-OST0007: received MDS > connection from 172.22.0....@o2ib And Lustre continues to start up. So I saw a total of two reboots. One was initiated by an operator, and one looked spontaneous, like some kind of power-outage event or a "STONITH" by some heartbeat/failover software; in either case, not Lustre. b.
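One way to make that "two reboots, one clean, one spontaneous" reading mechanical is to scan the log for syslogd restarts and check whether shutdown markers ("Switching to runlevel", "exiting on signal 15") appear just before them. A minimal sketch, using made-up sample lines taken from the excerpts above (in practice you would read /var/log/messages on the server):

```python
import re

# Hypothetical sample lines condensed from the log excerpts above.
sample = """\
Dec 19 11:48:11 cluster syslogd 1.4.1: restart.
Dec 19 11:51:25 cluster init: Switching to runlevel: 6
Dec 19 11:51:29 cluster exiting on signal 15
Dec 19 11:55:48 cluster syslogd 1.4.1: restart.
"""

marker = re.compile(r"syslogd .* restart|Switching to runlevel|exiting on signal 15")
clean_shutdown_seen = False
for line in sample.splitlines():
    if not marker.search(line):
        continue
    if "restart" in line:
        # A syslogd restart with no preceding shutdown markers is suspect
        # (crash, power loss, or a STONITH by failover software).
        kind = "clean reboot" if clean_shutdown_seen else "possibly spontaneous"
        print(f"{line.split(' cluster ')[0]}: syslogd restart ({kind})")
        clean_shutdown_seen = False
    else:
        clean_shutdown_seen = True
```

Run against these samples, it flags the 11:48 restart as possibly spontaneous and the 11:55 restart as a clean reboot, matching the reading above.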
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
