On Wed, Jul 15, 2009 at 11:22:26AM -0400, Robin Humble wrote:
>On Wed, Jul 15, 2009 at 10:10:06AM -0400, Brian J. Murrell wrote:
>>On Wed, 2009-07-15 at 08:46 -0400, Robin Humble wrote:
>>>
>>> Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
>>> 10.8.30....@o2ib failed: 5
>>> Lustre: 13799:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
>>> 10.8.30....@o2ib failed: 5
>>> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30....@o2ib:
>>> retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256,
>>> msg_size: 4096
>>> Lustre: mgc10.8.30....@o2ib: Reactivating import
>>> Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
>>> 10.8.30....@o2ib failed: 5
>>> Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
>>> 10.8.30....@o2ib failed: 5
>>> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30....@o2ib:
>>> retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256,
>>> msg_size: 4096
>>> Lustre: Client system-client has started
>>> Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
>>> 10.8.30....@o2ib failed: 5
>>> ... last message repeated 17 times ...
>>> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30....@o2ib:
>>> retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256,
>>> msg_size: 4096
>>> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30....@o2ib:
>>> retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256,
>>> msg_size: 4096
>>> Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
>>> 10.8.30....@o2ib failed: 5
>>> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30....@o2ib:
>>> retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256,
>>> msg_size: 4096
>>> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30....@o2ib:
>>> retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256,
>>> msg_size: 4096
>>> Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
>>> 10.8.30....@o2ib failed: 5
>>> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30....@o2ib:
>>> retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256,
>>> msg_size: 4096
>>> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30....@o2ib:
>>> retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256,
>>> msg_size: 4096
>>> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30....@o2ib:
>>> retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256,
>>> msg_size: 4096
>>> Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30....@o2ib:
>>> retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256,
>>> msg_size: 4096
>>> Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from
>>> 10.8.30....@o2ib failed: 5
>>
>>> looks like it succeeds in the end, but only after a struggle.
>>Is it completely stable and performant after the struggle? Do the error
>>messages stop?
>the fs's appear to be fine.

hmmm - actually, the fs's are _mostly_ fine... but sometimes i/o that
happens right after the above errors fails completely. after a few more
trials, this seems to happen about 40% of the time... :-/

eg. (+/- some missing characters from crappy IPMI SoL) you can see that
rsync has managed to list the files on the newly mounted lustre fs, but
then gets i/o errors when trying to copy the files off lustre to ramdisk ->

 ...
 rsync: readlink "/mnt/lustre_system/lib64/libattr.so.1.1.0.1.0" failed: Input/output error (5)
 rsync: readlink "/mnt/lustre_system/lib64/libc-2.5.so" faile(5)
 rsync: readlink "/mnt/lustre_system/lib64/libcrypt-2.5.so" failed: Input/output error (5)
 rsync: readlink "/mnt/lustre_system/lib64/libdevmapper-event.a.1.02" failed: Input/output error (5)
 rsync: readlink "/mnt/lustre_system/lib64/libdevmapper-event.so.1.02" failed: Input/output error (5)
 rsync: readlink "/mnt/lustre_system/lib64/libexpat.so.0.5.0" failed: Input/output error (5)
 rsync: readlink "/mnt/lustre_system/lib6-2.0.a" failed: Input/output error (5)
 rsync: readlink "/mnt/lustre_system/lib64/libgmodule-2.0.aject-2.0.a" failed: Input/output error (5)
 ...

so maybe lnet has renegotiated a connection to the MDS ok, but not to
the OSS's yet.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
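
A rough way to put numbers on the console output quoted at the top of this message is to count the two kinds of o2iblnd events in it. The sketch below is only an illustration, not a Lustre tool: the names `tally`, `EVENT`, `rx_failed` and `reconnect_retry` are made up for this example, and it assumes that a "last message repeated N times" line refers to the event logged immediately before it.

```python
import re
from collections import Counter

# One pattern per event of interest, plus the syslog-style repeat marker.
EVENT = re.compile(
    r"(?P<rx>kiblnd_rx_complete\(\)\)\s+Rx\s+from\s+\S+\s+failed:\s+\d+)"
    r"|(?P<retry>kiblnd_reconnect\(\)\)\s+\S+\s+retrying\s+\(version\s+negotiation\))"
    r"|last message repeated (?P<rep>\d+) times"
)


def tally(console_text):
    """Count Rx failures and reconnect retries in quoted console output."""
    # Collapse mail quote markers ('>', '>>>') and line wraps into single
    # spaces so that log entries split across lines become whole again.
    flat = re.sub(r"[>\s]+", " ", console_text)
    counts = Counter()
    last = None  # label of the most recent event, used for the repeat marker
    for m in EVENT.finditer(flat):
        if m.group("rep"):
            # Assumption: the repeat marker applies to the preceding event.
            if last is not None:
                counts[last] += int(m.group("rep"))
        else:
            last = "rx_failed" if m.group("rx") else "reconnect_retry"
            counts[last] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Usage: python tally.py < saved_console_output.txt
    for label, n in sorted(tally(sys.stdin.read()).items()):
        print(label, n)
```

Running this over console captures from several mount attempts would show whether the attempts where rsync later hits i/o errors also log noticeably more retries, which might help confirm or rule out the MDS-versus-OSS guess.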
