On Wed, Jul 15, 2009 at 11:22:26AM -0400, Robin Humble wrote:
>On Wed, Jul 15, 2009 at 10:10:06AM -0400, Brian J. Murrell wrote:
>>On Wed, 2009-07-15 at 08:46 -0400, Robin Humble wrote:
>>> 
>>>   Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 
>>> 10.8.30....@o2ib failed: 5
>>>   Lustre: 13799:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 
>>> 10.8.30....@o2ib failed: 5
>>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30....@o2ib: 
>>> retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, 
>>> msg_size: 4096
>>>   Lustre: mgc10.8.30....@o2ib: Reactivating import
>>>   Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 
>>> 10.8.30....@o2ib failed: 5
>>>   Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 
>>> 10.8.30....@o2ib failed: 5
>>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30....@o2ib: 
>>> retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, 
>>> msg_size: 4096
>>>   Lustre: Client system-client has started
>>>   Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 
>>> 10.8.30....@o2ib failed: 5
>>>   ... last message repeated 17 times ...
>>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30....@o2ib: 
>>> retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, 
>>> msg_size: 4096
>>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30....@o2ib: 
>>> retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, 
>>> msg_size: 4096
>>>   Lustre: 13798:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 
>>> 10.8.30....@o2ib failed: 5
>>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30....@o2ib: 
>>> retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, 
>>> msg_size: 4096
>>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30....@o2ib: 
>>> retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, 
>>> msg_size: 4096
>>>   Lustre: 13797:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 
>>> 10.8.30....@o2ib failed: 5
>>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30....@o2ib: 
>>> retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, 
>>> msg_size: 4096
>>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30....@o2ib: 
>>> retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, 
>>> msg_size: 4096
>>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30....@o2ib: 
>>> retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, 
>>> msg_size: 4096
>>>   Lustre: 615:0:(o2iblnd_cb.c:2384:kiblnd_reconnect()) 10.8.30....@o2ib: 
>>> retrying (version negotiation), 12, 11, queue_dep: 8, max_frag: 256, 
>>> msg_size: 4096
>>>   Lustre: 13800:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 
>>> 10.8.30....@o2ib failed: 5
>>
>>> looks like it succeeds in the end, but only after a struggle.
>>Is it completely stable and performant after the struggle?  Do the error
>>messages stop?
>the fs's appear to be fine.

hmmm - actually, the fs's are _mostly_ fine... but sometimes i/o that
happens right after the above errors fails completely. after a few more
trials, this seems to happen about 40% of the time... :-/

eg. (+/- some missing characters from crappy IPMI SoL) you can see that
rsync has managed to list the files on the newly mounted lustre fs, but
then gets i/o errors when trying to copy the files off lustre to ramdisk ->

...
rsync: readlink "/mnt/lustre_system/lib64/libattr.so.1.1.0.1.0" failed: Input/output error (5)
rsync: readlink "/mnt/lustre_system/lib64/libc-2.5.so" faile(5)
rsync: readlink "/mnt/lustre_system/lib64/libcrypt-2.5.so" failed: Input/output error (5)
rsync: readlink "/mnt/lustre_system/lib64/libdevmapper-event.a.1.02" failed: Input/output error (5)
rsync: readlink "/mnt/lustre_system/lib64/libdevmapper-event.so.1.02" failed: Input/output error (5)
rsync: readlink "/mnt/lustre_system/lib64/libexpat.so.0.5.0" failed: Input/output error (5)
rsync: readlink "/mnt/lustre_system/lib6-2.0.a" failed: Input/output error (5)
rsync: readlink "/mnt/lustre_system/lib64/libgmodule-2.0.aject-2.0.a" failed: Input/output error (5)
...

so maybe lnet has renegotiated a connection to the MDS ok, but not to
the OSS's yet.
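
if that guess is right and the errors are only transient, one possible
workaround is to retry the failing call for a short while before giving up.
a minimal sketch, assuming the EIO goes away once the remaining connections
are up (not confirmed here); retry_on_eio, readlink_with_retry and the
attempts/delay values are made up for illustration and are not part of
lustre or rsync:

```python
import errno
import os
import time


def retry_on_eio(op, attempts=5, delay=2.0, sleep=time.sleep):
    """Call op() and retry only when it raises OSError with errno EIO.

    Hypothetical helper: attempts and delay are arbitrary illustrative
    values, not tuned against any real Lustre reconnect timing.
    """
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except OSError as exc:
            # Errors other than EIO, and the last failed attempt, propagate.
            if exc.errno != errno.EIO or attempt == attempts:
                raise
            sleep(delay)


def readlink_with_retry(path, **kwargs):
    """Resolve a symlink, retrying on Input/output error (5)."""
    return retry_on_eio(lambda: os.readlink(path), **kwargs)
```

this only papers over the symptom; it says nothing about why the
renegotiation is slow in the first place.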

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
