Hello! I don't think it is a timeout issue any longer. The timeout value is the same for all of the Lustre file systems mounted via our MGS/MDS system, and the value is rather high: it is currently 1000. I got the value from "cat /proc/sys/lustre/timeout" on the MGS/MDS box.
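(For reference, a minimal sketch of reading that value programmatically; the path is the one quoted above, and the 100 s fallback is Lustre's historical default obd_timeout, not something taken from this thread:)

```python
# Sketch: read the cluster-wide Lustre RPC timeout, as shown by
# "cat /proc/sys/lustre/timeout" on the MGS/MDS.
# The 100 s fallback (Lustre's old default obd_timeout) is an
# assumption for machines where the proc file does not exist.
def read_lustre_timeout(path="/proc/sys/lustre/timeout", default=100):
    try:
        with open(path) as f:
            return int(f.read().strip())
    except OSError:
        return default

print(read_lustre_timeout())
```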
I changed the IB cable on the problem box, using the same IB card, PCI slot and slot on the IB SilverStorm switch. The errors I now see on the clients are the same, but on the server OSS for crew8-OST0000 through crew8-OST0011 they are:

ib0: multicast join failed for ff12:401b:ffff:0000:0000:0000:ffff:ffff, status -22
LustreError: 4346:0:(filter.c:2674:filter_destroy_precreated())
LustreError: 4486:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error -107

Perhaps it could be the IB card? It is a Mellanox Technologies MT25204 [InfiniHost III Lx HCA]. This is the same card as in many, but not all, of our other systems. I can try a new IB card on Monday.

On the OSS, the following lines repeat every two minutes (from /var/log/messages):

Sep 20 22:20:32 oss4 kernel: LustreError: 3775:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at 1221963532, 100s ago) [EMAIL PROTECTED] x46975/t0 o250->[EMAIL PROTECTED]@o2ib_0:26 lens 304/328 ref 1 fl Rpc:/0/0 rc 0/-22
Sep 20 22:20:32 oss4 kernel: LustreError: 3775:0:(client.c:975:ptlrpc_expire_one_request()) Skipped 5 previous similar messages

Thank you,
megan

On Sat, Sep 20, 2008 at 6:23 PM, Andreas Dilger <[EMAIL PROTECTED]> wrote:
> On Sep 18, 2008 14:04 -0400, Ms. Megan Larko wrote:
>> /dev/sdk1  6.3T  878G  5.1T  15%  /srv/lustre/OST/crew8-OST0010
>> /dev/sdk2  6.3T  891G  5.1T  15%  /srv/lustre/OST/crew8-OST0011
>>
>> 25 UP osc crew8-OST000a-osc crew8-mdtlov_UUID 5
>> 26 UP osc crew8-OST000b-osc crew8-mdtlov_UUID 5
>>
>> (NOTE: the last two disks came in as crew8-OST000a and crew8-OST000b,
>> not crew8-OST0010 and crew8-OST0011 respectively. I don't know if
>> that has anything at all to do with my issue.)
>
> Hmm, that is a bit strange, I don't know that I've seen this before.
>
>> crew8-OST0003-osc-ffff81083ea5c400: Connection to service
>> crew8-OST0003 via nid [EMAIL PROTECTED] was lost; in progress
>> crew8-OST0003-osc-ffff81083ea5c400: Connection to service
>> crew8-OST0003 via nid [EMAIL PROTECTED] was lost; in progress
>>
>> The MGS/MDS /var/log/messages reads:
>> [EMAIL PROTECTED] ~]# tail /var/log/messages
>> Sep 18 13:50:58 mds1 kernel: Lustre: crew8-OST0005-osc: Connection to
>> service crew8-OST0005 via nid [EMAIL PROTECTED] was lost; in progress
>>
>> So---I am seeing that OSS4 is repeatedly losing its network contact
>> with MGS/MDS machine mds1.
>
> It is also losing its connection to the crew01 client, so I'd suspect some
> kind of network problem (e.g. a cable).
>
>> I am guessing that I need to increase a Lustre client timeout value
>> for our o2ib connections so that the new disk does not generate these
>> messages (the /crewdat disk itself seems to be fine for user access).
>
> This seems unlikely, unless you have a large cluster (e.g. 500+ clients).
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
> _______________________________________________
> Lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
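(As an aside: the numeric codes in the logs above are negated Linux errno values, which makes them quick to decode. A small sketch using Python's standard errno table; the interpretation in the comments assumes a Linux errno layout:)

```python
import errno
import os

# Decode the negative status codes seen in the Lustre/IB logs above.
# On Linux, 22 is EINVAL ("Invalid argument"), seen in both the ib0
# multicast join failure and the "rc 0/-22" RPC result; 107 is ENOTCONN
# ("Transport endpoint is not connected"), from target_send_reply_msg().
for code in (22, 107):
    print(f"-{code}: {errno.errorcode[code]} ({os.strerror(code)})")
```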
