On Sep 18, 2008 14:04 -0400, Ms. Megan Larko wrote:
> /dev/sdk1 6.3T 878G 5.1T 15% /srv/lustre/OST/crew8-OST0010
> /dev/sdk2 6.3T 891G 5.1T 15% /srv/lustre/OST/crew8-OST0011
>
> 25 UP osc crew8-OST000a-osc crew8-mdtlov_UUID 5
> 26 UP osc crew8-OST000b-osc crew8-mdtlov_UUID 5
>
> (NOTE: last two disks came in as crew8-OST000a and crew8-OST000b and
> not crew8-OST0010 and crew8-OST0011 respectively. I don't know if
> that has anything at all to do with my issue.)

Hmm, that is a bit strange, I don't know that I've seen this before.

> crew8-OST0003-osc-ffff81083ea5c400: Connection to service
> crew8-OST0003 via nid [EMAIL PROTECTED] was lost; in progress
> crew8-OST0003-osc-ffff81083ea5c400: Connection to service
> crew8-OST0003 via nid [EMAIL PROTECTED] was lost; in progress
>
> The MGS/MDS /var/log/messages reads:
> [EMAIL PROTECTED] ~]# tail /var/log/messages
> Sep 18 13:50:58 mds1 kernel: Lustre: crew8-OST0005-osc: Connection to
> service crew8-OST0005 via nid [EMAIL PROTECTED] was lost; in progress
>
> So---I am seeing that OSS4 is repeatedly losing its network contact
> with MGS/MDS machine mds1.

It is also losing connection to the crew01 client, I'd suspect some kind
of network problem (e.g. cable).

> I am guessing that I need to increase a lustre client timeout value
> for our o2ib connections for the new disk to not generate these
> messages (the /crewdat disk itself seems to be fine for user access).

This seems unlikely, unless you have a large cluster (e.g. 500+ clients).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
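
On the OST000a/OST000b naming in the quoted NOTE: one possible reading, not confirmed anywhere in this thread, is that the four-character suffix in a Lustre target name is the OST index written in hexadecimal. Under that assumption, 000a and 000b are simply indices 10 and 11, and the mount directories named crew8-OST0010 and crew8-OST0011 use the decimal spelling of the same numbers. The sketch below only illustrates that arithmetic; the helper name `ost_index` and the regular expression are hypothetical and are not part of any Lustre tool.

```python
import re

# Assumption: target names look like "<fsname>-OST<4 hex digits>",
# optionally followed by a suffix such as "-osc".
OST_NAME = re.compile(r"^(?P<fsname>[^-]+)-OST(?P<index>[0-9a-fA-F]{4})")


def ost_index(name: str) -> int:
    """Return the OST index encoded in a target name, read as hexadecimal."""
    match = OST_NAME.match(name)
    if match is None:
        raise ValueError(f"not an OST target name: {name!r}")
    return int(match.group("index"), 16)


if __name__ == "__main__":
    # Names taken from the quoted 'df' and device-list output.
    for name in ("crew8-OST000a-osc", "crew8-OST000b-osc",
                 "crew8-OST0010", "crew8-OST0011"):
        print(f"{name}: index {ost_index(name)}")
```

Read this way, a target actually named crew8-OST0010 would be index 16, not 10, so the names the disks "came in as" would be consistent with being the tenth and eleventh indices, and the naming would probably be unrelated to the connection losses.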
