On Mar 9, 2008, at 10:01 PM, Aaron Knister wrote:

> Hi! I have a few questions for you-
>
> 1. How many nodes was his job running on?

Around 64 serial jobs accessing the same directory (not the same files).

> 2. What version of lustre and linux kernel are you running on your
> servers/clients?

Lustre servers: 2.6.9-55.0.9.EL_lustre.1.6.4.1smp
Clients: 2.6.9-67.0.1.ELsmp

> 3. What ethernet module are you using on the servers/clients?

Most use the tg3, some use e1000.

> I honestly am not sure what the RPC errors mean but I've had
> similar issues caused by ethernet-level errors.

Over the weekend the MDS/MGS went into an unhealthy state, which forced a
reboot and fsck. When it came back up, the directory was accessible again
and the jobs started working again.

> -Aaron
>
> On Mar 7, 2008, at 6:45 PM, Brock Palen wrote:
>
>> On a file system that's been up for only 57 days, I have:
>>
>> 505 lustre-log. dumps.
>>
>> The problem at hand is that a user has many jobs that are now
>> hung trying to create a directory from his pbs script. On the
>> clients I see:
>>
>> LustreError: 11-0: an error occurred while communicating with
>> [EMAIL PROTECTED] The mds_connect operation failed with -16
>> LustreError: Skipped 2 previous similar messages
>>
>> On every client his jobs are on.
>>
>> In the most recent /tmp/lustre-log. on the MDS/MGS I see this
>> message:
>>
>> @@@ processing error (-16) [EMAIL PROTECTED] x12808293/t0 o38->[EMAIL PROTECTED]:-1
>> lens 304/200 ref 0 fl Interpret:/0/0 rc -16/0
>> ldlm_lib.c
>> target_handle_reconnect
>> nobackup-MDT0000: 34b4fbea-200b-1f7c-dac0-516b8ce786fc reconnecting
>> ldlm_lib.c
>> target_handle_connect
>> nobackup-MDT0000: refuse reconnection from 34b4fbea-200b-1f7c-[EMAIL PROTECTED]@tcp
>> to 0x00000100069a7000; still busy with 2 active RPCs
>> ldlm_lib.c
>> target_send_reply_msg
>> @@@ processing error (-16) [EMAIL PROTECTED] x11199816/t0 o38->[EMAIL PROTECTED]:-1
>> lens 304/200 ref 0 fl Interpret:/0/0 rc -16/0
>>
>> I see messages about active RPCs in other logs. What would
>> this mean? Is something stuck someplace?
>>
>> Brock Palen
>> Center for Advanced Computing
>> [EMAIL PROTECTED]
>> (734)936-1985
>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> [email protected]
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
> Aaron Knister
> Associate Systems Analyst
> Center for Ocean-Land-Atmosphere Studies
>
> (301) 595-7000
> [EMAIL PROTECTED]
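
For reference, the -16 in the mds_connect failure and in the "rc -16/0" log fields corresponds to Linux errno 16 (EBUSY), assuming Lustre follows the usual kernel convention of returning negated errno values. That would be consistent with the MDS refusing the reconnect because it is "still busy with 2 active RPCs". Below is a minimal Python sketch for pulling those return codes and active-RPC counts out of a log dump. The helper names `errno_name` and `summarize` and both regular expressions are hypothetical, written only for this illustration; the sample lines are copied from the log excerpt quoted in this thread, with the wrapped line rejoined.

```python
import errno
import os
import re

# Sample lines copied from the log excerpt quoted in this thread.
SAMPLE = """\
lens 304/200 ref 0 fl Interpret:/0/0 rc -16/0
nobackup-MDT0000: refuse reconnection from 34b4fbea-200b-1f7c-[EMAIL PROTECTED]@tcp to 0x00000100069a7000; still busy with 2 active RPCs
"""

# Assumed patterns, based only on the lines shown in the thread.
RC_RE = re.compile(r"\brc (-?\d+)/")
BUSY_RE = re.compile(r"still busy with (\d+) active RPCs")


def errno_name(rc):
    """Map a negated, kernel-style return code to its symbolic errno name."""
    return errno.errorcode.get(abs(rc), "UNKNOWN")


def summarize(text):
    """Return (set of errno names for non-zero rc values, list of active-RPC counts)."""
    codes = [int(m.group(1)) for m in RC_RE.finditer(text)]
    names = {errno_name(rc) for rc in codes if rc != 0}
    busy = [int(m.group(1)) for m in BUSY_RE.finditer(text)]
    return names, busy


if __name__ == "__main__":
    names, busy = summarize(SAMPLE)
    print("errno names seen:", names)
    print("active RPC counts:", busy)
    # Human-readable description of EBUSY on the local platform.
    print("EBUSY means:", os.strerror(errno.EBUSY))
```

Running `summarize` over a full /tmp/lustre-log. dump (after converting it to text) would show whether the refused reconnects always report the same number of active RPCs, which may help tell a stuck request from ordinary load.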
