On Mar 12, 2007 15:44 -0800, Jan H. Julian wrote:
> Thanks for the reply. I think that you are referring to the lustre
> server and client rather than the mpi server (mother superior) and
> the clients in your response below.
Correct.

> We are seeing many messages of this type which correspond to the job
> being ended in PBS and a message sent to the job std out "FATAL from
> PE 34: open_param_file: Input file INPUT/HIM_input does not exist."
> I think you are saying that the lustre server is unresponsive, but
> would like to confirm. We're not seeing many other messages which we
> can tie to the job exit.

In particular, this Lustre client is reporting that while waiting for a
lock enqueue to OST index #21 the process was killed. That would lead
investigation to whatever node OST index #21 is on to verify its health.

> At 10:02 AM +0800 3/4/07, Andreas Dilger wrote:
> >On Mar 03, 2007 15:57 -0900, Jan H. Julian wrote:
> >> We are starting to investigate extremely slow performance on one of
> >> our test jobs using lustre.1.4.7.1 and have encountered the following
> >> error message in the job output:
> >>
> >> >Mar 3 09:45:52 mt006 kernel: LustreError:
> >> >2463:0:(lov_request.c:180:lov_update_enqueue_set()) error: enqueue
> >> >objid 0x3920c71 subobj 0x101fe on OST idx 21: rc = -4
> >> Mar 3 09:45:52 mt006 kernel: LustreError:
> >> 2463:0:(lov_request.c:180:lov_update_enqueue_set()) Skipped 2
> >> previous similar messages
> >
> >This is -4 = -EINTR (from /usr/include/asm/errno.h), so it just means
> >your job was killed with CTRL-C when it was stuck. The server was
> >not responsive to the client and should be investigated.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

_______________________________________________
Lustre-discuss mailing list
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
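
For anyone scanning client syslogs for the same symptom, here is a rough
sketch of how the quoted message could be decoded: pull out the OST index
and the rc value, then map the rc to its errno name. The regular expression
and the decode_enqueue_error() helper are assumptions written only against
the one message quoted in this thread; they are not part of Lustre or any
Lustre tool, and other Lustre versions may format the line differently.

```python
import errno
import os
import re

# Hypothetical pattern, matched only against the message quoted in this
# thread; other Lustre releases may word the line differently.
ENQUEUE_RE = re.compile(r"on OST idx (?P<idx>\d+): rc = (?P<rc>-?\d+)")


def decode_enqueue_error(line):
    """Return the OST index and decoded errno for an enqueue error line.

    Returns None when the line does not look like an enqueue error.
    """
    match = ENQUEUE_RE.search(line)
    if match is None:
        return None
    rc = int(match.group("rc"))
    code = abs(rc)  # kernel return codes are negated errno values
    return {
        "ost_index": int(match.group("idx")),
        "rc": rc,
        "errno_name": errno.errorcode.get(code, "UNKNOWN"),
        "description": os.strerror(code),
    }


line = (
    "Mar 3 09:45:52 mt006 kernel: LustreError: "
    "2463:0:(lov_request.c:180:lov_update_enqueue_set()) error: enqueue "
    "objid 0x3920c71 subobj 0x101fe on OST idx 21: rc = -4"
)
info = decode_enqueue_error(line)
print(info["ost_index"], info["errno_name"])
```

Grouping the results by ost_index across many client logs would show
whether the interrupted enqueues all point at the same OST, which is the
node whose health should be checked first.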
