Andreas,

Thanks for the reply. I think that in your response below you are referring to the Lustre server and client, rather than the MPI server (mother superior) and its clients.

We are seeing many messages of this type. They coincide with the job being ended in PBS and with this message being written to the job's stdout: "FATAL from PE 34: open_param_file: Input file INPUT/HIM_input does not exist." I think you are saying that the Lustre server is unresponsive, but I would like to confirm. We are not seeing many other messages that we can tie to the job exit.

Jan

At 10:02 AM +0800 3/4/07, Andreas Dilger wrote:
On Mar 03, 2007  15:57 -0900, Jan H. Julian wrote:
 We are starting to investigate extremely slow performance on one of
 our test jobs using lustre.1.4.7.1 and have encountered the following
 error message in the job output:

 >Mar  3 09:45:52 mt006 kernel: LustreError:
 >2463:0:(lov_request.c:180:lov_update_enqueue_set()) error: enqueue
 >objid 0x3920c71 subobj 0x101fe on OST idx 21: rc = -4
 >Mar  3 09:45:52 mt006 kernel: LustreError:
 >2463:0:(lov_request.c:180:lov_update_enqueue_set()) Skipped 2
 >previous similar messages

This is -4 = -EINTR (from /usr/include/asm/errno.h), so it just means
your job was killed with CTRL-C when it was stuck.  The server was
not responsive to the client and should be investigated.
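
As an illustration of the errno lookup described above, here is a minimal sketch that pulls the OST index and return code out of the quoted LustreError line and maps the code to its symbolic name. The helper names (decode_rc, parse_lustre_error) and the regular expression are hypothetical and written only against the single message shown in this thread; they are not part of Lustre or its tooling, and other message formats may not match.

```python
import errno
import os
import re

# Assumption: matches only the "... on OST idx <n>: rc = <code>" tail
# seen in the message quoted in this thread.
LINE_RE = re.compile(r"OST idx (?P<ost>\d+): rc = (?P<rc>-?\d+)")


def decode_rc(rc):
    """Map a kernel-style negative return code to (errno name, description)."""
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


def parse_lustre_error(line):
    """Return the OST index, raw rc, and errno name, or None if no match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    rc = int(match.group("rc"))
    return {
        "ost_idx": int(match.group("ost")),
        "rc": rc,
        "errno": decode_rc(rc)[0],
    }


# The wrapped log lines from the message, joined back into one string.
msg = (
    "LustreError: 2463:0:(lov_request.c:180:lov_update_enqueue_set()) "
    "error: enqueue objid 0x3920c71 subobj 0x101fe on OST idx 21: rc = -4"
)
print(parse_lustre_error(msg))
```

On Linux the errno field for rc = -4 comes out as EINTR, which matches the reading above. Counting such entries per OST index across the syslog could help show whether a single OST is the one failing to respond.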

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


--
Jan Julian     University of Alaska, ARSC    mailto:[EMAIL PROTECTED]
(907) 450-8641  910 Yukon Drive, Suite 001    http://www.arsc.edu
Fax: 450-8605  Fairbanks, AK 99775-6020 USA

_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
