Andreas,
Thanks for the reply. I take it that your response below refers to the
Lustre server and client, rather than to the MPI server (mother
superior) and its clients.
We are seeing many messages of this type. They coincide with the job
being ended in PBS and with this message written to the job's stdout:
"FATAL from PE 34: open_param_file: Input file INPUT/HIM_input does
not exist."
I think you are saying that the Lustre server is unresponsive, but I
would like to confirm. We are not seeing many other messages that we
can tie to the job exit.
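To help tie these messages to specific OSTs, here is a minimal Python
sketch that pulls the OST index and return code out of lines like the
one quoted below and maps the code to its errno name. The regular
expression is inferred only from the sample line in this thread, not
from Lustre documentation, and the function names are hypothetical. It
also assumes each message sits on a single line in the syslog file
(the wrapping below comes from the mail client).

```python
import errno
import re
from collections import Counter

# Pattern inferred from the single sample line in this thread (an assumption,
# not an official description of the Lustre log format).
ENQUEUE_ERROR = re.compile(
    r"lov_update_enqueue_set\(\)\)\s+error:\s+enqueue\s+"
    r"objid\s+(?P<objid>0x[0-9a-fA-F]+)\s+"
    r"subobj\s+(?P<subobj>0x[0-9a-fA-F]+)\s+"
    r"on\s+OST\s+idx\s+(?P<ost>\d+):\s+rc\s+=\s+(?P<rc>-?\d+)"
)


def parse_enqueue_error(line):
    """Return a dict describing one enqueue error line, or None if no match."""
    match = ENQUEUE_ERROR.search(line)
    if match is None:
        return None
    rc = int(match.group("rc"))
    return {
        "objid": match.group("objid"),
        "subobj": match.group("subobj"),
        "ost": int(match.group("ost")),
        "rc": rc,
        # Kernel return codes are negated errno values, so -4 looks up errno 4.
        "errno_name": errno.errorcode.get(-rc, "unknown"),
    }


def count_by_ost(lines):
    """Count matching error lines per (OST index, errno name) pair."""
    counts = Counter()
    for line in lines:
        record = parse_enqueue_error(line)
        if record is not None:
            counts[(record["ost"], record["errno_name"])] += 1
    return counts


# The sample message from this thread, joined back onto one line.
SAMPLE = (
    "Mar 3 09:45:52 mt006 kernel: LustreError: "
    "2463:0:(lov_request.c:180:lov_update_enqueue_set()) error: enqueue "
    "objid 0x3920c71 subobj 0x101fe on OST idx 21: rc = -4"
)

if __name__ == "__main__":
    print(parse_enqueue_error(SAMPLE))
    print(count_by_ost([SAMPLE]))
```

Running count_by_ost over the lines of the client's syslog would show
whether the errors cluster on one OST or are spread across several.
Note that the "Skipped N previous similar messages" lines are not
counted by this sketch.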
Jan
At 10:02 AM +0800 3/4/07, Andreas Dilger wrote:
>On Mar 03, 2007 15:57 -0900, Jan H. Julian wrote:
>>We are starting to investigate extremely slow performance on one of
>>our test jobs using lustre.1.4.7.1 and have encountered the following
>>error message in the job output:
>>Mar 3 09:45:52 mt006 kernel: LustreError:
>>2463:0:(lov_request.c:180:lov_update_enqueue_set()) error: enqueue
>>objid 0x3920c71 subobj 0x101fe on OST idx 21: rc = -4
>>Mar 3 09:45:52 mt006 kernel: LustreError:
>>2463:0:(lov_request.c:180:lov_update_enqueue_set()) Skipped 2
>>previous similar messages
>This is -4 = -EINTR (from /usr/include/asm/errno.h), so it just means
>your job was killed with CTRL-C when it was stuck. The server was
>not responsive to the client and should be investigated.
>Cheers, Andreas
>--
>Andreas Dilger
>Principal Software Engineer
>Cluster File Systems, Inc.
--
Jan Julian University of Alaska, ARSC mailto:[EMAIL PROTECTED]
(907) 450-8641 910 Yukon Drive, Suite 001 http://www.arsc.edu
Fax: 450-8605 Fairbanks, AK 99775-6020 USA
_______________________________________________
Lustre-discuss mailing list
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss