On Mar 12, 2007  15:44 -0800, Jan H. Julian wrote:
> Thanks for the reply.  I think that you are referring to the lustre 
> server and client rather than the mpi server (mother superior) and 
> the clients in your response below.

Correct.

> We are seeing many messages of this type which correspond to the job 
> being ended in PBS and a message sent to the job std out "FATAL from 
> PE   34: open_param_file: Input file INPUT/HIM_input does not exist." 
> I think you are saying that the lustre server is unresponsive, but 
> would like to confirm.  We're not seeing many other messages which we 
> can tie to the job exit.

In particular, this Lustre client is reporting that the process was
killed while it was waiting for a lock enqueue on OST index #21.  The
next step is to find the node that serves OST index #21 and verify
its health.
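
If it helps when scanning many of these messages, here is a minimal
Python sketch that pulls the OST index and return code out of a console
line like the one quoted below and maps the code to its errno name.  The
regular expression and the helper name decode_enqueue_error are
assumptions of mine, based only on that one example line; they are not
part of Lustre.  The errno table used is the one on the machine running
the script, which is assumed to match the client's.

```python
import errno
import re

# Field layout assumed from the single lov_update_enqueue_set() message
# quoted in this thread; other Lustre versions may format it differently.
ENQUEUE_RE = re.compile(
    r"enqueue objid (?P<objid>0x[0-9a-fA-F]+) "
    r"subobj (?P<subobj>0x[0-9a-fA-F]+) "
    r"on OST idx (?P<ost_idx>\d+): rc = (?P<rc>-?\d+)"
)


def decode_enqueue_error(line):
    """Return the OST index, rc, and errno name from one log line, or None."""
    match = ENQUEUE_RE.search(line)
    if match is None:
        return None
    rc = int(match.group("rc"))
    return {
        "objid": match.group("objid"),
        "subobj": match.group("subobj"),
        "ost_idx": int(match.group("ost_idx")),
        "rc": rc,
        # Kernel return codes are negated errno values, so look up -rc.
        "errno_name": errno.errorcode.get(-rc, "unknown"),
    }


# The message from the original report, joined onto one line.
EXAMPLE_LINE = (
    "Mar  3 09:45:52 mt006 kernel: LustreError: "
    "2463:0:(lov_request.c:180:lov_update_enqueue_set()) error: enqueue "
    "objid 0x3920c71 subobj 0x101fe on OST idx 21: rc = -4"
)

if __name__ == "__main__":
    print(decode_enqueue_error(EXAMPLE_LINE))
```

Feeding it each LustreError line from the client syslog and grouping the
results by ost_idx would show whether the failures all point at the same
OST or are spread across several.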

> At 10:02 AM +0800 3/4/07, Andreas Dilger wrote:
> >On Mar 03, 2007  15:57 -0900, Jan H. Julian wrote:
> >> We are starting to investigate extremely slow performance on one of
> >> our test jobs using lustre.1.4.7.1 and have encountered the following
> >> error message in the job output:
> >>
> >> >Mar  3 09:45:52 mt006 kernel: LustreError:
> >> >2463:0:(lov_request.c:180:lov_update_enqueue_set()) error: enqueue
> >> >objid 0x3920c71 subobj 0x101fe on OST idx 21: rc = -4
> >> Mar  3 09:45:52 mt006 kernel: LustreError:
> >> 2463:0:(lov_request.c:180:lov_update_enqueue_set()) Skipped 2
> >> previous similar messages
> >
> >This is -4 = -EINTR (from /usr/include/asm/errno.h), so it just means
> >your job was killed with CTRL-C when it was stuck.  The server was
> >not responsive to the client and should be investigated.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

_______________________________________________
Lustre-discuss mailing list
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss