I've combined the questions from several responses, with my answers, into
this one message.  Thank you, everyone, for your prompt responses and help.

>What distro and release are you using?  There was a RH release, maybe 7.3
>or 8, that had a problem like this.  IIRC, upgrading the kernel corrected
>the problem.

>Regardless, please supply the distro release and Oscar version you are
>using.

ANS:We're using OSCAR version 2.3.1 with RH Linux 8.  Obviously we'll try an
upgrade to the latest.

What is your recommendation: upgrade the OS first, or upgrade OSCAR first?

Is there a procedure for upgrading either or both that will maintain the
integrity of the cluster?

>Also, what do the entries in the headnode's log look like around the time
>of the not responding messages in the compute nodes?

ANS:Can you point me to where the log is maintained?  My original post
contained some log messages, although I don't know which log they came
from, as they were provided to me by someone else in direct contact with
the machine. Do those help?

>The nature of the computational problem and the architecture of the cluster
>would have a significant bearing on the solution to your problem, I think.
>I would guess offhand it's a bandwidth issue, since the NFS server does
>serve requests eventually.  If you have a really large cluster or a very
>write-intensive computational problem, you can tie up the NFS server fairly
>quickly.
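
One way to test the bandwidth guess is to watch the client-side RPC
retransmission counters on a compute node while a job runs. The sketch below
is only that: it assumes the Linux nfs-utils `nfsstat -rc` layout (a header
line beginning `calls retrans`, followed by one line of numbers), and the
function name `retrans_pct` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `nfsstat -rc` output on stdin and print the
# percentage of RPC calls that had to be retransmitted.
# Assumption: the counters sit on the line right after the
# "calls retrans ..." header line.
retrans_pct() {
    awk '
        $1 == "calls" && $2 == "retrans" { getline; calls = $1; retrans = $2 }
        END {
            if (calls > 0) printf "%.2f\n", 100 * retrans / calls
            else print "0.00"
        }
    '
}

# Usage on a compute node (needs nfsstat from nfs-utils):
#   nfsstat -rc | retrans_pct
```

A percentage that keeps climbing during a run would be consistent with an
overloaded server or a lossy network path, though it does not say which.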

>About how many nodes are you using?

ANS:Master plus 7 nodes.

>Did you check the network switch?

ANS:The switch seems to be operating normally. Cables intact etc.



----- Original Message -----

> >[MORE INFORMATION]
> >
> >> We noticed this when our big jobs (need 3 days) were
> >> restarted. From the log, it was happening more and more
> >> frequently. Any suggestion on identifying the source of
> >> the problem?
> >>
> >>
> >> [EMAIL PROTECTED] log]# grep nfs messages | grep not | wc -l
> >>      339
> >> [EMAIL PROTECTED] log]# grep nfs messages.1 | grep not | wc -l
> >>      381
> >> [EMAIL PROTECTED] log]# grep nfs messages.2 | grep not | wc -l
> >>        9
> >> [EMAIL PROTECTED] log]# grep nfs messages.3 | grep not | wc -l
> >>        0
> >> [EMAIL PROTECTED] log]# grep nfs messages.4 | grep not | wc -l
> >>        0
> >> [EMAIL PROTECTED] log]# ls -l messages*
> >> -rw-------    1 root     root       130926 Dec 29 15:45 messages
> >> -rw-------    1 root     root       509784 Dec 28 04:02 messages.1
> >> -rw-------    1 root     root       416508 Dec 21 04:02 messages.2
> >> -rw-------    1 root     root       586158 Dec 14 04:02 messages.3
> >> -rw-------    1 root     root       413372 Dec  7 04:02 messages.4
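
The per-file counts quoted above can be gathered in one pass. This is a
minimal sketch, not the commands that were actually run: the function name
`count_nfs_timeouts` is made up, and the `/var/log` path in the usage comment
is an assumption based on the `log]#` prompt.

```shell
#!/bin/sh
# Hypothetical helper: for each syslog file given, print "<file> <count>",
# where count is the number of NFS "not responding" lines. The pattern is
# tighter than `grep nfs | grep not`, which also matches any other line
# that happens to contain both strings.
count_nfs_timeouts() {
    for f in "$@"; do
        # grep -c prints 0 but exits non-zero when nothing matches; ignore that.
        n=$(grep -c 'nfs: server .* not responding' "$f" || true)
        printf '%s %s\n' "$f" "$n"
    done
}

# Usage, as root (the log files are mode 0600; directory is an assumption):
#   cd /var/log && count_nfs_timeouts messages messages.[1-4]
```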
> >
> >[ ... more message ....]
> >messages:Dec 28 04:05:06 node7.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:05:30 node3.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:05:46 node2.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:05:57 node2.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:06:09 node3.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:06:34 node7.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:06:58 node7.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:07:48 node3.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:08:20 node2.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:09:01 node2.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:09:29 node7.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:09:53 node3.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:10:10 node3.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:10:38 node2.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:10:47 node3.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:11:06 node2.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:11:25 node3.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:11:35 node2.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:11:42 node3.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:11:47 node2.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:11:51 node3.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:12:08 node2.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:12:35 node2.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:13:14 node3.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:14:27 node3.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:15:36 node7.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:16:30 node3.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:17:27 node7.metis kernel: nfs: server nfs_oscar OK
> >[snip]
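
To see how long each node was actually stalled, the "not responding" and
"OK" entries can be paired up per node. The following is a rough sketch run
against the raw log file, under a few assumptions: the timestamp is the third
whitespace-separated field and the hostname the fourth (true for both plain
syslog lines and `grep` output with a `messages:` prefix), and no stall lasts
longer than a day. The name `nfs_stalls` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read syslog lines on stdin and print, per node, the
# number of NFS stalls, their total duration, and the longest one.
nfs_stalls() {
    awk '
        # Convert HH:MM:SS to seconds since midnight.
        function secs(t,   p) { split(t, p, ":"); return p[1]*3600 + p[2]*60 + p[3] }

        /nfs: server .* not responding/ {
            # Remember only the first report of an ongoing stall.
            if (!($4 in start)) start[$4] = secs($3)
            next
        }
        /nfs: server .* OK/ {
            if ($4 in start) {
                d = secs($3) - start[$4]
                if (d < 0) d += 86400      # stall crossed midnight
                total[$4] += d; count[$4]++
                if (d > longest[$4]) longest[$4] = d
                delete start[$4]
            }
        }
        END {
            for (n in count)
                printf "%s stalls=%d total=%ds longest=%ds\n", n, count[n], total[n], longest[n]
        }
    ' | sort
}

# Usage on the machine holding the log (path is an assumption):
#   nfs_stalls < /var/log/messages
```

If the stalls on all nodes start and end at about the same times, that would
point at the server or the shared network rather than at any single node.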
>
>
>
> -------------------------------------------------------
> This SF.net email is sponsored by: IBM Linux Tutorials.
> Become an expert in LINUX or just sharpen your skills.  Sign up for IBM's
> Free Linux Tutorials.  Learn everything from the bash shell to sys admin.
> Click now! http://ads.osdn.com/?ad_id=1278&alloc_id=3371&op=click
> _______________________________________________
> Oscar-users mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/oscar-users
>




