I've combined the questions and answers from several responses into this one message. Thank you, everyone, for your prompt responses and help.

>What distro and release are you using? There was a RH release, maybe 7.3 or 8,
>that had a problem like this. IIRC, upgrading the kernel corrected the problem.
>Regardless, please supply the distro release and Oscar version you are using.

ANS: We're using OSCAR version 2.3.1 with RH Linux 8. Obviously we'll try an upgrade to the latest. What is your recommendation: upgrade the OS first or upgrade OSCAR first? Is there a procedure for upgrading either or both that will maintain the integrity of the cluster?

>Also, what do the entries in the headnode's log look like around the time of
>the not responding messages in the compute nodes?

ANS: Can you point me to where the log is maintained? My original post contained some log messages, although I don't know which log they came from, as they were provided to me by someone else in direct contact with the machine. Do those help?

>Nature of the computational problem and the architecture of the cluster would
>have a significant bearing on the solution to your problem I think. I would
>guess offhand it's a bandwidth issue since the NFS does serve requests
>eventually. If you have a really large cluster or a very file writing
>intensive computational problem you can tie up the NFS server fairly quickly.
>About how many nodes are you using?

ANS: Master plus 7 nodes.

>Did you check the network switch?

ANS: The switch seems to be operating normally. Cables intact etc.

> >> We noticed this when our big jobs (need 3 days) were
> >> restarted. From the log, it was happening more and more
> >> frequently. Any suggestion on identifying the source of
> >> the problem?

----- Original Message -----
> >>
> >> [EMAIL PROTECTED] log]# grep nfs messages | grep not | wc -l
> >> 339
> >> [EMAIL PROTECTED] log]# grep nfs messages.1 | grep not | wc -l
> >> 381
> >> [EMAIL PROTECTED] log]# grep nfs messages.2 | grep not | wc -l
> >> 9
> >> [EMAIL PROTECTED] log]# grep nfs messages.3 | grep not | wc -l
> >> 0
> >> [EMAIL PROTECTED] log]# grep nfs messages.4 | grep not | wc -l
> >> 0
> >> [EMAIL PROTECTED] log]# ls -l messages*
> >> -rw------- 1 root root 130926 Dec 29 15:45 messages
> >> -rw------- 1 root root 509784 Dec 28 04:02 messages.1
> >> -rw------- 1 root root 416508 Dec 21 04:02 messages.2
> >> -rw------- 1 root root 586158 Dec 14 04:02 messages.3
> >> -rw------- 1 root root 413372 Dec 7 04:02 messages.4
> >
> >[MORE INFORMATION]
> >
> >[ ... more message ....]
> >messages:Dec 28 04:05:06 node7.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:05:30 node3.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:05:46 node2.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:05:57 node2.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:06:09 node3.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:06:34 node7.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:06:58 node7.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:07:48 node3.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:08:20 node2.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:09:01 node2.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:09:29 node7.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:09:53 node3.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:10:10 node3.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:10:38 node2.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:10:47 node3.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:11:06 node2.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:11:25 node3.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:11:35 node2.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:11:42 node3.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:11:47 node2.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:11:51 node3.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:12:08 node2.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:12:35 node2.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:13:14 node3.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:14:27 node3.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:15:36 node7.metis kernel: nfs: server nfs_oscar not responding, still trying
> >messages:Dec 28 04:16:30 node3.metis kernel: nfs: server nfs_oscar OK
> >messages:Dec 28 04:17:27 node7.metis kernel: nfs: server nfs_oscar OK
> >
> >[snip]
>
> -------------------------------------------------------
> This SF.net email is sponsored by: IBM Linux Tutorials.
> Become an expert in LINUX or just sharpen your skills. Sign up for IBM's
> Free Linux Tutorials. Learn everything from the bash shell to sys admin.
> Click now! http://ads.osdn.com/?ad_id=1278&alloc_id=3371&op=click
> _______________________________________________
> Oscar-users mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/oscar-users
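
P.S. The grep counts in the quoted post only show how often the "not responding" message appears. In case it helps anyone looking at a similar log, here is a minimal Python sketch that pairs each "not responding" line with the next "OK" line from the same node and reports how long each stall lasted. The names outage_durations, LINE_RE and the year argument are my own and are not part of OSCAR or any NFS tool. Syslog timestamps carry no year, so the year is an arbitrary assumption that only matters for a log spanning a year boundary or Feb 29. The sample lines are the first six entries from the log excerpt in this thread.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches the kernel NFS client messages seen in /var/log/messages, with or
# without the "messages:" filename prefix that grep adds.
LINE_RE = re.compile(
    r"(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<node>\S+) kernel: nfs: server (?P<server>\S+) "
    r"(?P<state>not responding|OK)"
)


def outage_durations(lines, year=2000):
    """Return {node: [stall length in seconds, ...]} for each node.

    `year` is an assumption: syslog lines do not record one.
    """
    open_since = {}               # node -> time of first unanswered warning
    outages = defaultdict(list)   # node -> list of stall lengths (seconds)
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue              # not an NFS client status line
        when = datetime.strptime(
            f"{year} {match['stamp']}", "%Y %b %d %H:%M:%S"
        )
        node = match["node"]
        if match["state"] == "OK":
            start = open_since.pop(node, None)
            if start is not None:
                outages[node].append(int((when - start).total_seconds()))
        else:
            # Keep the earliest warning if several arrive before an OK.
            open_since.setdefault(node, when)
    return dict(outages)


SAMPLE = [
    "messages:Dec 28 04:05:06 node7.metis kernel: nfs: server nfs_oscar not responding, still trying",
    "messages:Dec 28 04:05:30 node3.metis kernel: nfs: server nfs_oscar not responding, still trying",
    "messages:Dec 28 04:05:46 node2.metis kernel: nfs: server nfs_oscar not responding, still trying",
    "messages:Dec 28 04:05:57 node2.metis kernel: nfs: server nfs_oscar OK",
    "messages:Dec 28 04:06:09 node3.metis kernel: nfs: server nfs_oscar OK",
    "messages:Dec 28 04:06:34 node7.metis kernel: nfs: server nfs_oscar OK",
]

if __name__ == "__main__":
    for node, stalls in sorted(outage_durations(SAMPLE).items()):
        print(node, stalls)
```

For the six sample lines this gives stalls of 11 seconds on node2, 39 seconds on node3 and 88 seconds on node7. Feeding it the full messages file might show whether the stalls cluster around particular times or nodes, which could help separate a server load problem from a network one.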
