Jeremy
At 04:39 PM 4/22/2004, Brandi L. Winfrey wrote:
After running the command "pbsnodes -a", I determined that oscarnode4 was not responding properly:
oscarnode4.oscardomain
     state = state-unknown,down
     np = 1
     properties = all
     ntype = cluster
I SSHed to oscarnode4 and checked to see if PBS was running on it (it appears to be running):
[EMAIL PROTECTED] bwinfrey]$ ps -ef | grep pbs
root       610     1  0 10:21 ?        00:00:00 /opt/pbs/sbin/pbs_mom -r
bwinfrey   971   898  0 15:33 pts/0    00:00:00 grep pbs
[EMAIL PROTECTED] bwinfrey]$
I removed oscarnode4 from the machinefile on the head node (in the directory I run the code from), ran the code again, and it worked. I don't understand why oscarnode4 can't be used, though. It is up and running, I can SSH to it, and it is running PBS. Why can't I use it? Very strange.
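
The manual workaround above (dropping the unresponsive node from the machinefile) could be scripted. The following is only a rough sketch, assuming the usual indented "key = value" layout of "pbsnodes -a" output and a machinefile with one "host" or "host:ncpus" entry per line. The function names, the set of states treated as unusable, and the second node (oscarnode3) in the sample are made up for illustration.

```python
# Sketch: drop nodes that PBS reports as unusable from an MPICH machinefile.
# Assumes `pbsnodes -a` prints a node name at column 0 followed by indented
# "key = value" lines. Helper names and BAD_STATES are illustrative choices.

BAD_STATES = {"down", "state-unknown", "offline"}


def down_nodes(pbsnodes_output):
    """Return the names of nodes whose state list contains a bad state."""
    bad = set()
    current = None
    for line in pbsnodes_output.splitlines():
        if not line.strip():
            current = None          # blank line ends a node record
            continue
        if not line[0].isspace():
            current = line.strip()  # unindented line is a node name
            continue
        key, _, value = line.strip().partition("=")
        if current and key.strip() == "state":
            states = {s.strip() for s in value.split(",")}
            if states & BAD_STATES:
                bad.add(current)
    return bad


def filter_machinefile(lines, bad):
    """Keep machinefile lines whose host (short name) is not in `bad`."""
    def short(name):
        return name.split(".")[0]

    bad_short = {short(b) for b in bad}
    kept = []
    for line in lines:
        host = line.split(":")[0].strip()
        if host and short(host) in bad_short:
            continue
        kept.append(line)
    return kept


if __name__ == "__main__":
    # Hypothetical sample: oscarnode4 is from the report, oscarnode3 is made up.
    sample = (
        "oscarnode3.oscardomain\n"
        "     state = free\n"
        "     np = 1\n"
        "\n"
        "oscarnode4.oscardomain\n"
        "     state = state-unknown,down\n"
        "     np = 1\n"
    )
    bad = down_nodes(sample)
    print(filter_machinefile(["oscarnode3", "oscarnode4"], bad))
```

In practice the text passed to down_nodes would be the captured output of "pbsnodes -a" on the head node, and the filtered lines would be written to a new machinefile rather than overwriting Oscar.machine. This only automates the workaround; it does not explain why oscarnode4 is marked down.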
Brandi
-----Original Message-----
From: Brandi L. Winfrey [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 22, 2004 3:46 PM
To: '[EMAIL PROTECTED]'
Subject: Re: [Oscar-users] mpich
I have OSCAR 3.0 running on Red Hat 7.3 and had been running an MPI code successfully on it until today, when it stopped running. The problem does not seem to be with the code; rather, MPI (or PBS?) is not connecting to the slave processes. The error I get is this:
---------------------------------------------------------------------------
[EMAIL PROTECTED] ashplumeP_v1]$ mpirun -np 8 -machinefile Oscar.machine -v ashplumeP test.out test.in grid GRID
running /home/bwinfrey/ashplumeP_v1/ashplumeP on 8 LINUX ch_p4 processors
Created /home/bwinfrey/ashplumeP_v1/PI6490
p1_958: p4_error: Timeout in establishing connection to remote process: 0
bm_list_6701: (313.881659) wakeup_slave: unable to interrupt slave 0 pid 6700
bm_list_6701: (313.882068) wakeup_slave: unable to interrupt slave 0 pid 6700
bm_list_6701: (313.882278) wakeup_slave: unable to interrupt slave 0 pid 6700
p5_869: p4_error: net_recv read: probable EOF on socket: 1
[EMAIL PROTECTED] ashplumeP_v1]$ p3_953: p4_error: net_recv read: probable EOF on socket: 1
p7_869: p4_error: net_recv read: probable EOF on socket: 1
p4_865: p4_error: net_recv read: probable EOF on socket: 1
p6_872: p4_error: net_recv read: probable EOF on socket: 1
bm_list_6701: (313.887301) wakeup_slave: unable to interrupt slave 0 pid 6700
bm_list_6701: (313.887734) wakeup_slave: unable to interrupt slave 0 pid 6700
---------------------------------------------------------------------------
As far as I know, I have not disabled or changed any services. This morning I compressed all of the log files in /var/spool/pbs/server_logs into .tar.gz files and deleted some of the older ones because they were taking up too much space.
Is there a process that would cause these errors if it were not running? I can SSH to all of the client nodes, and it looks like PBS is running:
---------------------------------------------------------------------------
[EMAIL PROTECTED] ashplumeP_v1]$ ps -ef | grep pbs
root      2077     1  0 11:18 ?        00:00:00 /opt/pbs/sbin/pbs_mom -r
root      2085     1  0 11:18 ?        00:00:10 /opt/pbs/sbin/pbs_server
bwinfrey  6760  5441  0 15:27 pts/0    00:00:00 grep pbs
---------------------------------------------------------------------------
Any suggestions?
Brandi
