Has oscarnode4 been rebuilt, by any chance? On a fresh rebuild, before a post_install is run, I believe ssh is the only service that pfilter leaves open. That would certainly block MPI traffic.
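
One way to test the firewall theory from the head node is to try opening TCP connections to the node on a few ports. The following is a minimal sketch, not an OSCAR or pfilter tool: the helper name port_open is made up for illustration, and the ports in the usage comment are assumptions (22 for ssh, 15002 as the usual pbs_mom default). The ports MPICH's ch_p4 device picks at run time are not stated in this thread, so a closed port here is only a hint.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example use from the head node (hostname and ports are assumptions):
#   for port in (22, 15002):
#       print(port, port_open("oscarnode4", port))
```

If ssh answers but the other ports time out rather than being refused, that would be consistent with a packet filter dropping the traffic.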

Jeremy

At 04:39 PM 4/22/2004, Brandi L. Winfrey wrote:
After running the command "pbsnodes -a", I determined
that oscarnode4 was not responding properly:

     oscarnode4.oscardomain
     state = state-unknown,down
     np = 1
     properties = all
     ntype = cluster
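
With more than a handful of nodes, scanning this output by eye gets tedious. Below is a minimal sketch that picks out unhealthy nodes from the text printed by "pbsnodes -a". It assumes the layout shown above (a node name on its own line followed by "key = value" lines); the function name down_nodes and the set of states treated as bad are choices made for illustration.

```python
def down_nodes(pbsnodes_text):
    """Return the names of nodes whose state includes a 'bad' value."""
    bad_states = {"down", "state-unknown", "offline"}  # assumed set of bad states
    bad = []
    node = None
    for raw in pbsnodes_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if " = " not in line:
            # A line without "key = value" is taken to be a node name.
            node = line
            continue
        key, value = line.split(" = ", 1)
        if key == "state" and node is not None:
            states = {s.strip() for s in value.split(",")}
            if states & bad_states:
                bad.append(node)
    return bad


# Example use on the head node (requires pbsnodes in PATH):
#   import subprocess
#   out = subprocess.run(["pbsnodes", "-a"], capture_output=True, text=True).stdout
#   print(down_nodes(out))
```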

I SSHed to oscarnode4 and checked to see if PBS was
running on it (it appears to be running):

[EMAIL PROTECTED] bwinfrey]$ ps -ef | grep pbs
root       610     1  0 10:21 ?        00:00:00 /opt/pbs/sbin/pbs_mom -r
bwinfrey   971   898  0 15:33 pts/0    00:00:00 grep pbs
[EMAIL PROTECTED] bwinfrey]$

I removed oscarnode4 from the machinefile in the directory on the
head node where I run the code, tried running it again, and it
worked.  I don't understand why oscarnode4 can't be used, though.
It is up and running, I can SSH to it, and it is running PBS.
Why can't I use it?  Very strange.
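
The same workaround can be scripted. This is a minimal sketch that drops named hosts from a machinefile's lines; it assumes one host per line, optionally written as "host:n", with "#" comment lines, and it compares only the short hostname (the part before the first dot). The function name filter_machinefile is hypothetical.

```python
def filter_machinefile(lines, exclude):
    """Return the machinefile lines with the excluded hosts removed."""
    # Compare short names so "oscarnode4" matches "oscarnode4.oscardomain".
    excluded = {name.split(".")[0] for name in exclude}
    kept = []
    for line in lines:
        entry = line.strip()
        if not entry or entry.startswith("#"):
            kept.append(line)  # keep blank lines and comments untouched
            continue
        host = entry.split()[0].split(":")[0]  # strip an optional ":n" suffix
        if host.split(".")[0] in excluded:
            continue
        kept.append(line)
    return kept


# Example use (file names are assumptions):
#   with open("Oscar.machine") as f:
#       kept = filter_machinefile(f.readlines(), ["oscarnode4"])
#   with open("Oscar.machine.filtered", "w") as f:
#       f.writelines(kept)
```

Writing to a separate file keeps the original machinefile intact for when the node is usable again.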

Brandi





-----Original Message-----
From: Brandi L. Winfrey [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 22, 2004 3:46 PM
To: '[EMAIL PROTECTED]'
Subject: Re: [Oscar-users] mpich


I have OSCAR 3.0 running on Red Hat 7.3 and had been running an MPI code on it successfully until today, when it stopped working. The problem is not with the code; rather, something (MPI? PBS?) is failing to connect to the slave processes. This is the error I get:

---------------------------------------------------------------------------
[EMAIL PROTECTED] ashplumeP_v1]$ mpirun -np 8 -machinefile Oscar.machine -v ashplumeP test.out test.in grid GRID
running /home/bwinfrey/ashplumeP_v1/ashplumeP on 8 LINUX ch_p4 processors
Created /home/bwinfrey/ashplumeP_v1/PI6490
p1_958:  p4_error: Timeout in establishing connection to remote process: 0
bm_list_6701: (313.881659) wakeup_slave: unable to interrupt slave 0 pid 6700
bm_list_6701: (313.882068) wakeup_slave: unable to interrupt slave 0 pid 6700
bm_list_6701: (313.882278) wakeup_slave: unable to interrupt slave 0 pid 6700
p5_869:  p4_error: net_recv read:  probable EOF on socket: 1
[EMAIL PROTECTED] ashplumeP_v1]$ p3_953:  p4_error: net_recv read:  probable EOF on socket: 1
p7_869:  p4_error: net_recv read:  probable EOF on socket: 1
p4_865:  p4_error: net_recv read:  probable EOF on socket: 1
p6_872:  p4_error: net_recv read:  probable EOF on socket: 1
bm_list_6701: (313.887301) wakeup_slave: unable to interrupt slave 0 pid 6700
bm_list_6701: (313.887734) wakeup_slave: unable to interrupt slave 0 pid 6700
---------------------------------------------------------------------------

As far as I know, I have not disabled or changed any services.
This morning I compressed all of the log files in /var/spool/pbs/server_logs
into .tar.gz files and deleted some of the older ones because they were
taking up too much space.

Is there a process that might cause these errors if it were not running?
I CAN SSH to all of the client nodes and it looks like pbs is running:

---------------------------------------------------------------------------
[EMAIL PROTECTED] ashplumeP_v1]$ ps -ef | grep pbs
root      2077     1  0 11:18 ?        00:00:00 /opt/pbs/sbin/pbs_mom -r
root      2085     1  0 11:18 ?        00:00:10 /opt/pbs/sbin/pbs_server
bwinfrey  6760  5441  0 15:27 pts/0    00:00:00 grep pbs
---------------------------------------------------------------------------

Any suggestions?

Brandi


