I don't know what the error is, but it doesn't look like a PBS error. Have you tried running outside of the PBS environment? Are you sure there have been no other changes since your last successful run?
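Since the p4 errors are about reaching remote processes, one thing that could be checked outside PBS is whether every node in the machinefile accepts a non-interactive remote command. A rough sketch, assuming ssh is the remote shell this MPICH build uses (it may be rsh instead) and that the machinefile has `host` or `host:ncpus` entries; `check_nodes` and `REMOTE_SH` are names made up for illustration:

```shell
#!/bin/sh
# Sketch: check that each host in an MPICH machinefile accepts a
# non-interactive remote command.
# ASSUMPTION: REMOTE_SH is a made-up variable; set it to whatever remote
# shell your MPICH build actually uses (ssh or rsh).
REMOTE_SH="${REMOTE_SH:-ssh -o BatchMode=yes -o ConnectTimeout=5}"

check_nodes() {
    machinefile="$1"
    failed=0
    # Drop comments, strip any ":ncpus" suffix, and de-duplicate host names.
    for host in $(sed -e 's/#.*//' -e 's/:.*//' "$machinefile" | awk 'NF {print $1}' | sort -u); do
        # Run the trivial command "true" on the remote host.
        if $REMOTE_SH "$host" true </dev/null >/dev/null 2>&1; then
            echo "ok   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    # Non-zero exit status if any host could not be reached.
    return $failed
}

# Only run when a machinefile is given, e.g.: sh check_nodes.sh Oscar.machine
if [ "$#" -ge 1 ]; then
    check_nodes "$1"
fi
```

Any host reported as FAIL would be worth looking at first; a host that prompts for a password also counts as a failure here, because the check is non-interactive.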
Jeremy
At 03:46 PM 4/22/2004, Brandi L. Winfrey wrote:
I have OSCAR 3.0 running on Red Hat 7.3 and had been running an MPI code on it successfully until today, when it stopped working. The problem does not seem to be with the code; rather, something (MPI? PBS?) is not connecting to the slave processes. The error I get is this:
---------------------------------------------------------------------------
[EMAIL PROTECTED] ashplumeP_v1]$ mpirun -np 8 -machinefile Oscar.machine -v ashplumeP test.out test.in grid GRID
running /home/bwinfrey/ashplumeP_v1/ashplumeP on 8 LINUX ch_p4 processors
Created /home/bwinfrey/ashplumeP_v1/PI6490
p1_958: p4_error: Timeout in establishing connection to remote process: 0
bm_list_6701: (313.881659) wakeup_slave: unable to interrupt slave 0 pid 6700
bm_list_6701: (313.882068) wakeup_slave: unable to interrupt slave 0 pid 6700
bm_list_6701: (313.882278) wakeup_slave: unable to interrupt slave 0 pid 6700
p5_869: p4_error: net_recv read: probable EOF on socket: 1
[EMAIL PROTECTED] ashplumeP_v1]$ p3_953: p4_error: net_recv read: probable EOF on socket: 1
p7_869: p4_error: net_recv read: probable EOF on socket: 1
p4_865: p4_error: net_recv read: probable EOF on socket: 1
p6_872: p4_error: net_recv read: probable EOF on socket: 1
bm_list_6701: (313.887301) wakeup_slave: unable to interrupt slave 0 pid 6700
bm_list_6701: (313.887734) wakeup_slave: unable to interrupt slave 0 pid 6700
---------------------------------------------------------------------------
As far as I know, I have not disabled or changed any services. This morning I compressed all of the log files in /var/spool/pbs/server_logs into .tar.gz files and deleted some of the older files because they were taking up too much space.
Is there a process that might cause these errors if it were not running? I CAN SSH to all of the client nodes and it looks like pbs is running:
---------------------------------------------------------------------------
[EMAIL PROTECTED] ashplumeP_v1]$ ps -ef | grep pbs
root      2077     1  0 11:18 ?        00:00:00 /opt/pbs/sbin/pbs_mom -r
root      2085     1  0 11:18 ?        00:00:10 /opt/pbs/sbin/pbs_server
bwinfrey  6760  5441  0 15:27 pts/0    00:00:00 grep pbs
---------------------------------------------------------------------------
Any suggestions?
Brandi
_______________________________________________
Oscar-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/oscar-users
