>From [EMAIL PROTECTED] Sat May 29 12:20:25 2004
Date: Sat, 29 May 2004 11:08:31 -0500 (CDT)
From: Michael McKee <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: Re: "send POLL failed" problem under PBS

Alex,

I have run the same code for months on our cluster.
After we upgraded to PBSPro5.4 (the only change to the
system), I started to get these "send POLL failed" errors
on a regular basis.  I was really hoping that you
had a fix.  Anyway, my suggestion would be
to go back to PBSPro5.3.x, which was very stable (for me).

If you hear more about this problem, please let me know.

Yours,
Mike

>Date: Sat, 29 May 2004 11:56:21 -0400 (EDT)
>From: Alexander V Shirokov <[EMAIL PROTECTED]>
>To: Michael McKee <[EMAIL PROTECTED]>
>cc: [EMAIL PROTECTED], [EMAIL PROTECTED]
>Dear Mike,
>Disabling X11 did not help.
>I did not find any way to solve this problem, except
>reducing the problem size from 1024^3 to 800^3 particles,
>which is not good, since the cluster's
>memory and CPU capacity allow for the 1024^3
>problem size with a little swap memory usage. I spent a lot of time
>in vain looking for a bug in the code.
>I am using PBSPro5.4 (pbs-5.4.0.40152-0 rpm package);
>maybe it would help to downgrade to an earlier version?
>Jeremy Enos also suggested that this might
>be due to a "ulimit problem". I checked on my
>system by typing "cexec ulimit" and every
>node shows "unlimited".
>Regards,
>Alex
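
A note on the ulimit check quoted above: in bash, a bare "ulimit" with
no option reports only the file-size limit (it is the same as
"ulimit -f"), so "unlimited" there says nothing about stack, data, or
open-file limits. Processes started by pbs_mom may also see different
limits than an interactive shell does. The sketch below is one way to
print several limits from inside the job itself. It is a minimal sketch,
not part of the original thread; the names LIMITS, format_limit, and
current_limits are made up for illustration.

```python
import resource

# Hypothetical selection of limits worth checking; extend as needed.
LIMITS = {
    "stack": resource.RLIMIT_STACK,
    "data": resource.RLIMIT_DATA,
    "nofile": resource.RLIMIT_NOFILE,
    "core": resource.RLIMIT_CORE,
}


def format_limit(value):
    """Render a limit the way the shell does: a number or 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {name: (soft, hard)} for the limits seen by this process."""
    report = {}
    for name, res in LIMITS.items():
        soft, hard = resource.getrlimit(res)
        report[name] = (format_limit(soft), format_limit(hard))
    return report


if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(f"{name:8s} soft={soft} hard={hard}")
```

Running this once from a login shell and once at the top of the PBS job
script would show whether the two environments really agree.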

>From [EMAIL PROTECTED] Sat May 29 12:20:19 2004
Date: Sat, 29 May 2004 10:29:54 -0500 (CDT)
From: Michael McKee <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: "send POLL failed" problem under PBS

Dear Alex,

I found your email message on a web site concerning your jobs
that failed under PBS.  The failure looks exactly like
the problem that I have.  The response to your message
was to add "ForwardX11 no" to the ssh configuration, but I have my
doubts that would work.  Actually, the problem with "send POLL failed"
started just after we upgraded to PBSPro5.4.  You did not
mention which version of PBS you were using.  Was it PBSPro5.4?

Have you made any progress in resolving this issue?

Thanks in advance,
Mike McKee, Professor of Chemistry
Auburn University
Auburn, AL

--here is my email to my sysadm-----
Phil,
My Gaussian jobs have been failing after several hours of computing
due to "send POLL failed".  I have no idea what causes this error
but I feel it must be related to the new PBS version.  Did
you make any other changes to the system?
See the following URL:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg03686.html

--- /var/spool/PBS/mom_logs/*528 --on node01---
05/28/2004 13:15:51;0008;pbs_mom;Job;7798.prism;Started, pid = 5656
05/28/2004 14:56:26;0008;pbs_mom;Job;7798.prism;send POLL failed
05/28/2004 14:56:24;0008;pbs_mom;Job;7798.prism;node 1 (node11) requested job die, code 15009
05/28/2004 14:56:24;0008;pbs_mom;Job;7798.prism;kill_job
05/28/2004 14:56:37;0080;pbs_mom;Job;7798.prism;task 1 terminated
05/28/2004 14:56:37;0008;pbs_mom;Job;7798.prism;Terminated
05/28/2004 14:56:37;0008;pbs_mom;Job;7798.prism;kill_job
05/28/2004 14:56:37;0001;pbs_mom;Svr;pbs_mom;Exec format error (8) in run_pelog, execle of prologue failed
05/28/2004 14:56:37;0001;pbs_mom;Job;7798.prism;pro/epilogue failed, file: /var/spool/PBS/mom_priv/epilogue, exit: 255, nonzero p/e exit status
--------------------------
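
The log lines above are semicolon-separated: timestamp, event code,
daemon, record type, job id, message. A small script can measure how
long a job ran before "send POLL failed" appeared, which helps when
comparing failures across jobs. This is a rough sketch based only on
the lines shown here; parse_line and seconds_until_poll_failure are
made-up names, and the field layout is inferred from this excerpt rather
than taken from PBS documentation.

```python
from datetime import datetime

# Two lines copied from the node01 log excerpt above.
SAMPLE_LOG = """\
05/28/2004 13:15:51;0008;pbs_mom;Job;7798.prism;Started, pid = 5656
05/28/2004 14:56:26;0008;pbs_mom;Job;7798.prism;send POLL failed
"""


def parse_line(line):
    """Split one mom_logs line into fields; return None if it does not fit."""
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None  # wrapped continuation lines, blanks, etc.
    stamp, code, daemon, kind, obj, message = parts
    try:
        when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return {"time": when, "code": code, "daemon": daemon,
            "kind": kind, "object": obj, "message": message}


def seconds_until_poll_failure(lines, job_id):
    """Seconds from 'Started' to the first 'send POLL failed' for job_id."""
    started = None
    for line in lines:
        rec = parse_line(line)
        if rec is None or rec["object"] != job_id:
            continue
        if rec["message"].startswith("Started"):
            started = rec["time"]
        elif rec["message"] == "send POLL failed" and started is not None:
            return (rec["time"] - started).total_seconds()
    return None


if __name__ == "__main__":
    print(seconds_until_poll_failure(SAMPLE_LOG.splitlines(), "7798.prism"))
```

For the two sample lines this gives 6035 seconds, about 1 hour 40
minutes between job start and the failed poll.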

----here is your email----------------
Dear Jeremy,
I attended the Beowulf cluster workshop at
MIT that you presented two years ago. Since then I have been using
Beowulf clusters all the time. I have been trying to solve a
problem (a bug) for two weeks already. I am supposed to defend
my PhD in August, and time is short. I would really appreciate your help,
since it would get things moving again. Please help me solve
this problem, if possible.
When I run the code without submitting it
to PBSPro via qsub, the program crashes
after about 40 timesteps.
When I submit it via PBSPro's qsub, I get these
error diagnostics after about 10 timesteps, and the run dies:
1) Standard error PBSPro file int2.pbs.e919:
Warning: No xauth data; using fake authentication data for X11 forwarding.
=>> PBS: job killed: node 17 (node18) requested job die, code 15009
2) File /var/spool/PBS/mom_logs/20040508 on node18:
13:31:17;0008;pbs_mom;Job;919.antares.mit.edu;JOIN JOB as node 17
15:04:46;0004;pbs_mom;Job;919.antares.mit.edu;polling stopped
15:04:46;0008;pbs_mom;Job;919.antares.mit.edu;kill_job
3) File /var/spool/PBS/mom_logs/20040508 on node1:
11:43:54;0008;pbs_mom;Job;790.antares.mit.edu;Started, pid = 13919
13:18:12;0008;pbs_mom;Job;844.antares.mit.edu;Started, pid = 12919
13:31:17;0008;pbs_mom;Job;919.antares.mit.edu;Started, pid = 14043
15:06:46;0008;pbs_mom;Job;919.antares.mit.edu;send POLL failed
15:06:46;0008;pbs_mom;Job;919.antares.mit.edu;node 17 (node18) requested job die, code 15009
15:06:46;0008;pbs_mom;Job;919.antares.mit.edu;kill_job
15:06:48;0080;pbs_mom;Job;919.antares.mit.edu;task 1 terminated
15:06:48;0008;pbs_mom;Job;919.antares.mit.edu;Terminated
15:06:58;0008;pbs_mom;Job;919.antares.mit.edu;kill_job
15:06:58;0100;pbs_mom;Job;919.antares.mit.edu;Obit sent
4)
The error messages in the standard output files on these nodes look the same:
p67_5862: p4_error: net_recv read: probable EOF on socket: 1
However on node16, it is
 p64_6016: (5813.998720) net_recv failed for fd = 3
 p64_6016:  p4_error: net_recv read, errno = : 104
on node4 it is
 p16_6446: (5832.189857) net_recv failed for fd = 3
 p16_6446:  p4_error: net_recv read, errno = : 104
Thank you, and I would really appreciate your help.
Regards,
Alex
--------------------------------
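
The "errno = : 104" in the p4 messages quoted above can be decoded on
the machine that produced it. On Linux, errno 104 is ECONNRESET
("Connection reset by peer"), which would be consistent with the process
on the other end of the socket having already been killed. Errno
numbering differs between operating systems, so the sketch below looks
the value up locally instead of hard-coding it. describe_errno is a
made-up helper name.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this system."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 104 is the value reported in the p4_error lines above.
    name, message = describe_errno(104)
    print(f"errno 104 -> {name}: {message}")
```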




_______________________________________________
Oscar-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/oscar-users
