>From [EMAIL PROTECTED] Sat May 29 12:20:25 2004
Date: Sat, 29 May 2004 11:08:31 -0500 (CDT)
From: Michael McKee <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: Re: "send POLL failed" problem under PBS

Alex,

I have run the same code for months on our cluster. After we upgraded to PBSPro5.4 (the only change to the system), I started to get these "send POLL failed" errors on a regular basis. I was really hoping that you had a fix. Anyway, my suggestion would be to go back to PBSPro5.3.x, which was very stable (for me).

If you hear more about this problem, please let me know.

Yours,
Mike

>Date: Sat, 29 May 2004 11:56:21 -0400 (EDT)
>From: Alexander V Shirokov <[EMAIL PROTECTED]>
>To: Michael McKee <[EMAIL PROTECTED]>
>cc: [EMAIL PROTECTED], [EMAIL PROTECTED]
>
>Dear Mike,
>
>Disabling X11 did not help.
>I did not find any way to solve this problem, except
>reducing the problem size from 1024^3 to 800^3 particles,
>which is not good, since the cluster size
>memory and CPU requirements allows for the 1024^3
>problem size with a little swap memory usage. I spent a lot of time
>in vain looking for a bug in the code.
>
>I am using PBSPro5.4 (pbs-5.4.0.40152-0 rpm package),
>maybe it might help to downgrade to earlier version?
>
>Jeremy Enos also suggested that this might
>be due to "ulimit problem". I checked on my
>system by typing "cexec ulimit" and every
>node shows "unlimited".
>
>Regards,
>Alex

>From [EMAIL PROTECTED] Sat May 29 12:20:19 2004
Date: Sat, 29 May 2004 10:29:54 -0500 (CDT)
From: Michael McKee <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: "send POLL failed" problem under PBS

Dear Alex,

I found your email message on a web site concerning your jobs that failed under PBS. The nature of the failure is exactly the same as the problem that I have. The response to your message was to add "ForwardX11 no" to ssh, but I have my doubts that would work. Actually, the problem with "send POLL failed" started just after we upgraded to PBSPro5.4. You did not mention which version of PBS you were using. Was it PBSPro5.4?

Have you made any progress in resolving this issue?

Thanks in advance,

Mike McKee, Professor of Chemistry
Auburn University
Auburn, AL

--here is my email to my sysadm-----

Phil,

My Gaussian jobs have been failing after several hours of computing due to "send POLL failed". I have no idea what causes this error but I feel it must be related to the new PBS version. Did you make any other changes to the system?

See the following URL:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg03686.html

--- /var/spool/PBS/mom_logs/*528 --on node01---
05/28/2004 13:15:51;0008;pbs_mom;Job;7798.prism;Started, pid = 5656
05/28/2004 14:56:26;0008;pbs_mom;Job;7798.prism;send POLL failed
05/28/2004 14:56:24;0008;pbs_mom;Job;7798.prism;node 1 (node11) requested job die, code 15009
05/28/2004 14:56:24;0008;pbs_mom;Job;7798.prism;kill_job
05/28/2004 14:56:37;0080;pbs_mom;Job;7798.prism;task 1 terminated
05/28/2004 14:56:37;0008;pbs_mom;Job;7798.prism;Terminated
05/28/2004 14:56:37;0008;pbs_mom;Job;7798.prism;kill_job
05/28/2004 14:56:37;0001;pbs_mom;Svr;pbs_mom;Exec format error (8) in run_pelog, execle of prologue failed
05/28/2004 14:56:37;0001;pbs_mom;Job;7798.prism;pro/epilogue failed, file: /var/spool/PBS/mom_priv/epilogue, exit: 255, nonzero p/e exit status
--------------------------

----here is your email----------------

Dear Jeremy,

I attended the Beowulf cluster workshop at MIT where you presented two years ago. Since then I have been using Beowulf clusters all the time.

I have been trying to solve a problem (a bug) for two weeks already. I am supposed to defend a PhD in August, and time is short. I would really appreciate your help, since it would get things moving again. Please help me solve this problem, if possible.

When I run the code without submitting it to PBSPro via qsub, the program crashes after about 40 timesteps.
When I run the code by submitting it to PBSPro via qsub, I get the following error diagnostics after about 10 timesteps, and the run dies:

1) Standard error PBSPro file int2.pbs.e919:

Warning: No xauth data; using fake authentication data for X11 forwarding.
=>> PBS: job killed: node 17 (node18) requested job die, code 15009

2) File /var/spool/PBS/mom_logs/20040508 on node18:

13:31:17;0008;pbs_mom;Job;919.antares.mit.edu;JOIN JOB as node 17
15:04:46;0004;pbs_mom;Job;919.antares.mit.edu;polling stopped
15:04:46;0008;pbs_mom;Job;919.antares.mit.edu;kill_job

3) File /var/spool/PBS/mom_logs/20040508 on node1:

11:43:54;0008;pbs_mom;Job;790.antares.mit.edu;Started, pid = 13919
13:18:12;0008;pbs_mom;Job;844.antares.mit.edu;Started, pid = 12919
13:31:17;0008;pbs_mom;Job;919.antares.mit.edu;Started, pid = 14043
15:06:46;0008;pbs_mom;Job;919.antares.mit.edu;send POLL failed
15:06:46;0008;pbs_mom;Job;919.antares.mit.edu;node 17 (node18) requested job die, code 15009
15:06:46;0008;pbs_mom;Job;919.antares.mit.edu;kill_job
15:06:48;0080;pbs_mom;Job;919.antares.mit.edu;task 1 terminated
15:06:48;0008;pbs_mom;Job;919.antares.mit.edu;Terminated
15:06:58;0008;pbs_mom;Job;919.antares.mit.edu;kill_job
15:06:58;0100;pbs_mom;Job;919.antares.mit.edu;Obit sent

4) The error messages in the standard output files on these nodes look the same:

p67_5862: p4_error: net_recv read: probable EOF on socket: 1

However on node16, it is

p64_6016: (5813.998720) net_recv failed for fd = 3
p64_6016: p4_error: net_recv read, errno = : 104

on node4 it is

p16_6446: (5832.189857) net_recv failed for fd = 3
p16_6446: p4_error: net_recv read, errno = : 104

Thank you, and I would really appreciate your help.

Regards,
Alex

--------------------------------

_______________________________________________
Oscar-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/oscar-users
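
For anyone comparing mom_logs across nodes as in the messages above, here is a minimal sketch of pulling the "send POLL failed" events and the node that requested the job die out of such a log. It assumes the six semicolon-separated fields seen in the excerpts (timestamp, event code, daemon, object type, object id, message); that layout is inferred from the quoted lines only, not from PBS documentation. The names MomEvent, parse_mom_line and summarize_poll_failures are hypothetical helpers, not part of PBS.

```python
import re
from typing import Dict, Iterable, NamedTuple, Optional


class MomEvent(NamedTuple):
    """One pbs_mom log record, split on ';' (layout assumed from the excerpts)."""
    timestamp: str
    code: str
    daemon: str
    kind: str
    ident: str
    message: str


def parse_mom_line(line: str) -> Optional[MomEvent]:
    """Split a log line into its six fields; return None if it does not fit."""
    parts = line.strip().split(";", 5)
    if len(parts) != 6:
        return None
    return MomEvent(*parts)


# Matches messages such as "node 17 (node18) requested job die, code 15009".
DIE_RE = re.compile(r"node (\d+) \((\S+)\) requested job die, code (\d+)")


def summarize_poll_failures(lines: Iterable[str]) -> Dict[str, dict]:
    """Map each job id that logged "send POLL failed" to what was seen for it."""
    summary: Dict[str, dict] = {}
    for line in lines:
        ev = parse_mom_line(line)
        if ev is None or ev.kind != "Job":
            continue
        entry = summary.setdefault(ev.ident, {})
        if ev.message == "send POLL failed":
            entry.setdefault("poll_failed_at", ev.timestamp)
        match = DIE_RE.search(ev.message)
        if match:
            entry.setdefault("requested_by", match.group(2))
            entry.setdefault("code", int(match.group(3)))
    # Keep only the jobs that actually hit the POLL failure.
    return {job: info for job, info in summary.items() if "poll_failed_at" in info}


# Lines copied from the node01 excerpt quoted in the thread.
SAMPLE = """\
05/28/2004 13:15:51;0008;pbs_mom;Job;7798.prism;Started, pid = 5656
05/28/2004 14:56:26;0008;pbs_mom;Job;7798.prism;send POLL failed
05/28/2004 14:56:24;0008;pbs_mom;Job;7798.prism;node 1 (node11) requested job die, code 15009
05/28/2004 14:56:24;0008;pbs_mom;Job;7798.prism;kill_job
""".splitlines()

if __name__ == "__main__":
    for job, info in summarize_poll_failures(SAMPLE).items():
        print(job, info)
```

Running the same function over the logs from the head node and from the node named in "requested job die" would show which side stopped polling first. It says nothing about why the poll failed.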
