Hello,
I hope someone can help me out with this, as I've been trying to
solve it for quite some time now and I need URGENT help.
Has anyone ever had the following error:
'MPICH-G2: read failure - globus_xio: System error in read:
Connection reset by peer, state=await_format'? (Globus 4.0.1 / MPICH-G2 1.2.7)
Or does anyone know what might be the cause? For me it seems to
appear only when I try to launch more than about 300 processes - the
limit seems to settle at 337. If I try to run 338 or more, it does not
work! I've tested this with a very simple program that does nothing
more than an MPI_Init + display host name + MPI_Finalize - nothing
more. The problem seems to be somewhere in the MPI_Finalize function,
where MPI_Barrier is called. Could this be related to the collective
functions, maybe?
I tried everything - I increased the system limits (net.core.rmem_max,
net.core.wmem_max, net.ipv4.tcp_wmem, net.core.netdev_max_backlog,
tcp_syn_retries, file-max, etc.), the ssh maximum number of concurrent
connections, and so on. The master machine never runs out of memory or
CPU. It makes no difference whether I run this on a reduced number of
CPUs or on a large number (say, above 100 dual-processor machines) - I
get the same error. Moreover, there are no problems when I launch this
with fewer than 300 processes!!
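To be concrete, these are the kinds of sysctl settings I raised (the values below are only examples from my setup, not recommendations):

```conf
# /etc/sysctl.conf fragment - example values only
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.ipv4.tcp_wmem = 4096 65536 8388608
net.core.netdev_max_backlog = 2500
net.ipv4.tcp_syn_retries = 8
fs.file-max = 131072
```

None of these made any difference to the 337-process limit.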
To make this more "interesting": on the same machine I launched
three mpiruns at the same time from three different shells, each of
them with 300 processes, and it worked - none of the three reported
any error!! This has to be somewhere in the code... could anyone give
me a clue about this? Should I also send this to a different list, to
the developers, etc.?
The awkward thing is that at some point I had this running with
about 800 processes (with a different configuration, which it seems I
messed up by changing settings, packages, etc.). What am I missing? It
seems to be some kind of timeout value or an upper limit on some data
structure (?!)... In that case, how can you set a TCP/socket timeout
for mpirun/globusrun? Or is this a hardcoded timeout value? (I looked
over the GRAM code but couldn't find anything.)
Is this a BUG? I found a message in the Globus archive going back to
2003 mentioning a similar thing, but with exactly 253 (!!) processes:
http://www-unix.globus.org/mail_archive/discuss/2003/11/msg00384.html
If this is a bug, is there a patch, etc.? (I couldn't find anything.)
I would really appreciate any clue on this matter - I've spent quite
some time (months) on this, I really cannot figure it out, and it is
becoming really URGENT.
Thank you in advance for all your help!
Alexandru-Adrian TANTAR
P.S. A few more detailed error lines below:
MPICH-G2: read failure - globus_xio: System error in read: Connection
reset by peer, state=await_format
MPICH-G2: read failure - globus_xio: System error in read: Connection
reset by peer, state=await_format
MPICH-G2: ERROR: prime_the_line: connect failed
MPICH-G2: ERROR: prime_the_line: connect failed
MPICH-G2: ERROR: listen_callback rcvd result != GLOBUS_SUCCESS
MPICH-G2: ERROR: prime_the_line: connect failed
ERROR: MPID_Abort: failed remote globus_gram_client_job_cancel to job contact
https://[EMAIL PROTECTED]:32771/3762/1166815735/[1]
ERROR: MPID_Abort: failed remote globus_gram_client_job_cancel to job contact
https://[EMAIL PROTECTED]:32771/3762/1166815735/[3]
MPID_Abort: failed globus_module_activate(GLOBUS_GRAM_CLIENT_MODULE)
ERROR: MPID_Abort: failed remote globus_gram_client_job_cancel to job contact
https://[EMAIL PROTECTED]:32771/3749/1166815737/[5]
ERROR: MPID_Abort: failed remote globus_gram_client_job_cancel to job contact
... (same errors repeating multiple times for different machines)