Hello,

I hope someone can help me with this, as I've been trying to solve it for quite some time now and I need URGENT help. Has anyone ever seen the following error (Globus 4.0.1 / MPICH-G2 1.2.7)?

'MPICH-G2: read failure - globus_xio: System error in read:
Connection reset by peer, state=await_format'

Or does anyone know what the cause might be? For me it seems to appear only when I try to launch somewhere above 300 processes; the limit seems to settle at 337. If I try to run 338 or more, it does not work! I've tested this with a very simple program that does nothing more than call MPI_Init, display the host name and call MPI_Finalize. The problem seems to be somewhere in the MPI_Finalize function, where MPI_Barrier is called. Is this related to collective functions, maybe?

I tried everything I could think of: I increased the system limits (net.core.rmem_max, net.core.wmem_max, net.ipv4.tcp_wmem, net.core.netdev_max_backlog, tcp_syn_retries, file-max, etc.), the ssh maximum number of concurrent connections, and so on. The master machine never runs out of memory or CPU. It makes no difference whether I run this on a small number of CPUs or on a large number (let's say more than 100 dual-processor machines): I get the same error. Moreover, there are no problems when I launch fewer than 300 processes!!
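
For anyone trying to reproduce this, here is a minimal sketch of how the limits listed above could be checked from inside a running process, since a value raised in a shell is not necessarily the value a remotely launched process sees. This is only an illustration: the helper names (sysctl_to_proc_path, read_sysctl, fd_limits) are hypothetical and are not part of MPICH-G2 or Globus, and the code assumes a Linux system that exposes sysctl values under /proc/sys.

```c
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

/* Hypothetical helper: map a sysctl name such as "net.core.rmem_max"
 * to its Linux /proc/sys path. Returns 0 on success, -1 if the output
 * buffer is too small. */
int sysctl_to_proc_path(const char *name, char *out, size_t out_len)
{
    int n = snprintf(out, out_len, "/proc/sys/%s", name);
    if (n < 0 || (size_t)n >= out_len)
        return -1;
    /* Dots in the sysctl name become path separators. */
    for (char *p = out + strlen("/proc/sys/"); *p != '\0'; ++p)
        if (*p == '.')
            *p = '/';
    return 0;
}

/* Hypothetical helper: read the first line of a sysctl value as text
 * (some values, e.g. net.ipv4.tcp_wmem, hold several numbers).
 * Returns 0 on success, -1 if the value cannot be read. */
int read_sysctl(const char *name, char *value, size_t value_len)
{
    char path[256];
    if (value_len == 0 || sysctl_to_proc_path(name, path, sizeof path) != 0)
        return -1;
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    char *line = fgets(value, (int)value_len, f);
    fclose(f);
    if (line == NULL)
        return -1;
    value[strcspn(value, "\n")] = '\0';   /* strip trailing newline */
    return 0;
}

/* Hypothetical helper: report the per-process open-file limits
 * (soft and hard) that the calling process actually runs with.
 * Returns 0 on success, -1 on failure. */
int fd_limits(unsigned long long *soft, unsigned long long *hard)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    *soft = (unsigned long long)rl.rlim_cur;
    *hard = (unsigned long long)rl.rlim_max;
    return 0;
}
```

Calling fd_limits and, for example, read_sysctl("net.core.rmem_max", ...) at the start of the test program, on the master and on a worker node, would show whether the raised values are really in effect for processes started through the job manager.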

To make this more "interesting": on the same machine I launched three mpiruns at the same time from three different shells, each with 300 processes, and it worked. None of the three reported any error!! The problem has to be somewhere in the code... could anyone give me a clue on this? Should I also send this to a different list, to the developers, etc.?

The awkward thing is that at some point I had this running with about 800 processes (with a different configuration, which I seem to have broken by changing settings, packages, etc.). What am I missing? It looks like some kind of timeout value or an upper limit on some data structure (?!). In that case, how can one set a TCP/socket timeout for mpirun/globusrun? Or is this a hardcoded timeout value? (I looked through the GRAM code but couldn't find anything.)

Is this a BUG? I found a message from 2003 in the Globus archive that mentions a similar problem, but with exactly 253 (!!) processes:

http://www-unix.globus.org/mail_archive/discuss/2003/11/msg00384.html

If this is a bug, is there a patch for it? I couldn't find anything.

I would really appreciate any clue on this matter. I've spent quite some time (months) on this, I really cannot figure it out, and it is becoming really URGENT.

Thank you in advance for all your help!

Alexandru-Adrian TANTAR

P.S. A few more detailed error lines below:

MPICH-G2: read failure - globus_xio: System error in read: Connection reset by peer, state=await_format
MPICH-G2: read failure - globus_xio: System error in read: Connection reset by peer, state=await_format
MPICH-G2: ERROR: prime_the_line: connect failed
MPICH-G2: ERROR: prime_the_line: connect failed
MPICH-G2: ERROR: listen_callback rcvd result != GLOBUS_SUCCESS
MPICH-G2: ERROR: prime_the_line: connect failed
ERROR: MPID_Abort: failed remote globus_gram_client_job_cancel to job contact https://[EMAIL PROTECTED]:32771/3762/1166815735/[1]<
ERROR: MPID_Abort: failed remote globus_gram_client_job_cancel to job contact >https://[EMAIL PROTECTED]:32771/3762/1166815735/[3]<
MPID_Abort: failed globus_module_activate(GLOBUS_GRAM_CLIENT_MODULE)
ERROR: MPID_Abort: failed remote globus_gram_client_job_cancel to job contact https://[EMAIL PROTECTED]:32771/3749/1166815737/[5]<
ERROR: MPID_Abort: failed remote globus_gram_client_job_cancel to job contact
... (same errors repeating multiple times for different machines)


