Hello,

Thank you for your reply. I was also fairly confident that this had something to do with a system limitation. Unfortunately, I tried pretty much everything I could think of: I increased every limit I could find (TCP, number of processes, sockets, etc.), and nothing worked.
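
Since resource limits are inherited from the parent process, one thing that may be worth double-checking is which limits a process actually sees when it is started through the job launcher rather than from an interactive shell. The following is only a minimal sketch, assuming Linux; the names print_limit, print_all_limits and LIMITS_MAIN are made up for illustration, and the limits shown are just two plausible candidates, not a known cause of the problem.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Print the soft and hard values of one resource limit as seen by this
 * process. Returns 0 on success, -1 if getrlimit() fails. */
int print_limit(const char *name, int resource)
{
    struct rlimit rl;

    assert(name != NULL);
    if (getrlimit(resource, &rl) != 0) {
        perror(name);
        return -1;
    }

    printf("%-14s soft=", name);
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("unlimited");
    else
        printf("%llu", (unsigned long long)rl.rlim_cur);

    printf(" hard=");
    if (rl.rlim_max == RLIM_INFINITY)
        printf("unlimited\n");
    else
        printf("%llu\n", (unsigned long long)rl.rlim_max);

    return 0;
}

/* Print the limits most likely to matter when many processes and sockets
 * are created on one machine (an assumption, not a confirmed diagnosis). */
int print_all_limits(void)
{
    int rc = 0;

    /* Maximum number of open file descriptors (sockets count as fds). */
    if (print_limit("RLIMIT_NOFILE", RLIMIT_NOFILE) != 0)
        rc = -1;
#ifdef RLIMIT_NPROC
    /* Maximum number of processes for the user (not in POSIX). */
    if (print_limit("RLIMIT_NPROC", RLIMIT_NPROC) != 0)
        rc = -1;
#endif
    return rc;
}

#ifdef LIMITS_MAIN
int main(void)
{
    return print_all_limits() == 0 ? 0 : 1;
}
#endif
```

Compiled with -DLIMITS_MAIN, this gives a small standalone program; launching it the same way as the failing job and comparing its output with a run from a login shell would show whether the raised limits really reach the launched processes.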

I am working inside a completely reconfigurable grid and I have full administrative rights on the machines I'm deploying on. The problem I mentioned appears even for a trivial MPI program consisting of MPI_Init + MPI_Finalize (nothing in between). Furthermore, without Globus + MPICH-G2, using other MPI distributions, I can deploy up to 800 processes with no problems.

In conclusion, I could not find the problem and had to move on to different "approaches"; at the time, I was wondering whether anyone else was having the same problem. As I said, I have complete administrative rights on the machines I'm using, so I could start everything from scratch if required. So far I have tried different Globus/MPICH-G2 versions on Fedora Core 4 and Debian. I must be missing something.

Any help on the matter, or pointers on how to deal with it given the above, would be much appreciated!

I wish you a great day!

Alex

  Quoting Karonis Nicholas <[EMAIL PROTECTED]>:

I've seen this now and again and we don't know, for sure, what the
problem is.  My best guess is that this is triggered by some sort
of OS limit on sockets on some OSes.  When MPICH-G2 starts up,
in MPI_Init(), *every* process starts with a listen() (specifying backlog=1)
on an ephemeral port (one assigned by OS) but "connect" is only done
on-demand.

A simple test would be to write a small program that only does a listen
(meticulously checking the error code for every OS call) and then perhaps
sleeps for a couple of minutes.  Use Globus to launch the application
increasing the (count=xxx) value to 337 or higher *on the same system*
where you're seeing the MPICH-G2 problem, and my best guess is that you'll
trigger the same error.  If so, then that's the problem (hitting an
OS limit) and one possible solution is to see if that limit can be
increased by the sys admin.

Nick
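
The listen-only test described above could be sketched in C roughly as follows. This is an illustration under stated assumptions, not MPICH-G2's actual startup code: it assumes a plain IPv4 TCP socket, binds to port 0 so the OS assigns an ephemeral port, uses backlog=1 as mentioned above, and checks the result of every call. The names open_listener and LISTEN_TEST_MAIN are hypothetical.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP socket, bind it to an OS-assigned (ephemeral) port and
 * listen on it with the given backlog. Returns the socket fd, or -1 after
 * reporting which call failed. On success *port_out holds the port. */
int open_listener(int backlog, unsigned short *port_out)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd;

    assert(port_out != NULL);

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        fprintf(stderr, "socket: %s\n", strerror(errno));
        return -1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(0);           /* 0 = let the OS pick a port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        fprintf(stderr, "bind: %s\n", strerror(errno));
        close(fd);
        return -1;
    }
    if (listen(fd, backlog) < 0) {
        fprintf(stderr, "listen: %s\n", strerror(errno));
        close(fd);
        return -1;
    }
    /* Ask the OS which port it actually assigned. */
    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        fprintf(stderr, "getsockname: %s\n", strerror(errno));
        close(fd);
        return -1;
    }

    *port_out = ntohs(addr.sin_port);
    return fd;
}

#ifdef LISTEN_TEST_MAIN
int main(void)
{
    unsigned short port;
    int fd = open_listener(1, &port);   /* backlog=1, as described above */

    if (fd < 0)
        return 1;
    printf("pid %ld listening on port %u\n", (long)getpid(), (unsigned)port);
    fflush(stdout);
    sleep(120);                         /* hold the socket for two minutes */
    close(fd);
    return 0;
}
#endif
```

Compiled with -DLISTEN_TEST_MAIN and launched through Globus with (count=337) or higher on the affected system, any failing call would be reported on stderr together with the errno text, which should help tell an OS limit apart from some other cause.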

On Jun 8, 2007, at 2:16 PM, [EMAIL PROTECTED] wrote:

Hello,

Reposting in case someone has encountered this before. On two different clusters, one running Fedora and the other Debian, it seems I cannot launch more than __337__ MPICH-G2 processes (using only machines inside the cluster). Launching works with simple programs, but not with programs compiled with MPICH-G2 (even one as simple as MPI_Init + MPI_Finalize, nothing more).

Is there anyone who could refute this (someone currently launching more than 337 MPICH-G2 processes) or, alternatively, confirm it? Is there a known cause for this? Could someone offer some clues?

Have a nice day, everyone!

Alex



----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.


