Hello,
Thank you for your reply. I was also fairly confident that this has
something to do with a system limitation. Unfortunately I tried pretty
much everything I could think of: I increased every limit I could find
(TCP, number of processes, sockets, etc.), and nothing worked.
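
One thing that may be worth ruling out: resource limits are per-process and inherited from the parent, so a limit raised in an administrator's shell does not necessarily apply to processes started through a job launcher. The sketch below is only an illustration using Python's standard `resource` module; the helper names `read_limits` and `fmt` are made up for this example, and the choice of which limits to print is an assumption. Running it through the same launch path as the failing job would show the limits those processes actually see.

```python
import resource

# Limits most likely to matter when many processes each open sockets.
# Which limits are relevant here is an assumption, not a diagnosis.
LIMITS = {
    "open files (RLIMIT_NOFILE)": resource.RLIMIT_NOFILE,
    "processes (RLIMIT_NPROC)": resource.RLIMIT_NPROC,
}


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def read_limits():
    """Return {label: (soft, hard)} for the limits listed above."""
    return {label: resource.getrlimit(res) for label, res in LIMITS.items()}


for label, (soft, hard) in read_limits().items():
    print(f"{label}: soft={fmt(soft)} hard={fmt(hard)}")
```

If the values printed under the launcher differ from those in an interactive shell on the same machine, the raised limits are not reaching the launched processes.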
I am working inside a completely reconfigurable grid and I have full
administrative rights on the machines I'm deploying to. The problem I
mentioned appears even for a trivial MPI program consisting of
MPI_Init + MPI_Finalize (nothing in between). Furthermore, without
Globus+MPICH-G2, using other MPI distributions, there are no
problems: I can deploy up to 800 processes without trouble.
In the end I could not find the problem and had to move to
different "approaches"; at the time, I was wondering whether anyone
else was having the same problem. As I said, I have complete
administrative rights on the machines I'm using, so I could start
everything from scratch if required. So far I have tried different
Globus/MPICH-G2 versions on Fedora Core 4 and Debian. I must be
missing something.
Any help on the matter, or pointers on how to deal with it (given
the above), would be much appreciated!
I wish you a great day!
Alex
Quoting Karonis Nicholas <[EMAIL PROTECTED]>:
I've seen this now and again and we don't know, for sure, what the
problem is. My best guess is that this is triggered by some sort
of OS limit on sockets on some OSes. When MPICH-G2 starts up,
in MPI_Init(), *every* process starts with a listen() (specifying backlog=1)
on an ephemeral port (one assigned by OS) but "connect" is only done
on-demand.
A simple test would be to write a small program that only does a listen
(meticulously checking the error code for every OS call) and then perhaps
sleeps for a couple of minutes. Use Globus to launch the application,
increasing the (count=xxx) value to 337 or higher *on the same system*
where you're seeing the MPICH-G2 problem, and my best guess is that
you'll trigger the same error. If so, then that's the problem (hitting
an OS limit) and one possible solution is to see if that limit can be
increased by the sys admin.
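
The test described above could look roughly like the following. This is a minimal sketch in Python rather than MPICH-G2's actual startup code: the function names `open_listener` and `main` and the two-minute default hold time are choices made for this example. It binds to an OS-assigned port, calls listen() with backlog=1, reports any OS error by its errno name, and then sleeps.

```python
import errno
import os
import socket
import sys
import time


def open_listener(backlog=1):
    """Bind to an OS-assigned (ephemeral) port and listen on it.

    Returns (socket, port). Any failing OS call raises OSError.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("", 0))        # port 0 asks the OS to pick a port
        sock.listen(backlog)
    except OSError:
        sock.close()
        raise
    return sock, sock.getsockname()[1]


def main(hold_seconds=120):
    """Open one listener, report success or failure, then hold it open."""
    try:
        sock, port = open_listener()
    except OSError as exc:
        name = errno.errorcode.get(exc.errno, "unknown")
        print(f"pid {os.getpid()}: listen setup failed: {name}: {exc}",
              file=sys.stderr)
        return 1
    print(f"pid {os.getpid()}: listening on port {port}", flush=True)
    time.sleep(hold_seconds)      # keep the socket open while others start
    sock.close()
    return 0
```

To use it as a standalone script, end the file with `raise SystemExit(main())` and launch it with the same count value that fails for the MPICH-G2 program. Any process that prints an errno name such as EMFILE or ENOBUFS would point to a specific limit to investigate.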
Nick
On Jun 8, 2007, at 2:16 PM, [EMAIL PROTECTED] wrote:
Hello,
Reposting, maybe someone has encountered this before... On two
different clusters, one with Fedora and another one with Debian it
seems I cannot launch more than __337__ MPICH-G2 processes (using
only inside-cluster machines). It works with simple programs but
not with MPICH-G2 compiled programs (as simple as just an MPI_Init
+ MPI_Finalize, nothing more).
Is there anyone who could refute this (someone currently launching
more than 337 MPICH-G2 processes) or, alternatively, confirm it? Is
there a known cause for this? Could someone offer some clues?
Have a nice day, everyone!
Alex
----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.