Also, I wrote to someone who seemed to have the same problem with SGI
MPT, and this is what he wrote back (he did not use ParaView, but ran a
cluster of SGI Altix systems). I don't know much about MPI, so I'm still
looking into this, but I thought it might be of some assistance to the
ParaView developers when they try to work out why this problem occurs:
On Wed, Apr 27, 2011 at 02:53:49PM +1000, pratik for help wrote:
> The startup mechanism for SGI MPI jobs is quite complex and depends on
> the type of executable you are running. If you encounter errors such as
> ctrl_connect/connect: Connection refused
> or
> mpirun: MPT error (MPI_RM_sethosts): err=-1: could not run executable
> (case #3)
> contact us for an explanation.
>
> Can you please explain why such errors occur? I am running ParaView on
> an SGI Altix cluster and am getting the exact same error!
I worked in some depth on MPT while we had our Altix. Here are the
details that I remember.
During startup, mpirun will listen on a certain IP/port. It puts the
IP/port into an environment variable (MPI_ENVIRONMENT, perhaps? I
forget, it starts with MPI_* though), and then starts the worker
processes. The worker processes (actually, 1 "shepherd" process per
node) will examine $MPI_ENVIRONMENT, and then using those details,
connect back to the mpirun process. This connection is then used to
communicate job details, as well as stdin/out/err.
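(A quick way to see which MPI_* variables each shepherd/worker actually
receives is to launch a small wrapper script under mpirun in place of
pvserver and have it dump its environment before exec'ing the real
binary. This is only a sketch: the wrapper name and log location are
placeholders, and the pvserver path is the installed one mentioned
later in this thread.)

    #!/bin/sh
    # mpt-env-debug.sh (hypothetical name): record the MPI_* environment
    # this worker was started with, then run the real pvserver unchanged.
    env | grep '^MPI_' > /tmp/mpt-env.$(hostname).$$
    exec /home/pratikm/install/bin/pvserver "$@"

    # launched the same way as the real job, e.g.:
    #   mpirun -np 2 ./mpt-env-debug.sh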
The error indicates that this connection could not be made. The main
reasons are: either the $MPI_ENVIRONMENT variable hasn't been propagated
properly; or some other process has already connected to the mpirun
process (mpirun stops listening once it receives the right number of
connections), usually because some other MPI program has already
connected (e.g. if the MPI worker program is somehow run twice); or
there is a firewall or TCP/IP networking issue between the remote worker
nodes and the node running mpirun.
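(If a firewall or routing problem between the compute nodes and the
mpirun node is suspected, one rough check, sketched here with a
placeholder port number, is to find the port mpirun is listening on
while a job is stuck and try to reach it from a compute node.)

    # On the node running mpirun, while the job hangs:
    netstat -tlnp 2>/dev/null | grep mpirun     # note the listening port

    # From a compute node, test whether that port is reachable
    # (annapurna is the head node in this thread; 12345 is a placeholder):
    nc -vz annapurna 12345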
I hope that helps.
Kev
--
Dr Kevin Pulo [email protected]
Academic Consultant / Systems Programmer
www.kev.pulo.com.au
NCI NF / ANU SF
+61 2 6125 7568
On Thursday 28 April 2011 02:33 PM, pratik wrote:
Hi,
Also, can you please tell me how I can rebuild ParaView with the
*static* library of the plugin (i.e. the .a file)? Although this is a
very inelegant way to solve the problem, I just want the functionality
of the TensorGlyph plugin.
pratik
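(For what it's worth, one way to attack the static-plugin question is
to configure a build with BUILD_SHARED_LIBS turned off and then search
the generated CMakeCache.txt for a TensorGlyph-related option; the
option name guessed below varies between ParaView versions, so the grep
is the reliable part, not the name.)

    cd ~/source/ParaView/ParaView-3.10.1/STATICBUILD   # hypothetical build dir
    cmake -DBUILD_SHARED_LIBS=OFF ..                   # .. = the ParaView source tree
    grep -i tensorglyph CMakeCache.txt                 # find the plugin's cache entry
    # then enable whatever option turns up (name below is a guess) and rebuild:
    #   cmake -DPARAVIEW_BUILD_PLUGIN_TensorGlyph=ON . && make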
On Thursday 28 April 2011 02:09 PM, pratik wrote:
Hi Utkarsh,
So... do you have a hunch about what may be going on? I'm sorry if I
have been troubling you a lot, but this is really the last stage to get
PV working on the cluster: as I said before, the build with
BUILD_SHARED_LIBS off worked perfectly, but the one with that option on
did not...
The thing that bothers me is that it is definitely not something wrong
with SGI MPT, since one build of pvserver is working fine. Having come
this far, it is driving me crazy that it is still not able to work :(
If you need any more information please do let me know. Once again,
thanks for all the help.
pratik
On Wednesday 27 April 2011 08:35 PM, pratik wrote:
I think it is:

pratikm@annapurna:~/source/ParaView/ParaView-3.10.1/NEWBUILD/bin> ldd /home/pratikm/source/ParaView/ParaView-3.10.1/BUILD/bin/pvserver | grep mp
    libmpi++abi1002.so => /opt/sgi/mpt/mpt-1.23/lib/libmpi++abi1002.so (0x00002b61473a3000)
    libmpi.so => /opt/sgi/mpt/mpt-1.23/lib/libmpi.so (0x00002b61474d0000)
    libsma.so => /opt/sgi/mpt/mpt-1.23/lib/libsma.so (0x00002b6147854000)
    libxmpi.so => /opt/sgi/mpt/mpt-1.23/lib/libxmpi.so (0x00002b614ed46000)
    libimf.so => /opt/intel/Compiler/11.1/038/lib/intel64/libimf.so (0x00002b6153a16000)
    libsvml.so => /opt/intel/Compiler/11.1/038/lib/intel64/libsvml.so (0x00002b6153d69000)
    libintlc.so.5 => /opt/intel/Compiler/11.1/038/lib/intel64/libintlc.so.5 (0x00002b6153f80000)

pratikm@annapurna:~/source/ParaView/ParaView-3.10.1/NEWBUILD/bin> ldd /home/pratikm/source/ParaView/ParaView-3.10.1/NEWBUILD/bin/pvserver | grep mp
    libmpi++abi1002.so => /opt/sgi/mpt/mpt-1.23/lib/libmpi++abi1002.so (0x00002ac9ae446000)
    libmpi.so => /opt/sgi/mpt/mpt-1.23/lib/libmpi.so (0x00002ac9ae573000)
    libsma.so => /opt/sgi/mpt/mpt-1.23/lib/libsma.so (0x00002ac9ae8f7000)
    libxmpi.so => /opt/sgi/mpt/mpt-1.23/lib/libxmpi.so (0x00002ac9aee0f000)

pratikm@annapurna:~/source/ParaView/ParaView-3.10.1/NEWBUILD/bin> ldd /home/pratikm/install/bin/pvserver | grep mp
    libmpi.so => /usr/lib64/libmpi.so (0x00002b0a9c9e3000)
These are precisely the libraries I specified: the first is the
pvserver with shared libs enabled, the second is the one with shared
libs disabled, and the last is the "installed" pvserver (the installed
version of the pvserver with shared libs enabled).
Again, the last one is the "installed" pvserver; I am not quite sure why
the path has changed, but I am 90% sure that /usr/lib64/libmpi.so refers
to the same SGI MPI library.
pratik
On Wednesday 27 April 2011 07:26 PM, Utkarsh Ayachit wrote:
Do a "pvserver --ldd", is it using the correct mpi libraries?
Utkarsh
On Wed, Apr 27, 2011 at 8:43 AM, pratik <[email protected]> wrote:
Also, I tried to start the pvserver (with shared libraries enabled) on
just the head node:

pratikm@annapurna:~/install/bin> /usr/bin/mpirun -v -np 2 /home/pratikm/install/bin/pvserver
MPI: libxmpi.so 'SGI MPT 1.23 03/28/09 11:45:59'
MPI: libmpi.so 'SGI MPT 1.23 03/28/09 11:43:39'

and it just hangs there!
pratik
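(When it hangs like that, one way to see what the processes are blocked
on is sketched below; <pid> is a placeholder for a PID reported by
pgrep.)

    pgrep -f pvserver                       # find the stuck PIDs
    cat /proc/<pid>/wchan; echo             # kernel function the process sleeps in
    strace -f -p <pid> -e trace=network     # any connect()/accept() still pending?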
On Wednesday 27 April 2011 06:05 PM, pratik wrote:
Oh! I'm sorry about that...
The client stalls indefinitely, but the server stops executing. Since I
am running PV using PBS, the output file of the mpirun gives this:
MPI: libxmpi.so 'SGI MPT 1.23 03/28/09 11:45:59'
MPI: libmpi.so 'SGI MPT 1.23 03/28/09 11:43:39'
MPI Environmental Settings
MPI: MPI_DSM_DISTRIBUTE (default: not set) : 1
ctrl_connect/connect: Connection refused
ctrl_connect/connect: Connection refused
ctrl_connect/connect: Connection refused
ctrl_connect/connect: Connection refused
ctrl_connect/connect: Connection refused
ctrl_connect/connect: Connection refused
ctrl_connect/connect: Connection refused
ctrl_connect/connect: Connection refused
MPI: MPI_COMM_WORLD rank 2 has terminated without calling MPI_Finalize()
MPI: aborting job
Attached is the CMakeCache of my server if you want to look at it.
pratik
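(For context, the PBS job presumably looks something like the sketch
below; the resource-request line and the client host name are
placeholders, and the pvserver options are the usual reverse-connection
ones.)

    #!/bin/sh
    #PBS -l nodes=1:ppn=4        # placeholder; depends on the PBS flavour and queue
    #PBS -j oe
    cd $PBS_O_WORKDIR
    # pvserver connects back to the ParaView client on the laptop
    # (my.laptop.example and the port are placeholders):
    mpirun -np 4 /home/pratikm/install/bin/pvserver \
        --reverse-connection --client-host=my.laptop.example --server-port=11111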
On Wednesday 27 April 2011 05:55 PM, Utkarsh Ayachit wrote:
You need to be more specific about the "something" that's going wrong
before anyone can provide any additional information.
Utkarsh
On Wed, Apr 27, 2011 at 3:26 AM, pratik <[email protected]> wrote:
Hi,
I built 2 versions of PV on the SGI Altix cluster here (SGI MPT MPI)...
one with BUILD_SHARED_LIBS enabled and one without. Now, the static
pvserver functions properly (I am accessing it from my laptop via the
reverse connection method) BUT the one with shared libs enabled does
not! Can this behaviour be explained? (The second one fails to establish
a connection... something wrong with pvserver.)
I have EXACTLY the same CMakeCache on both builds EXCEPT for the
BUILD_SHARED_LIBS option.
I know that there are many, many things that could go wrong in a cluster
installation, so any hints/experience/hunches as to what is going on are
welcome.
pratik
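(A quick way to confirm that the two caches really differ only in
BUILD_SHARED_LIBS is to diff them with CMake's comment lines stripped;
BUILD and NEWBUILD are the build directories shown in the ldd output
above, and a handful of path entries will of course differ as well.)

    # needs bash for the <(...) process substitution
    cd ~/source/ParaView/ParaView-3.10.1
    diff <(grep -v '^[/#]' BUILD/CMakeCache.txt | sort) \
         <(grep -v '^[/#]' NEWBUILD/CMakeCache.txt | sort)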
_______________________________________________
Powered by www.kitware.com
Visit other Kitware open-source projects at
http://www.kitware.com/opensource/opensource.html
Please keep messages on-topic and check the ParaView Wiki at:
http://paraview.org/Wiki/ParaView
Follow this link to subscribe/unsubscribe:
http://www.paraview.org/mailman/listinfo/paraview