Good catch. Indeed, the root node was on prism120, another node in the batch pool. When I tunneled to that host instead of the other, I got a good connection with 2 servers using MPI. Just to be sure, is there a way to query the state of the connection from within the client? I cannot tell from the GUI or the server output whether I am connected to 2 servers or 1. I am certain I launched two servers and got a good connection, and I can view a molecule, but... I'm paranoid. You never know. :-)
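For reference, here is a minimal sketch of how one might check which host ends up running rank 0, and whether the tunnel endpoint is reachable from it, before launching pvserver. It is an illustration only and not part of ParaView. It assumes Open MPI's mpiexec exports OMPI_COMM_WORLD_RANK to each process, and the function names are made up for the example. The host (localhost) and port (11111) are the ones from the pvserver output in this thread.

```python
import os
import socket


def mpi_rank(env=None):
    """Return this process's MPI rank, or None when no rank is exported.

    Assumption: the launcher is Open MPI's mpiexec, which sets
    OMPI_COMM_WORLD_RANK in the environment of every process.
    """
    env = os.environ if env is None else env
    value = env.get("OMPI_COMM_WORLD_RANK")
    return int(value) if value is not None else None


def port_accepts_connections(host, port, timeout=2.0):
    """Return True if a TCP connect to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts and name-resolution failures.
        return False


def report(host="localhost", port=11111, env=None):
    """Build a one-line report; only rank 0 (or a serial run) probes the port."""
    rank = mpi_rank(env)
    line = "host=%s rank=%s" % (socket.gethostname(), rank)
    if rank in (0, None):
        ok = port_accepts_connections(host, port)
        line += " %s:%d=%s" % (host, port, "open" if ok else "refused or unreachable")
    return line


if __name__ == "__main__":
    print(report())
```

Launched the same way as pvserver (for example: mpiexec -np 2 python rank_probe.py, where rank_probe.py is a hypothetical file name), each process prints one line. The line with rank=0 names the host where the ssh tunnel has to end, and shows whether anything is listening on localhost:11111 there.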
This is going to be nasty to get working for our users.
Thanks for the help!
-- Rich

On Nov 11, 2011, at 4:57 PM, Utkarsh Ayachit wrote:

> Very peculiar. I wonder if MPI is running the root node on some other
> node. Are you sure the process is run on the same machine? Can you
> try putting an IP address or real hostname instead of localhost?
>
> Utkarsh
>
> On Fri, Nov 11, 2011 at 7:54 PM, Cook, Rich <[email protected]> wrote:
>> And to clarify, if I just do serial, I get this good behavior:
>>
>> rcook@prism127 (~):
>> /usr/global/tools/Kitware/Paraview/3.12.0-OSMesa/chaos_4_x86_64_ib/bin/pvserver
>> --use-offscreen-rendering --reverse-connection --client-host=localhost
>> Waiting for client
>> Connection URL: csrc://localhost:11111
>> Client connected.
>>
>> On Nov 11, 2011, at 4:51 PM, Cook, Rich wrote:
>>
>>> My bad.
>>> In the first email I sent, I was using the wrong MPI (srun instead of
>>> mpiexec -- mvapich instead of openmpi). So both processes were indeed
>>> getting set to the same process ID. Please ignore that output.
>>> The current output looks like this:
>>>
>>> rcook@prism127 (~): mpiexec -np 2
>>> /usr/global/tools/Kitware/Paraview/3.12.0-OSMesa/chaos_4_x86_64_ib/bin/pvserver
>>> --use-offscreen-rendering --reverse-connection --client-host=localhost
>>> Waiting for client
>>> Connection URL: csrc://localhost:11111
>>> ERROR: In
>>> /nfs/tmp2/rcook/ParaView/3.12.0/ParaView-3.12.0/VTK/Common/vtkSocket.cxx,
>>> line 481
>>> vtkClientSocket (0xe6a060): Socket error in call to connect. Connection
>>> refused.
>>>
>>> ERROR: In
>>> /nfs/tmp2/rcook/ParaView/3.12.0/ParaView-3.12.0/VTK/Common/vtkClientSocket.cxx,
>>> line 53
>>> vtkClientSocket (0xe6a060): Failed to connect to server localhost:11111
>>>
>>> Warning: In
>>> /nfs/tmp2/rcook/ParaView/3.12.0/ParaView-3.12.0/ParaViewCore/ClientServerCore/vtkTCPNetworkAccessManager.cxx,
>>> line 250
>>> vtkTCPNetworkAccessManager (0x8356f0): Connect failed. Retrying for
>>> 59.9993 more seconds.
>>>
>>> ERROR: In
>>> /nfs/tmp2/rcook/ParaView/3.12.0/ParaView-3.12.0/VTK/Common/vtkSocket.cxx,
>>> line 481
>>> vtkClientSocket (0xe6a060): Socket error in call to connect. Connection
>>> refused.
>>>
>>> ERROR: In
>>> /nfs/tmp2/rcook/ParaView/3.12.0/ParaView-3.12.0/VTK/Common/vtkClientSocket.cxx,
>>> line 53
>>> vtkClientSocket (0xe6a060): Failed to connect to server localhost:11111
>>>
>>> Warning: In
>>> /nfs/tmp2/rcook/ParaView/3.12.0/ParaView-3.12.0/ParaViewCore/ClientServerCore/vtkTCPNetworkAccessManager.cxx,
>>> line 250
>>> vtkTCPNetworkAccessManager (0x8356f0): Connect failed. Retrying for
>>> 58.9972 more seconds.
>>>
>>> mpiexec: killing job...
>>>
>>> Note the presence of only one connecting message. Again, I apologize for
>>> the mixup. I spoke with our MPI guru and have confirmed that MPI appears
>>> to be working correctly and I'm not making a mistake in how I launch
>>> pvserver from the batch job perspective.
>>>
>>> Do you still want that output?
>>>
>>> On Nov 11, 2011, at 4:44 PM, Utkarsh Ayachit wrote:
>>>
>>>> That sounds very odd. If the process_id variable is indeed correctly set
>>>> to 0 and 1 on the two processes, then how come there are two "Waiting
>>>> for client" lines printed out in the first email that you sent?
>>>>
>>>> Can you change the cout on that line to the following to verify that
>>>> both processes are indeed printing out from the same line?
>>>>
>>>> cout << __LINE__ << " : Waiting for client" << endl;
>>>>
>>>> (This is in pvserver_common.h: 58)
>>>>
>>>> Utkarsh
>>>>
>>>> On Fri, Nov 11, 2011 at 6:30 PM, Cook, Rich <[email protected]> wrote:
>>>>> I posted the CMakeCache.txt. I have also tried to step through the code
>>>>> using TotalView and I can see it calling MPI_Init() etc. It looks like
>>>>> one process correctly gets rank 0 and one gets rank 1 (by inspecting
>>>>> process_id variable in RealMain())
>>>>> If I start in serial, it connects and I can view a protein molecule
>>>>> successfully.
>>>>> If I start in parallel, exactly one server tries and fails
>>>>> to connect. Am I supposed to give any extra arguments when starting in
>>>>> parallel?
>>>>> This is what I'm doing:
>>>>>
>>>>> mpiexec -np 2
>>>>> /usr/global/tools/Kitware/Paraview/3.12.0-OSMesa/chaos_4_x86_64_ib/bin/pvserver
>>>>> --use-offscreen-rendering --reverse-connection --client-host=localhost
>>>>>
>>>>> On Nov 11, 2011, at 11:11 AM, Utkarsh Ayachit wrote:
>>>>>
>>>>>> Can you post your CMakeCache.txt?
>>>>>>
>>>>>> Utkarsh
>>>>>>
>>>>>> On Fri, Nov 11, 2011 at 2:08 PM, Cook, Rich <[email protected]> wrote:
>>>>>>> Hi, thanks, but you are incorrect.
>>>>>>> I did set that variable and it was indeed compiled with MPI, as I said.
>>>>>>>
>>>>>>> rcook@prism127 (IMG_private): type pvserver
>>>>>>> pvserver is
>>>>>>> /usr/global/tools/Kitware/Paraview/3.12.0-OSMesa/chaos_4_x86_64_ib/bin/pvserver
>>>>>>> rcook@prism127 (IMG_private): ldd
>>>>>>> /usr/global/tools/Kitware/Paraview/3.12.0-OSMesa/chaos_4_x86_64_ib/bin/pvserver
>>>>>>> libmpi.so.0 => /usr/local/tools/openmpi-gnu-1.4.3/lib/libmpi.so.0
>>>>>>> (0x00002aaaaacc9000)
>>>>>>> libopen-rte.so.0 =>
>>>>>>> /usr/local/tools/openmpi-gnu-1.4.3/lib/libopen-rte.so.0
>>>>>>> (0x00002aaaaaf6c000)
>>>>>>> libopen-pal.so.0 =>
>>>>>>> /usr/local/tools/openmpi-gnu-1.4.3/lib/libopen-pal.so.0
>>>>>>> (0x00002aaaab1b7000)
>>>>>>> libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaab434000)
>>>>>>> libnsl.so.1 => /lib64/libnsl.so.1 (0x00002aaaab638000)
>>>>>>> libutil.so.1 => /lib64/libutil.so.1 (0x00002aaaab850000)
>>>>>>> libm.so.6 => /lib64/libm.so.6 (0x00002aaaaba54000)
>>>>>>> libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaabcd7000)
>>>>>>> libc.so.6 => /lib64/libc.so.6 (0x00002aaaabef2000)
>>>>>>> /lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)
>>>>>>>
>>>>>>> When the pvservers are running, I can see that they are the correct
>>>>>>> binaries, and ldd confirms they are MPI-capable.
>>>>>>>
>>>>>>> rcook@prism120 (~): ldd
>>>>>>> /collab/usr/global/tools/Kitware/Paraview/3.12.0-OSMesa/chaos_4_x86_64_ib/lib/paraview-3.12/pvserver
>>>>>>> | grep mpi
>>>>>>> libmpi_cxx.so.0 =>
>>>>>>> /usr/local/tools/openmpi-gnu-1.4.3/lib/libmpi_cxx.so.0
>>>>>>> (0x00002aaab23bf000)
>>>>>>> libmpi.so.0 => /usr/local/tools/openmpi-gnu-1.4.3/lib/libmpi.so.0
>>>>>>> (0x00002aaab25da000)
>>>>>>> libopen-rte.so.0 =>
>>>>>>> /usr/local/tools/openmpi-gnu-1.4.3/lib/libopen-rte.so.0
>>>>>>> (0x00002aaab287d000)
>>>>>>> libopen-pal.so.0 =>
>>>>>>> /usr/local/tools/openmpi-gnu-1.4.3/lib/libopen-pal.so.0
>>>>>>> (0x00002aaab2ac7000)
>>>>>>>
>>>>>>> On Nov 11, 2011, at 11:04 AM, Utkarsh Ayachit wrote:
>>>>>>>
>>>>>>>> Your pvserver is not built with MPI enabled. Please rebuild pvserver
>>>>>>>> with CMake variable PARAVIEW_USE_MPI:BOOL=ON.
>>>>>>>>
>>>>>>>> Utkarsh
>>>>>>>>
>>>>>>>> On Fri, Nov 11, 2011 at 1:54 PM, Cook, Rich <[email protected]> wrote:
>>>>>>>>> We have a tricky firewall situation here so I have to use reverse
>>>>>>>>> tunneling per
>>>>>>>>> http://www.paraview.org/Wiki/Reverse_connection_and_port_forwarding#Reverse_Connection_Over_an_ssh_Tunnel
>>>>>>>>>
>>>>>>>>> I'm not sure I'm doing it right. I can do it with a single server,
>>>>>>>>> but when I try to run in parallel, it looks like something is broken.
>>>>>>>>> My understanding is that when launched under MPI, the servers should
>>>>>>>>> talk to each other and only one of the servers should try to connect
>>>>>>>>> back to the client. I compiled with MPI, and am running in an MPI
>>>>>>>>> environment, but it looks as though the pvservers are not talking to
>>>>>>>>> each other but are each trying to make their own connection to the
>>>>>>>>> client. Below is the output. Can anyone help me get this up and
>>>>>>>>> running? I know I'm close.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> rcook@prism127 (IMG_private): srun -n 8
>>>>>>>>> /usr/global/tools/Kitware/Paraview/3.12.0-OSMesa/chaos_4_x86_64_ib/bin/pvserver
>>>>>>>>> --use-offscreen-rendering --reverse-connection
>>>>>>>>> --client-host=localhost
>>>>>>>>> Waiting for client
>>>>>>>>> Connection URL: csrc://localhost:11111
>>>>>>>>> Client connected.
>>>>>>>>> Waiting for client
>>>>>>>>> Connection URL: csrc://localhost:11111
>>>>>>>>> ERROR: In
>>>>>>>>> /nfs/tmp2/rcook/ParaView/3.12.0/ParaView-3.12.0/VTK/Common/vtkSocket.cxx,
>>>>>>>>> line 481
>>>>>>>>> vtkClientSocket (0xd8ee20): Socket error in call to connect.
>>>>>>>>> Connection refused.
>>>>>>>>>
>>>>>>>>> ERROR: In
>>>>>>>>> /nfs/tmp2/rcook/ParaView/3.12.0/ParaView-3.12.0/VTK/Common/vtkClientSocket.cxx,
>>>>>>>>> line 53
>>>>>>>>> vtkClientSocket (0xd8ee20): Failed to connect to server
>>>>>>>>> localhost:11111
>>>>>>>>>
>>>>>>>>> Warning: In
>>>>>>>>> /nfs/tmp2/rcook/ParaView/3.12.0/ParaView-3.12.0/ParaViewCore/ClientServerCore/vtkTCPNetworkAccessManager.cxx,
>>>>>>>>> line 250
>>>>>>>>> vtkTCPNetworkAccessManager (0x6619a0): Connect failed. Retrying for
>>>>>>>>> 59.9994 more seconds.
>>>>>>>>>
>>>>>>>>> ERROR: In
>>>>>>>>> /nfs/tmp2/rcook/ParaView/3.12.0/ParaView-3.12.0/VTK/Common/vtkSocket.cxx,
>>>>>>>>> line 481
>>>>>>>>> vtkClientSocket (0xd8ee20): Socket error in call to connect.
>>>>>>>>> Connection refused.
>>>>>>>>>
>>>>>>>>> ERROR: In
>>>>>>>>> /nfs/tmp2/rcook/ParaView/3.12.0/ParaView-3.12.0/VTK/Common/vtkClientSocket.cxx,
>>>>>>>>> line 53
>>>>>>>>> vtkClientSocket (0xd8ee20): Failed to connect to server
>>>>>>>>> localhost:11111
>>>>>>>>>
>>>>>>>>> Warning: In
>>>>>>>>> /nfs/tmp2/rcook/ParaView/3.12.0/ParaView-3.12.0/ParaViewCore/ClientServerCore/vtkTCPNetworkAccessManager.cxx,
>>>>>>>>> line 250
>>>>>>>>> vtkTCPNetworkAccessManager (0x6619a0): Connect failed. Retrying for
>>>>>>>>> 58.9972 more seconds.
>>>>>>>>>
>>>>>>>>> ERROR: In
>>>>>>>>> /nfs/tmp2/rcook/ParaView/3.12.0/ParaView-3.12.0/VTK/Common/vtkSocket.cxx,
>>>>>>>>> line 481
>>>>>>>>> vtkClientSocket (0xd8ee20): Socket error in call to connect.
>>>>>>>>> Connection refused.
>>>>>>>>>
>>>>>>>>> ERROR: In
>>>>>>>>> /nfs/tmp2/rcook/ParaView/3.12.0/ParaView-3.12.0/VTK/Common/vtkClientSocket.cxx,
>>>>>>>>> line 53
>>>>>>>>> vtkClientSocket (0xd8ee20): Failed to connect to server
>>>>>>>>> localhost:11111
>>>>>>>>>
>>>>>>>>> Warning: In
>>>>>>>>> /nfs/tmp2/rcook/ParaView/3.12.0/ParaView-3.12.0/ParaViewCore/ClientServerCore/vtkTCPNetworkAccessManager.cxx,
>>>>>>>>> line 250
>>>>>>>>> vtkTCPNetworkAccessManager (0x6619a0): Connect failed. Retrying for
>>>>>>>>> 57.9952 more seconds.
>>>>>>>>>
>>>>>>>>> etc. etc. etc.
>>>>>>>>> --
>>>>>>>>> ✐Richard Cook
>>>>>>>>> ✇ Lawrence Livermore National Laboratory
>>>>>>>>> Bldg-453 Rm-4024, Mail Stop L-557
>>>>>>>>> 7000 East Avenue, Livermore, CA, 94550, USA
>>>>>>>>> ☎ (office) (925) 423-9605
>>>>>>>>> ☎ (fax) (925) 423-6961
>>>>>>>>> ---
>>>>>>>>> Information Management & Graphics Grp., Services & Development Div.,
>>>>>>>>> Integrated Computing & Communications Dept.
>>>>>>>>> (opinions expressed herein are mine and not those of LLNL)
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Powered by www.kitware.com
>>>>>>>>>
>>>>>>>>> Visit other Kitware open-source projects at
>>>>>>>>> http://www.kitware.com/opensource/opensource.html
>>>>>>>>>
>>>>>>>>> Please keep messages on-topic and check the ParaView Wiki at:
>>>>>>>>> http://paraview.org/Wiki/ParaView
>>>>>>>>>
>>>>>>>>> Follow this link to subscribe/unsubscribe:
>>>>>>>>> http://www.paraview.org/mailman/listinfo/paraview

--
✐Richard Cook
✇ Lawrence Livermore National Laboratory
Bldg-453 Rm-4024, Mail Stop L-557
7000 East Avenue, Livermore, CA, 94550, USA
☎ (office) (925) 423-9605
☎ (fax) (925) 423-6961
---
Information Management & Graphics Grp., Services & Development Div.,
Integrated Computing & Communications Dept.
(opinions expressed herein are mine and not those of LLNL)

_______________________________________________
Powered by www.kitware.com

Visit other Kitware open-source projects at
http://www.kitware.com/opensource/opensource.html

Please keep messages on-topic and check the ParaView Wiki at:
http://paraview.org/Wiki/ParaView

Follow this link to subscribe/unsubscribe:
http://www.paraview.org/mailman/listinfo/paraview
