Re: [Paraview] [ParaView3-Developer] Locking (MPI?) problems with ParaView 3.5
John,

Here's my understanding of what may be happening:

* pvclient has done a gather-data-information request and is waiting to receive that data.
* pvserver0 has received the gather data request and is asking all satellites to gather data.
* pvserver1 also received that gather data request, but before it starts processing it, pvserver1 needs to pass the message on to its child, i.e. process 3. (This is the fancy tree-based message communication we are now using: instead of the root node sending the message to all processes one after the other, it uses a tree structure.)
* pvserver2, I believe, is happily waiting for the next message. It already processed the gather data request, sent the response to the root (although the root has not processed that response yet), and is now waiting for the next request.
* pvserver3 is the rogue. It's stuck in Cancel, hence it won't receive any messages others send to it. This Cancel was not done as part of this gather data request but at the end of some previous request. Thus pvserver3 is not yet able to receive the gather data request pvserver1 is trying to send, consequently freezing pvserver1.

Of course, this is just one possible explanation. I've attached a patch; can you please test whether it overcomes the issue? If not, you may also want to try commenting the Cancel out altogether. That will at the least help us narrow down the problem.

Utkarsh

On Tue, Feb 10, 2009 at 2:56 AM, John Biddiscombe biddi...@cscs.ch wrote:

I've been having problems with ParaView 3.5 (CVS) since I started working on a parallel problem last week (which means the issue may have existed for some time while I was doing other things).

How to reproduce the problem:
Start pvserver on 4 (or more) nodes and pvclient on 1.
Create Sources/Wavelet.
ParaView hangs indefinitely without showing the bounding box of the cube.
Reproducibility: not always (3 out of 4 times, for me). Sometimes ParaView completes as usual, but shows other hanging symptoms later on.
ParaView 3.4 does not have this problem. Something seems to have changed between 3.4 and 3.5 which is affecting the client/server delivery of data. It appears to be an MPI-related issue. I have attached the debugger to all pvserver nodes and the stack traces are shown below.

To summarize:
pvserver 3 : cancelled a request
pvserver 2 : waiting in ReceiveDataInternal
pvserver 1 : still in vtkMPICommunicatorSendDatachar
pvserver 0 : still in vtkMPICommunicatorSendDatachar
pvclient   : trying to Receive

It seems that one of the pvserver nodes has given up sending, or has nothing to send, and the others are waiting without end.

Note 1: It took me time to go from one node to the other and attach gdb, so if I stopped one task, another might have been affected before it had its stack trace dumped. I'm not very expert at gdb.
Note 2: ParaView 3.4 seems to work flawlessly. Something is different in 3.5.
Note 3: I ran b_eff_io (https://fs.hlrs.de/projects/par/mpi//b_eff_io/ - NB. link unreliable) on 15 nodes and it completed without error and without any unusual behaviour, so my instinct is that our HP-MPI is working ok.

Questions:
1) Has anyone else observed this behaviour?
2) Has anyone changed anything in the MPI communicator code which might have caused this behaviour?
3) It's possible we have a network problem and this is causing the locking, but our logs do not show any errors - can we rule this out?
4) If anyone has answered yes to 1/2, do they know what's wrong, and can they fix it?

I welcome help, as this is preventing me finishing my current project.
JB

Stack trace for pvserver 3
==
#0  0x003dbd4afba9 in sched_yield () from /lib64/tls/libc.so.6
#1  0x002a9a725a97 in hpmp_yield () from /opt/hpmpi/lib/linux_amd64/libmpi.so.1
#2  0x002a9a71f3b4 in hpmp_adv () from /opt/hpmpi/lib/linux_amd64/libmpi.so.1
#3  0x002a9a757012 in hpmp_waitany () from /opt/hpmpi/lib/linux_amd64/libmpi.so.1
#4  0x002a9a756cf0 in VMPI_Waitany () from /opt/hpmpi/lib/linux_amd64/libmpi.so.1
#5  0x002a9a753a9e in VMPI_Wait () from /opt/hpmpi/lib/linux_amd64/libmpi.so.1
#6  0x002a9a754520 in VMPI_Cancel () from /opt/hpmpi/lib/linux_amd64/libmpi.so.1
#7  0x002a99e27045 in vtkMPICommunicator::Request::Cancel (this=0xacf740) at /users/biddisco/code/pv-meshless/VTK/Parallel/vtkMPICommunicator.cxx:973
#8  0x002a95639e36 in vtkPVProgressHandler::CleanupSatellites (this=0xae8dd0) at /users/biddisco/code/pv-meshless/Servers/Common/vtkPVProgressHandler.cxx:399
#9  0x002a95639c85 in vtkPVProgressHandler::CleanupPendingProgress (this=0xae8dd0) at /users/biddisco/code/pv-meshless/Servers/Common/vtkPVProgressHandler.cxx:337
#10 0x002a956061ce in vtkProcessModule::CleanupPendingProgress (this=0x590ee0) at /users/biddisco/code/pv-meshless/Servers/Common/vtkProcessModule.cxx:1267
#11 0x002a965b22f7 in vtkProcessModuleCommand (arlu=0x591d20, ob=0x590ee0,
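[Editor's note] The tree-based delivery Utkarsh describes (rank 1 forwarding the root's request on to process 3) is consistent with a simple binary tree layout. Below is a minimal simulation, assuming a binary tree in which rank r forwards to ranks 2r+1 and 2r+2; the exact tree shape ParaView uses is not stated in the thread, so treat this as an illustrative sketch only.

```python
# Sketch of tree-based message forwarding, assuming rank r has
# children 2r+1 and 2r+2 (not necessarily ParaView's exact layout).
def children(rank, size):
    """Ranks that `rank` must forward the root's message to."""
    return [c for c in (2 * rank + 1, 2 * rank + 2) if c < size]

def broadcast_order(size):
    """Return (sender, receiver) pairs in the order messages propagate
    from rank 0 down the tree."""
    order, queue = [], [0]
    while queue:
        r = queue.pop(0)
        for c in children(r, size):
            order.append((r, c))
            queue.append(c)
    return order

# With 4 pvserver ranks: root sends to 1 and 2, then rank 1 sends to 3.
print(broadcast_order(4))  # [(0, 1), (0, 2), (1, 3)]
```

Note how, with 4 ranks, process 3 receives its copy only from rank 1 - so if process 3 is stuck in Cancel and never posts a receive, rank 1 is the process whose send blocks, exactly as in the traces above.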
Re: [Paraview] [ParaView3-Developer] Locking (MPI?) problems with ParaView 3.5
Sweet. I've committed the patch to CVS.

On Tue, Feb 10, 2009 at 10:45 AM, John Biddiscombe biddi...@cscs.ch wrote:

Utkarsh

Looks good. 10 out of 10 successes with the patch applied (using 5 to 10 pvserver nodes) - I'll try with more later when the rest of the machine is free. Thanks yet again for your help. If the problem comes back in another flavour I'll be sure to let you know.

JB
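[Editor's note] The freeze diagnosed in this thread - a satellite stuck in Cancel that never drains its receive queue while its parent blocks in a send - can be modelled with a small toy program. This is an analogy, not ParaView code: it assumes rendezvous-style blocking sends (as MPI uses for sufficiently large messages), modelled here with a bounded Python queue and a thread; the name `rank3_stuck_in_cancel` is invented for illustration.

```python
# Toy model of the hang: a bounded queue stands in for the MPI link from
# pvserver1 to pvserver3, and a thread parked on an Event stands in for
# pvserver3 spinning inside VMPI_Cancel, not posting receives.
import queue
import threading

link = queue.Queue(maxsize=1)  # tiny buffer: second send must rendezvous

def rank3_stuck_in_cancel(release):
    release.wait()             # "stuck in Cancel": no receives until released
    while not link.empty():    # once unstuck, drain pending messages
        link.get()

release = threading.Event()
t = threading.Thread(target=rank3_stuck_in_cancel, args=(release,))
t.start()

link.put("gather-request")     # first message fits in the buffer
sent_second = True
try:
    # Second message cannot buffer; with the receiver stuck it never
    # completes. The timeout stands in for "pvserver1 frozen forever".
    link.put("next-request", timeout=0.2)
except queue.Full:
    sent_second = False

release.set()                  # receiver finally returns from "Cancel"
t.join()
print(sent_second)  # False: the sender was blocked by the stuck receiver
```

In MPI terms: once the satellite returns from its Cancel/Wait and posts receives again, the parent's send can complete - which is presumably why removing or avoiding the stuck Cancel path made the hang disappear.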