Re: [Paraview] [ParaView3-Developer] Locking (MPI?) problems with ParaView 3.5

2009-02-10 Thread Utkarsh Ayachit
John,

Here's my understanding of what may be happening:
* pvclient has issued a gather-data-information request and is waiting
to receive that data.
* pvserver0 has received this gather data request and is asking all
satellites to gather data.
* pvserver1 also received that gather data request, but before it
starts processing it, pvserver1 needs to pass the message on to its
child, i.e. process 3 (this is the fancy tree-based message
communication we are now using: instead of the root node sending the
message to all processes one after the other, it uses a tree structure;
see the sketch after this list).
* pvserver2, I believe, is happily waiting for the next message. It
already processed the gather data request, sent the response to the
root (although the root has not processed that response yet) and is now
waiting for the next request.
* pvserver3 is the rogue. It's stuck in Cancel, hence it won't receive
any messages others send to it. This Cancel was not done as part of
this gather data request but at the end of some previous request. Thus
pvserver3 is not yet able to receive the gather data request pvserver1
is trying to send, consequently freezing pvserver1.
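
A rough sketch of that tree forwarding (not the actual ParaView code;
the function name, the tag and the printf are made up for illustration)
looks something like this: each rank forwards the request to its
children at 2*rank+1 and 2*rank+2 before handling it, so if a child is
stuck elsewhere (e.g. in a Cancel), the parent's send can stall, which
is the hang pvserver1 shows.

#include <mpi.h>
#include <cstdio>

static const int REQUEST_TAG = 9876; // hypothetical tag, illustration only

void ForwardRequestDownTree(int request, MPI_Comm comm)
{
  int rank = 0, size = 1;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  if (rank != 0)
  {
    // Non-root ranks first receive the request from their parent.
    MPI_Recv(&request, 1, MPI_INT, (rank - 1) / 2, REQUEST_TAG,
             comm, MPI_STATUS_IGNORE);
  }

  // Forward to the two children, if they exist. A standard-mode send
  // may block until the child posts a matching receive (with 4 server
  // ranks, rank 1's only child is rank 3, as described above).
  for (int child = 2 * rank + 1; child <= 2 * rank + 2; ++child)
  {
    if (child < size)
    {
      MPI_Send(&request, 1, MPI_INT, child, REQUEST_TAG, comm);
    }
  }

  std::printf("rank %d now processes request %d\n", rank, request);
}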

Of course, this is just one possible explanation. I've attached a
patch. Can you please test whether it overcomes the issue? If not, you
may also want to try commenting the Cancel out altogether. That will at
least help us narrow down the problem.
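
For reference, the MPI-level pattern involved is roughly the following:
a minimal sketch of cancelling a pending non-blocking receive (the
buffer size, tag and source here are illustrative, not ParaView's
actual values). In the pvserver3 trace below the wait happens inside
HP-MPI's MPI_Cancel itself (VMPI_Cancel calling VMPI_Wait), which is
where it spins in hpmp_waitany:

#include <mpi.h>

void CancelPendingReceive(MPI_Comm comm)
{
  char buffer[128];
  MPI_Request request;

  // Post a receive that may never be matched by a send.
  MPI_Irecv(buffer, sizeof(buffer), MPI_CHAR, MPI_ANY_SOURCE,
            1234 /* illustrative tag */, comm, &request);

  // Ask MPI to cancel it, then wait for either the cancellation or the
  // receive itself to complete; this is the step that never returns on
  // pvserver3.
  MPI_Cancel(&request);
  MPI_Wait(&request, MPI_STATUS_IGNORE);
}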

Utkarsh


On Tue, Feb 10, 2009 at 2:56 AM, John Biddiscombe biddi...@cscs.ch wrote:
 I've been having problems with ParaView 3.5 (CVS) since I started working on
 a parallel problem last week (which means it may have existed for some time,
 but I was doing other things).

 How to reproduce the problem:
 Start pvserver on 4 (or more) nodes and pvclient on 1.
 Create Sources/wavelet.
 ParaView hangs indefinitely without showing the bounding box of the cube.

 Reproducibility: not always (about 3 out of 4 times for me). Sometimes
 ParaView completes as usual, but shows other hanging symptoms later on.

 ParaView 3.4 does not have this problem.

 Something seems to have changed between 3.4 and 3.5 which is affecting the
 client-server delivery of data. It appears to be an MPI-related issue. I
 have attached the debugger to all pvserver nodes and the stack traces are
 shown below.

 To summarize:
 pvserver 3 : cancelled a request
 pvserver 2 : waiting in ReceiveDataInternal
 pvserver 1 : still in vtkMPICommunicatorSendData<char>
 pvserver 0 : still in vtkMPICommunicatorSendData<char>
 pvclient : trying to Receive

 It seems that one of the pvserver nodes has given up sending, or has nothing
 to send, and the others are waiting without end.

 Note 1: It took me time to go from one node to the other and attach gdb, so
 if I stopped one task, another might have been affected before it had its
 stack trace dumped. I'm not very expert at gdb.

 Note 2: ParaView 3.4 seems to work flawlessly. Something is different in
 3.5.

 Note 3: I ran the b_eff_io benchmark (https://fs.hlrs.de/projects/par/mpi//b_eff_io/
 - NB. link unreliable) on 15 nodes and it completed without error and
 without any unusual behaviour, so my instinct is that our HP-MPI is working
 OK.

 Questions:
 1) Has anyone else observed this behaviour?
 2) Has anyone changed anything in the MPI communicator code which might
 have caused this behaviour?
 3) It's possible we have a network problem and this is causing the locking,
 but our logs do not show any errors - can we rule this out?
 4) If anyone has answered yes to 1 or 2, do they know what's wrong and can
 they fix it?

 I welcome help, as this is preventing me from finishing my current project.

 JB


 Stack trace for pvserver 3
 ==

 #0  0x003dbd4afba9 in sched_yield () from /lib64/tls/libc.so.6
 #1  0x002a9a725a97 in hpmp_yield () from
 /opt/hpmpi/lib/linux_amd64/libmpi.so.1
 #2  0x002a9a71f3b4 in hpmp_adv () from
 /opt/hpmpi/lib/linux_amd64/libmpi.so.1
 #3  0x002a9a757012 in hpmp_waitany () from
 /opt/hpmpi/lib/linux_amd64/libmpi.so.1
 #4  0x002a9a756cf0 in VMPI_Waitany () from
 /opt/hpmpi/lib/linux_amd64/libmpi.so.1
 #5  0x002a9a753a9e in VMPI_Wait () from
 /opt/hpmpi/lib/linux_amd64/libmpi.so.1
 #6  0x002a9a754520 in VMPI_Cancel () from
 /opt/hpmpi/lib/linux_amd64/libmpi.so.1
 #7  0x002a99e27045 in vtkMPICommunicator::Request::Cancel
 (this=0xacf740) at
 /users/biddisco/code/pv-meshless/VTK/Parallel/vtkMPICommunicator.cxx:973
 #8  0x002a95639e36 in vtkPVProgressHandler::CleanupSatellites
 (this=0xae8dd0) at
 /users/biddisco/code/pv-meshless/Servers/Common/vtkPVProgressHandler.cxx:399
 #9  0x002a95639c85 in vtkPVProgressHandler::CleanupPendingProgress
 (this=0xae8dd0) at
 /users/biddisco/code/pv-meshless/Servers/Common/vtkPVProgressHandler.cxx:337
 #10 0x002a956061ce in vtkProcessModule::CleanupPendingProgress
 (this=0x590ee0) at
 /users/biddisco/code/pv-meshless/Servers/Common/vtkProcessModule.cxx:1267
 #11 0x002a965b22f7 in vtkProcessModuleCommand (arlu=0x591d20,
 ob=0x590ee0, 

Re: [Paraview] [ParaView3-Developer] Locking (MPI?) problems with ParaView 3.5

2009-02-10 Thread Utkarsh Ayachit
Sweet. I've committed the patch to CVS.

On Tue, Feb 10, 2009 at 10:45 AM, John Biddiscombe biddi...@cscs.ch wrote:
 Utkarsh

 Looks good. 10 out of 10 successes with the patch applied (using 5 to 10
 pvserver nodes) - I'll try with more later when the rest of the machine is
 free.

 Thanks yet again for your help. If the problem comes back in another flavour
 I'll be sure to let you know.

 JB


