Re: [OMPI devel] Segfault in 1.3 branch

2008-07-15 Thread Jeff Squyres
To be clear -- this looks like a different issue than what Pasha was  
reporting.



On Jul 15, 2008, at 8:55 AM, Rolf vandeVaart wrote:



Lenny, I opened a ticket for something that looks the same as this.  
Maybe you can add your details to it.


https://svn.open-mpi.org/trac/ompi/ticket/1386

Rolf

Lenny Verkhovsky wrote:


I guess it should be here, sorry.

/home/USERS/lenny/OMPI_ORTE_18850/bin/mpirun -np 2 -H witch2,witch3 ./IMB-MPI1_18850 PingPong

#---
# Intel (R) MPI Benchmark Suite V3.0v modified by Voltaire, MPI-1 part

#---
# Date : Tue Jul 15 15:11:30 2008
# Machine : x86_64
# System : Linux
# Release : 2.6.16.46-0.12-smp
# Version : #1 SMP Thu May 17 14:00:09 UTC 2007
# MPI Version : 2.0
# MPI Thread Environment: MPI_THREAD_SINGLE

#
# Minimum message length in bytes: 0
# Maximum message length in bytes: 67108864
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#

# List of Benchmarks to run:

# PingPong
[witch3:32461] *** Process received signal ***
[witch3:32461] Signal: Segmentation fault (11)
[witch3:32461] Signal code: Address not mapped (1)
[witch3:32461] Failing at address: 0x20
[witch3:32461] [ 0] /lib64/libpthread.so.0 [0x2b514fcedc10]
[witch3:32461] [ 1] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_pml_ob1.so [0x2b51510b416a]
[witch3:32461] [ 2] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_pml_ob1.so [0x2b51510b4661]
[witch3:32461] [ 3] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_pml_ob1.so [0x2b51510b180e]
[witch3:32461] [ 4] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_btl_openib.so [0x2b5151811c22]
[witch3:32461] [ 5] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_btl_openib.so [0x2b51518132e9]
[witch3:32461] [ 6] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_bml_r2.so [0x2b51512c412f]
[witch3:32461] [ 7] /home/USERS/lenny/OMPI_ORTE_18850/lib/libopen-pal.so.0(opal_progress+0x5a) [0x2b514f71268a]
[witch3:32461] [ 8] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_pml_ob1.so [0x2b51510af0f5]
[witch3:32461] [ 9] /home/USERS/lenny/OMPI_ORTE_18850/lib/libmpi.so.0(PMPI_Recv+0x13b) [0x2b514f47941b]

[witch3:32461] [10] ./IMB-MPI1_18850(IMB_pingpong+0x1a1) [0x4073cd]
[witch3:32461] [11] ./IMB-MPI1_18850(IMB_warm_up+0x2d) [0x405e49]
[witch3:32461] [12] ./IMB-MPI1_18850(main+0x394) [0x4034d4]
[witch3:32461] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b514fe14154]

[witch3:32461] [14] ./IMB-MPI1_18850 [0x4030a9]
[witch3:32461] *** End of error message ***
mpirun: killing job...

--
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--
 witch2
 witch3
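
[Note: frames 9-11 of the backtrace above show the crash happening inside
PMPI_Recv, called from IMB_pingpong during its warm-up phase. The
Voltaire-modified IMB source is not part of this thread; the sketch below is
only a generic illustration of the MPI_Send/MPI_Recv ping-pong pattern that
code path exercises, with an assumed message size and iteration count.]

/* Generic ping-pong sketch (illustrative only; NOT the IMB code).
 * Ranks 0 and 1 bounce a buffer back and forth over MPI_Send/MPI_Recv,
 * the same user-level path that appears in frames 9-11 of the trace. */
#include <mpi.h>

#define MSG_LEN 1024   /* assumed message size in bytes */
#define REPS    1000   /* assumed number of round trips */

int main(int argc, char **argv)
{
    char buf[MSG_LEN] = {0};
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2 && rank < 2) {
        int peer = 1 - rank;
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, MSG_LEN, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_LEN, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, MSG_LEN, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG_LEN, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
            }
        }
    }

    MPI_Finalize();
    return 0;
}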


On 7/15/08, *Pavel Shamis (Pasha)* wrote:



   It looks like a new issue to me, Pasha. Possibly a side consequence of
   the IOF change made by Jeff and me the other day. From what I can see,
   it looks like your app was a simple "hello" - correct?

   Yep, it is a simple hello application.

   If you look at the error, the problem occurs when mpirun is trying to
   route a message. Since the app is clearly running at this time, the
   problem is probably in the IOF. The error message shows that mpirun is
   attempting to route a message to a jobid that doesn't exist. We have a
   test in the RML that forces an "abort" if that occurs.

   I would guess that there is either a race condition or memory corruption
   occurring somewhere, but I have no idea where.

   This may be the "new hole in the dyke" I cautioned about in earlier notes
   regarding the IOF... :-)

   Still, given that this hits rarely, it probably is a more acceptable bug
   to leave in the code than the one we just fixed (duplicated stdin)...


   It is not such a rare issue: 19 failures in my MTT run
   (http://www.open-mpi.org/mtt/index.php?do_redir=765).

   Pasha

   Ralph



   On 7/14/08 1:11 AM, "Pavel Shamis (Pasha)" wrote:


   Please see http://www.open-mpi.org/mtt/index.php?do_redir=764

   The error is not consistent. It takes a lot of iterations to reproduce it.
   In my MTT testing I have seen it a few times.

   Is it a known issue?

   Regards,
   Pasha

Re: [OMPI devel] Segfault in 1.3 branch

2008-07-15 Thread Pavel Shamis (Pasha)

I opened a ticket for the bug:
https://svn.open-mpi.org/trac/ompi/ticket/1389

Ralph Castain wrote:

It looks like a new issue to me, Pasha. Possibly a side consequence of the
IOF change made by Jeff and me the other day. From what I can see, it looks
like your app was a simple "hello" - correct?

If you look at the error, the problem occurs when mpirun is trying to route
a message. Since the app is clearly running at this time, the problem is
probably in the IOF. The error message shows that mpirun is attempting to
route a message to a jobid that doesn't exist. We have a test in the RML
that forces an "abort" if that occurs.

I would guess that there is either a race condition or memory corruption
occurring somewhere, but I have no idea where.

This may be the "new hole in the dyke" I cautioned about in earlier notes
regarding the IOF... :-)

Still, given that this hits rarely, it probably is a more acceptable bug to
leave in the code than the one we just fixed (duplicated stdin)...

Ralph



On 7/14/08 1:11 AM, "Pavel Shamis (Pasha)"  wrote:

  

Please see http://www.open-mpi.org/mtt/index.php?do_redir=764

The error is not consistent. It takes a lot of iterations to reproduce it.
In my MTT testing I have seen it a few times.

Is it a known issue?

Regards,
Pasha




Re: [OMPI devel] Segfault in 1.3 branch

2008-07-15 Thread Rolf vandeVaart


Lenny, I opened a ticket for something that looks the same as this. 
Maybe you can add your details to it.


https://svn.open-mpi.org/trac/ompi/ticket/1386

Rolf

Lenny Verkhovsky wrote:


I guess it should be here, sorry.

/home/USERS/lenny/OMPI_ORTE_18850/bin/mpirun -np 2 -H witch2,witch3 ./IMB-MPI1_18850 PingPong

#---
# Intel (R) MPI Benchmark Suite V3.0v modified by Voltaire, MPI-1 part
#---
# Date : Tue Jul 15 15:11:30 2008
# Machine : x86_64
# System : Linux
# Release : 2.6.16.46-0.12-smp
# Version : #1 SMP Thu May 17 14:00:09 UTC 2007
# MPI Version : 2.0
# MPI Thread Environment: MPI_THREAD_SINGLE

#
# Minimum message length in bytes: 0
# Maximum message length in bytes: 67108864
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#

# List of Benchmarks to run:

# PingPong
[witch3:32461] *** Process received signal ***
[witch3:32461] Signal: Segmentation fault (11)
[witch3:32461] Signal code: Address not mapped (1)
[witch3:32461] Failing at address: 0x20
[witch3:32461] [ 0] /lib64/libpthread.so.0 [0x2b514fcedc10]
[witch3:32461] [ 1] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_pml_ob1.so [0x2b51510b416a]
[witch3:32461] [ 2] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_pml_ob1.so [0x2b51510b4661]
[witch3:32461] [ 3] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_pml_ob1.so [0x2b51510b180e]
[witch3:32461] [ 4] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_btl_openib.so [0x2b5151811c22]
[witch3:32461] [ 5] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_btl_openib.so [0x2b51518132e9]
[witch3:32461] [ 6] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_bml_r2.so [0x2b51512c412f]
[witch3:32461] [ 7] /home/USERS/lenny/OMPI_ORTE_18850/lib/libopen-pal.so.0(opal_progress+0x5a) [0x2b514f71268a]
[witch3:32461] [ 8] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_pml_ob1.so [0x2b51510af0f5]
[witch3:32461] [ 9] /home/USERS/lenny/OMPI_ORTE_18850/lib/libmpi.so.0(PMPI_Recv+0x13b) [0x2b514f47941b]

[witch3:32461] [10] ./IMB-MPI1_18850(IMB_pingpong+0x1a1) [0x4073cd]
[witch3:32461] [11] ./IMB-MPI1_18850(IMB_warm_up+0x2d) [0x405e49]
[witch3:32461] [12] ./IMB-MPI1_18850(main+0x394) [0x4034d4]
[witch3:32461] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b514fe14154]

[witch3:32461] [14] ./IMB-MPI1_18850 [0x4030a9]
[witch3:32461] *** End of error message ***
mpirun: killing job...

--
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--
  witch2
  witch3


On 7/15/08, *Pavel Shamis (Pasha)* wrote:



It looks like a new issue to me, Pasha. Possibly a side consequence of the
IOF change made by Jeff and me the other day. From what I can see, it looks
like your app was a simple "hello" - correct?
 


Yep, it is a simple hello application.

If you look at the error, the problem occurs when mpirun is trying to route
a message. Since the app is clearly running at this time, the problem is
probably in the IOF. The error message shows that mpirun is attempting to
route a message to a jobid that doesn't exist. We have a test in the RML
that forces an "abort" if that occurs.

I would guess that there is either a race condition or memory corruption
occurring somewhere, but I have no idea where.

This may be the "new hole in the dyke" I cautioned about in earlier notes
regarding the IOF... :-)

Still, given that this hits rarely, it probably is a more acceptable bug to
leave in the code than the one we just fixed (duplicated stdin)...
 


It is not such a rare issue: 19 failures in my MTT run
(http://www.open-mpi.org/mtt/index.php?do_redir=765).

Pasha

Ralph



On 7/14/08 1:11 AM, "Pavel Shamis (Pasha)" wrote:

 


Please see http://www.open-mpi.org/mtt/index.php?do_redir=764

The error is not consistent. It takes a lot of iterations to reproduce it.
In my MTT testing I have seen it a few times.

Is it a known issue?

Regards,
Pasha

Re: [OMPI devel] Segfault in 1.3 branch

2008-07-15 Thread Pavel Shamis (Pasha)



It looks like a new issue to me, Pasha. Possibly a side consequence of the
IOF change made by Jeff and me the other day. From what I can see, it looks
like your app was a simple "hello" - correct?
  

Yep, it is a simple hello application.
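
[Pasha's exact test source is not included in this thread; a minimal MPI
"hello" of the kind described would look roughly like this:]

/* Minimal MPI "hello" sketch (assumed; not Pasha's actual test). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}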

If you look at the error, the problem occurs when mpirun is trying to route
a message. Since the app is clearly running at this time, the problem is
probably in the IOF. The error message shows that mpirun is attempting to
route a message to a jobid that doesn't exist. We have a test in the RML
that forces an "abort" if that occurs.

I would guess that there is either a race condition or memory corruption
occurring somewhere, but I have no idea where.

This may be the "new hole in the dyke" I cautioned about in earlier notes
regarding the IOF... :-)

Still, given that this hits rarely, it probably is a more acceptable bug to
leave in the code than the one we just fixed (duplicated stdin)...
  
It is not such a rare issue: 19 failures in my MTT run
(http://www.open-mpi.org/mtt/index.php?do_redir=765).


Pasha

Ralph



On 7/14/08 1:11 AM, "Pavel Shamis (Pasha)"  wrote:

  

Please see http://www.open-mpi.org/mtt/index.php?do_redir=764

The error is not consistent. It takes a lot of iterations to reproduce it.
In my MTT testing I have seen it a few times.

Is it a known issue?

Regards,
Pasha




Re: [OMPI devel] Segfault in 1.3 branch

2008-07-14 Thread Ralph Castain
It looks like a new issue to me, Pasha. Possibly a side consequence of the
IOF change made by Jeff and me the other day. From what I can see, it looks
like your app was a simple "hello" - correct?

If you look at the error, the problem occurs when mpirun is trying to route
a message. Since the app is clearly running at this time, the problem is
probably in the IOF. The error message shows that mpirun is attempting to
route a message to a jobid that doesn't exist. We have a test in the RML
that forces an "abort" if that occurs.
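
[The actual ORTE/RML code is not quoted in this thread; the sketch below only
illustrates the kind of jobid-validity check being described, using made-up
names rather than the real ORTE symbols.]

/* Hypothetical sketch of a "refuse to route to an unknown jobid" check.
 * None of these names are real ORTE/RML symbols; they only illustrate the
 * behaviour described above: abort rather than route a message for a job
 * that mpirun does not know about. */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t jobid_t;

/* Assume a small table of jobs this mpirun instance launched. */
static const jobid_t known_jobs[] = { 1, 2 };
static const size_t num_known_jobs =
    sizeof(known_jobs) / sizeof(known_jobs[0]);

static bool jobid_is_known(jobid_t jobid)
{
    for (size_t i = 0; i < num_known_jobs; i++) {
        if (known_jobs[i] == jobid) {
            return true;
        }
    }
    return false;
}

/* Called on the routing path before a message is forwarded. */
static void route_message(jobid_t target_jobid)
{
    if (!jobid_is_known(target_jobid)) {
        /* The forced abort: an unknown jobid points at a race or memory
         * corruption, so fail loudly instead of routing into the void. */
        fprintf(stderr, "cannot route message to unknown jobid %u\n",
                (unsigned) target_jobid);
        abort();
    }
    /* ... otherwise look up the route and forward the message ... */
}

int main(void)
{
    route_message(1);   /* known jobid: proceeds */
    route_message(42);  /* unknown jobid: aborts */
    return 0;
}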

I would guess that there is either a race condition or memory corruption
occurring somewhere, but I have no idea where.

This may be the "new hole in the dyke" I cautioned about in earlier notes
regarding the IOF... :-)

Still, given that this hits rarely, it probably is a more acceptable bug to
leave in the code than the one we just fixed (duplicated stdin)...

Ralph



On 7/14/08 1:11 AM, "Pavel Shamis (Pasha)"  wrote:

> Please see http://www.open-mpi.org/mtt/index.php?do_redir=764
> 
> The error is not consistent. It takes a lot of iterations to reproduce it.
> In my MTT testing I have seen it a few times.
> 
> Is it a known issue?
> 
> Regards,
> Pasha




[OMPI devel] Segfault in 1.3 branch

2008-07-14 Thread Pavel Shamis (Pasha)

Please see http://www.open-mpi.org/mtt/index.php?do_redir=764

The error is not consistent. It takes a lot of iterations to reproduce it.
In my MTT testing I have seen it a few times.

Is it a known issue?

Regards,
Pasha