On Jun 11, 2012, at 11:38 AM, Eugene Loh wrote:

> On 6/9/2012 6:49 PM, Ralph Castain wrote:
>> 
>> I fixed this one, I believe
> Sorry, I'm confused.  You mean you think you fixed the oob:ud:qp_init one?
> Which rev has the fix?

Yes - r26587. MTT tests look pretty good at that point, per this morning's 
report.


>> will have to look more at the loop_spawn issue later.
> The original one I reported, I assume?  I see similar stacks on segfaults 
> with a variety of tests.  So, I think it's not specific to loop_spawn.

It's a race condition in the oob, so I'm looking at it.


>> 
>> On Sat, Jun 9, 2012 at 3:35 PM, Eugene Loh <eugene....@oracle.com> wrote:
>> On 6/9/2012 12:06 PM, Eugene Loh wrote:
>> With r26565:
>>    Enable orte progress threads and libevent thread support by default
>> Oracle MTT testing started showing new spawn_multiple failures.
>> Sorry.  I meant loop_spawn.
>> 
>> (And then, starting I think in r26582, the problem is masked behind another 
>> issue, "oob:ud:qp_init could not create queue pair", which is creating 
>> widespread problems for Cisco, IU, and Oracle MTT testing.  I suppose that's 
>> the subject of a different e-mail thread.)
>> 
>> I've only seen this in 64-bit.  Here are two segfaults, both from Linux/x86 
>> systems running over TCP:
>> 
>> This one with GNU compilers:
>>    [...]
>>    parent: MPI_Comm_spawn #300 return : 0
>>    [burl-ct-v20z-26:28518] *** Process received signal ***
>>    [burl-ct-v20z-26:28518] Signal: Segmentation fault (11)
>>    [burl-ct-v20z-26:28518] Signal code: Address not mapped (1)
>>    [burl-ct-v20z-26:28518] Failing at address: (nil)
>>    [burl-ct-v20z-26:28518] [ 0] /lib64/libpthread.so.0 [0x3a21c0e7c0]
>>    [burl-ct-v20z-26:28518] [ 1] /lib64/libc.so.6(memcpy+0x35) [0x3a2107bde5]
>>    [burl-ct-v20z-26:28518] [ 2] /workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_copy+0x58)
>>    [burl-ct-v20z-26:28518] [ 3] /workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_oob_tcp.so
>>    [burl-ct-v20z-26:28518] [ 4] /workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv_nb+0x314)
>>    [burl-ct-v20z-26:28518] [ 5] /workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_rml_oob.so(orte_rml_oob_recv_buffer_nb+0xff)
>>    [burl-ct-v20z-26:28518] [ 6] /workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/openmpi/mca_dpm_orte.so
>>    [burl-ct-v20z-26:28518] [ 7] /workspace/tdontje/hpc/mtt-scratch/burl-ct-v20z-26/ompi-tarball-testing/installs/smMv/install/lib/lib64/libmpi.so.0(PMPI_Comm_spawn+0x2ee)
>>    [burl-ct-v20z-26:28518] [ 8] dynamic/loop_spawn [0x40120b]
>>    [burl-ct-v20z-26:28518] [ 9] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3a2101d994]
>>    [burl-ct-v20z-26:28518] [10] dynamic/loop_spawn [0x400dd9]
>>    [burl-ct-v20z-26:28518] *** End of error message ***
>> 
>> This one with Oracle Studio compilers:
>>    parent: MPI_Comm_spawn #0 return : 0
>>    parent: MPI_Comm_spawn #20 return : 0
>>    [burl-ct-x2200-12:02348] *** Process received signal ***
>>    [burl-ct-x2200-12:02348] Signal: Segmentation fault (11)
>>    [burl-ct-x2200-12:02348] Signal code: Address not mapped (1)
>>    [burl-ct-x2200-12:02348] Failing at address: 0x10
>>    [burl-ct-x2200-12:02348] [ 0] /lib64/libpthread.so.0 [0x318ac0de80]
>>    [burl-ct-x2200-12:02348] [ 1] /workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/openmpi/mca_oob_tcp.so(mca_oob_tcp_msg_recv_handler+0xe3)
>>    [burl-ct-x2200-12:02348] [ 2] /workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/openmpi/mca_oob_tcp.so
>>    [burl-ct-x2200-12:02348] [ 3] /workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/libmpi.so.0
>>    [burl-ct-x2200-12:02348] [ 4] /workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/libmpi.so.0(opal_libevent2019_event_base_loop+0x7c7)
>>    [burl-ct-x2200-12:02348] [ 5] /workspace/tdontje/hpc/mtt-scratch/burl-ct-x2200-12/ompi-tarball-testing/installs/Q7wT/install/lib/lib64/libmpi.so.0
>>    [burl-ct-x2200-12:02348] [ 6] /lib64/libpthread.so.0 [0x318ac06307]
>>    [burl-ct-x2200-12:02348] [ 7] /lib64/libc.so.6(clone+0x6d) [0x318a0d1ded]
>>    [burl-ct-x2200-12:02348] *** End of error message ***
>> 
>> Sometimes, I see a hang rather than a segfault.
