Jeff, I posted a patch for this on the ticket.
Scott

On Aug 26, 2010, at 10:10 AM, Scott Atchley wrote:

> Hi all,
>
> I compiled 1.4.3rc1 with MX 1.2.12 on RHEL 5.4 (2.6.18-164.el5). It does not
> like the memory manager and MX. Compiling using --without-memory-manager
> works fine. The output below is from the default configure (i.e.
> --with-memory-manager).
>
> Note, I still see unusual latencies for some tests when using the BTL such as
> reduce-scatter, allgather, etc. I do not see them with the MTL. An example of
> BTL latencies from reduce-scatter is:
>
>       256   1000      7.01      7.01      7.01
>       512   1000      7.56      7.56      7.56
>      1024   1000      8.58      8.58      8.58
>      2048   1000     10.36     10.36     10.36
>      4096   1000     14.49     14.49     14.49
>      8192   1000   5180.16   5180.57   5180.36
>     16384   1000     94.96     94.97     94.96
>     32768   1000   4676.30   4676.68   4676.49
>     65536    640   4625.85   4626.23   4626.04
>    131072    320    243.43    243.46    243.45
>    262144    160    425.56    425.66    425.61
>
> Scott
>
> % mpirun -hostfile hosts -np 2 ./IMB-MPI1.ompi-1.4.3rc1 pingpong
> [rain16:22509] *** Process received signal ***
> [rain16:22509] Signal: Segmentation fault (11)
> [rain16:22509] Signal code: Address not mapped (1)
> [rain16:22509] Failing at address: 0x2c0
> [rain15:24145] *** Process received signal ***
> [rain15:24145] Signal: Segmentation fault (11)
> [rain15:24145] Signal code: Address not mapped (1)
> [rain15:24145] Failing at address: 0x25a0
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 22509 on node rain16 exited on
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> gdb shows:
>
> #0  0x0000003d084075c8 in ?? () from /lib64/libgcc_s.so.1
> (gdb) bt
> #0  0x0000003d084075c8 in ?? () from /lib64/libgcc_s.so.1
> #1  0x0000003d0840882b in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
> #2  0x0000003d060e5eb8 in backtrace () from /lib64/libc.so.6
> #3  0x00002af68e7a47de in opal_backtrace_buffer ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #4  0x00002af68e7a24ce in show_stackframe ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #5  <signal handler called>
> #6  0x00000000000002c0 in ?? ()
> #7  0x00002af690520640 in mca_mpool_fake_release_memory ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/openmpi/mca_mpool_fake.so
> #8  0x00002af68e2f49ce in mca_mpool_base_mem_cb ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #9  0x00002af68e78347b in opal_mem_hooks_release_hook ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #10 0x00002af68e7a791f in opal_mem_free_ptmalloc2_munmap ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #11 0x00002af68e7ac2b1 in opal_memory_ptmalloc2_free_hook ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #12 0x0000003d060727c1 in free () from /lib64/libc.so.6
> #13 0x00002af69197aaad in mx__rl_fini (rl=0xab5f928)
>    at ../../../libmyriexpress/userspace/../mx__request.c:102
> #14 0x00002af69196924d in mx_close_endpoint (endpoint=0xab5f820)
>    at ../../../libmyriexpress/userspace/../mx_close_endpoint.c:124
> #15 0x00002af69155e3dc in ompi_mtl_mx_finalize ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/openmpi/mca_mtl_mx.so
> #16 0x00002af68e2f87e0 in mca_pml_base_select ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #17 0x00002af68e2bcf40 in ompi_mpi_init ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #18 0x00002af68e2da2b1 in PMPI_Init_thread ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #19 0x0000000000403359 in main ()
>
>
> If I tell it to use BTLs only it changes to:
>
> % mpirun -mca pml ob1 -hostfile hosts -np 2 ./IMB-MPI1.ompi-1.4.3rc1 pingpong
> [rain16:22552] *** Process received signal ***
> [rain15:24195] *** Process received signal ***
> [rain15:24195] Signal: Segmentation fault (11)
> [rain15:24195] Signal code: Address not mapped (1)
> [rain15:24195] Failing at address: 0x290
> [rain16:22552] Signal: Segmentation fault (11)
> [rain16:22552] Signal code: Address not mapped (1)
> [rain16:22552] Failing at address: 0x290
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 22552 on node rain16 exited on
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> gdb shows:
>
> #0  0x0000003d084075c8 in ?? () from /lib64/libgcc_s.so.1
> #1  0x0000003d0840882b in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
> #2  0x0000003d060e5eb8 in backtrace () from /lib64/libc.so.6
> #3  0x00002b8310ee17de in opal_backtrace_buffer ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #4  0x00002b8310edf4ce in show_stackframe ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #5  <signal handler called>
> #6  0x0000000000000290 in ?? ()
> #7  0x00002b8312c5d640 in mca_mpool_fake_release_memory ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/openmpi/mca_mpool_fake.so
> #8  0x00002b8310a319ce in mca_mpool_base_mem_cb ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #9  0x00002b8310ec047b in opal_mem_hooks_release_hook ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #10 0x00002b8310ee5195 in sYSTRIm ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #11 0x00002b8310ee92da in opal_memory_ptmalloc2_free_hook ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #12 0x0000003d060727c1 in free () from /lib64/libc.so.6
> #13 0x0000003d060960bd in closedir () from /lib64/libc.so.6
> #14 0x00002b8310ec7cc9 in foreachfile_callback ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #15 0x00002b8310ec797a in foreach_dirinpath ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #16 0x00002b8310ec7a1e in lt_dlforeachfile ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #17 0x00002b8310ecf2a5 in mca_base_component_find ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #18 0x00002b8310ecfc75 in mca_base_components_open ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #19 0x00002b8310a2eb46 in ompi_dpm_base_open ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #20 0x00002b83109fa3c2 in ompi_mpi_init ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #21 0x00002b8310a172b1 in PMPI_Init_thread ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #22 0x0000000000403359 in main ()
>
>
> Lastly, with just the MTL:
>
> % mpirun -mca pml cm -hostfile hosts -np 2 ./IMB-MPI1.ompi-1.4.3rc1 pingpong
> [rain16:22607] *** Process received signal ***
> [rain15:24247] *** Process received signal ***
> [rain15:24247] Signal: Segmentation fault (11)
> [rain15:24247] Signal code: Address not mapped (1)
> [rain15:24247] Failing at address: 0x38e0
> [rain16:22607] Signal: Segmentation fault (11)
> [rain16:22607] Signal code: Address not mapped (1)
> [rain16:22607] Failing at address: 0x38e0
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 22607 on node rain16 exited on
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
>
> gdb shows:
>
> #0  0x0000003d084075c8 in ?? () from /lib64/libgcc_s.so.1
> #1  0x0000003d0840882b in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
> #2  0x0000003d060e5eb8 in backtrace () from /lib64/libc.so.6
> #3  0x00002afa78ae87de in opal_backtrace_buffer ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #4  0x00002afa78ae64ce in show_stackframe ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #5  <signal handler called>
> #6  0x00000000000038e0 in ?? ()
> #7  0x00002afa7a864640 in mca_mpool_fake_release_memory ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/openmpi/mca_mpool_fake.so
> #8  0x00002afa786389ce in mca_mpool_base_mem_cb ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #9  0x00002afa78ac747b in opal_mem_hooks_release_hook ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #10 0x00002afa78aec195 in sYSTRIm ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #11 0x00002afa78af02da in opal_memory_ptmalloc2_free_hook ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #12 0x0000003d060727c1 in free () from /lib64/libc.so.6
> #13 0x00002afa78acec45 in foreachfile_callback ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #14 0x00002afa78ace97a in foreach_dirinpath ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #15 0x00002afa78acea1e in lt_dlforeachfile ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #16 0x00002afa78ad62a5 in mca_base_component_find ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #17 0x00002afa78ad6c75 in mca_base_components_open ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #18 0x00002afa7863ca26 in ompi_pubsub_base_open ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #19 0x00002afa78601394 in ompi_mpi_init ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #20 0x00002afa7861e2b1 in PMPI_Init_thread ()
>    from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #21 0x0000000000403359 in main ()
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel