tried with vader - same crash

14:14:22 [vegas12:32068] 7 more processes have sent help message help-mca-var.txt / deprecated-mca-env
14:14:22 [vegas12:32068] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
14:14:22 + LD_LIBRARY_PATH=/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib
14:14:22 + OMPI_MCA_scoll_fca_enable=1
14:14:22 + OMPI_MCA_scoll_fca_np=0
14:14:22 + OMPI_MCA_pml=ob1
14:14:22 + OMPI_MCA_btl=vader,self,openib
14:14:22 + OMPI_MCA_spml=yoda
14:14:22 + OMPI_MCA_memheap_mr_interleave_factor=8
14:14:22 + OMPI_MCA_memheap=ptmalloc
14:14:22 + OMPI_MCA_btl_openib_if_include=mlx4_0:1
14:14:22 + OMPI_MCA_rmaps_base_dist_hca=mlx4_0
14:14:22 + OMPI_MCA_memheap_base_hca_name=mlx4_0
14:14:22 + OMPI_MCA_rmaps_base_mapping_policy=dist:mlx4_0
14:14:22 + MXM_RDMA_PORTS=mlx4_0:1
14:14:22 + SHMEM_SYMMETRIC_HEAP_SIZE=1024M
14:14:22 + timeout -s SIGSEGV 3m /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/bin/oshrun -np 8 /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/examples/hello_shmem
14:14:22 [vegas12][[4652,1],1][btl_openib_component.c:909:device_destruct] Failed to cancel OpenIB progress thread
14:14:22 [vegas12][[4652,1],0][btl_openib_component.c:909:device_destruct] Failed to cancel OpenIB progress thread
14:14:22 --------------------------------------------------------------------------
14:14:22 WARNING: The openib BTL was directed to use "eager RDMA" for short
14:14:22 messages, but the openib BTL was compiled with progress threads
14:14:22 support. Short eager RDMA is not yet supported with progress threads;
14:14:22 its use has been disabled in this job.
14:14:22
14:14:22 This is a warning only; you job will attempt to continue.
14:14:22 --------------------------------------------------------------------------
14:14:22 [vegas12][[4652,1],5][btl_openib_component.c:909:device_destruct] Failed to cancel OpenIB progress thread
14:14:22 [vegas12:32108] *** Process received signal ***
14:14:22 [vegas12:32108] Signal: Segmentation fault (11)
14:14:22 [vegas12:32108] Signal code: Address not mapped (1)
14:14:22 [vegas12:32108] Failing at address: (nil)
14:14:22 [vegas12:32108] [ 0] /lib64/libpthread.so.0[0x3937c0f500]
14:14:22 [vegas12:32108] [ 1] /usr/lib64/libibverbs.so.1(ibv_destroy_comp_channel+0x16)[0x3b7760bf46]
14:14:22 [vegas12:32108] [ 2] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib/openmpi/mca_btl_openib.so(+0xdf02)[0x7ffff3fc1f02]
14:14:22 [vegas12:32108] [ 3] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib/openmpi/mca_btl_openib.so(+0xf161)[0x7ffff3fc3161]
14:14:22 [vegas12:32108] [ 4] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib/openmpi/mca_btl_openib.so(+0x12ab1)[0x7ffff3fc6ab1]
14:14:22 [vegas12:32108] [ 5] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib/libmpi.so.0(mca_btl_base_select+0x117)[0x7ffff7a29807]
14:14:22 [vegas12:32108] [ 6] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x12)[0x7ffff41ed7e2]
14:14:22 [vegas12:32108] [ 7] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib/libmpi.so.0(mca_bml_base_init+0x99)[0x7ffff7a29009]
14:14:22 [vegas12:32108] [ 8] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib/openmpi/mca_pml_ob1.so(+0x58b5)[0x7ffff35848b5]
14:14:22 [vegas12:32108] [ 9] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib/libmpi.so.0(mca_pml_base_select+0x1e0)[0x7ffff7a3c590]
14:14:22 [vegas12:32108] [10] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib/libmpi.so.0(ompi_mpi_init+0x455)[0x7ffff7a06bf5]
14:14:22 [vegas12:32108] [11] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib/liboshmem.so.0(oshmem_shmem_init+0xfd)[0x7ffff7ca66dd]
14:14:22 [vegas12:32108] [12] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib/liboshmem.so.0(shmem_init+0x28)[0x7ffff7ca9328]
14:14:22 [vegas12:32108] [13] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/examples/hello_shmem[0x40077d]
14:14:22 [vegas12:32108] [14] /lib64/libc.so.6(__libc_start_main+0xfd)[0x393741ecdd]
14:14:22 [vegas12:32108] [15] /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/examples/hello_shmem[0x4006a9]
14:14:22 [vegas12:32108] *** End of error message ***
14:14:22 [vegas12:32112] *** Process received signal ***
14:14:22 [vegas12:32112] Signal: Segmentation fault (11)
14:14:

On Wed, Jun 25, 2014 at 9:11 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:

> Mike,
>
> could you try again with
>
> OMPI_MCA_btl=vader,self,openib
>
> it seems the sm module causes a hang
> (which later causes the timeout sending a SIGSEGV)
>
> Cheers,
>
> Gilles
>
> On 2014/06/25 14:22, Mike Dubman wrote:
> > Hi,
> > The following commit broke trunk in jenkins:
> >
> >>>> Per the OMPI developer conference, remove the last vestiges of OMPI_USE_PROGRESS_THREADS
> >
> > 22:15:09 + LD_LIBRARY_PATH=/scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/lib
> > 22:15:09 + OMPI_MCA_scoll_fca_enable=1
> > 22:15:09 + OMPI_MCA_scoll_fca_np=0
> > 22:15:09 + OMPI_MCA_pml=ob1
> > 22:15:09 + OMPI_MCA_btl=sm,self,openib
> > 22:15:09 + OMPI_MCA_spml=yoda
> > 22:15:09 + OMPI_MCA_memheap_mr_interleave_factor=8
> > 22:15:09 + OMPI_MCA_memheap=ptmalloc
> > 22:15:09 + OMPI_MCA_btl_openib_if_include=mlx4_0:1
> > 22:15:09 + OMPI_MCA_rmaps_base_dist_hca=mlx4_0
> > 22:15:09 + OMPI_MCA_memheap_base_hca_name=mlx4_0
> > 22:15:09 + OMPI_MCA_rmaps_base_mapping_policy=dist:mlx4_0
> > 22:15:09 + MXM_RDMA_PORTS=mlx4_0:1
> > 22:15:09 + SHMEM_SYMMETRIC_HEAP_SIZE=1024M
> > 22:15:09 + timeout -s SIGSEGV 3m /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/oshm_install2/bin/oshrun -np 8 /scrap/jenkins/scrap/workspace/hpc-ompi-shmem/label/hpc-test-node/examples/hello_shmem
> > 22:15:09 [vegas12:08101] *** Process received signal ***
> > 22:15:09 [vegas12:08101] Signal: Segmentation fault (11)
> > 22:15:09 [vegas12:08101] Signal code: Address not mapped (1)
> > 22:15:09 [vegas12:08101] Failing at address: (nil)
> > 22:15:09 [vegas12:08101] [
> >
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/06/15055.php