Just FWIW: I believe that problem did indeed make it over to 1.7.4, and that release is on "hold" pending your fix. So while I'm happy to hear about xpmem on SGI, please do let us release 1.7.4!
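In the meantime, a possible workaround for anyone hitting this on a 32-bit build: per George's report below, excluding the affected components gets things running again. If I'm reading it right, something along these lines should do it (the ^ prefix excludes a component; untested on my end):

    mpirun --mca btl ^vader --mca coll ^ml ...

As far as I know coll/ml is the only consumer of the basesmuma bcol, so excluding it should sidestep the second crash as well.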
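Also, for list context on the "no 64-bit atomic math" issue Nathan mentions below: the usual shape of such a fix is to use the compiler's native 64-bit atomics where they exist and fall back to a lock on 32-bit targets that lack them. A rough, untested sketch (not the actual vader change; fetch_add_64() and the lock are made-up names):

#include <stdint.h>
#include <pthread.h>

#if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_8)
/* the target supports 8-byte __sync builtins: do the add atomically */
static inline uint64_t fetch_add_64 (volatile uint64_t *addr, uint64_t delta)
{
    return __sync_fetch_and_add (addr, delta);
}
#else
/* no native 64-bit atomic math: serialize the update with a mutex */
static pthread_mutex_t fetch_add_64_lock = PTHREAD_MUTEX_INITIALIZER;

static inline uint64_t fetch_add_64 (volatile uint64_t *addr, uint64_t delta)
{
    uint64_t old;
    pthread_mutex_lock (&fetch_add_64_lock);
    old = *addr;
    *addr = old + delta;
    pthread_mutex_unlock (&fetch_add_64_lock);
    return old;
}
#endif

The lock path is obviously slower, but it keeps 32-bit builds correct until real per-platform atomics are in place.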
On Jan 27, 2014, at 8:19 AM, Nathan Hjelm <hje...@lanl.gov> wrote:

> Yup. Has to do with not having 64-bit atomic math. The fix is complete
> but I am working on another update to enable using xpmem on SGI
> systems. I will push the changes once that is complete.
>
> -Nathan
>
> On Mon, Jan 27, 2014 at 04:00:08PM +0000, Jeff Squyres (jsquyres) wrote:
>> Is this the same issue Absoft is seeing in 32-bit builds on the trunk?
>> (i.e., 100% failure rate)
>>
>> http://mtt.open-mpi.org/index.php?do_redir=2142
>>
>> On Jan 27, 2014, at 10:38 AM, Nathan Hjelm <hje...@lanl.gov> wrote:
>>
>>> This shouldn't be affecting 1.7.4 since neither the vader nor coll/ml
>>> updates have been moved yet. As for trunk, I am working on a 32-bit fix
>>> for vader and it should be in later today. I will have to track down
>>> what is going wrong with the basesmuma initialization.
>>>
>>> -Nathan
>>>
>>> On Sun, Jan 26, 2014 at 04:10:29PM +0100, George Bosilca wrote:
>>>> I noticed two major issues on 32-bit machines. The first one is with the
>>>> vader BTL and the second with the selection logic in basesmuma
>>>> (bcol_basesmuma_bank_init_opti). Both versions are impacted: trunk and 1.7.
>>>>
>>>> If I turn off vader and bcol via the MCA parameters, everything runs just
>>>> fine.
>>>>
>>>> George.
>>>>
>>>> ../trunk/configure --enable-debug --disable-mpi-cxx --disable-mpi-fortran
>>>>     --disable-io-romio --enable-contrib-no-build=vt,libtrace
>>>>     --enable-mpirun-prefix-by-default
>>>>
>>>> - Vader generates a segfault for any application, even with only 2
>>>> processes, so this should be pretty easy to track.
>>>>
>>>> Program received signal SIGSEGV, Segmentation fault.
>>>> (gdb) bt
>>>> #0  0x00000000 in ?? ()
>>>> #1  0x00ae43b3 in mca_btl_vader_poll_fifo ()
>>>>     at ../../../../../trunk/ompi/mca/btl/vader/btl_vader_component.c:394
>>>> #2  0x00ae444a in mca_btl_vader_component_progress ()
>>>>     at ../../../../../trunk/ompi/mca/btl/vader/btl_vader_component.c:421
>>>> #3  0x008fdb95 in opal_progress ()
>>>>     at ../../trunk/opal/runtime/opal_progress.c:186
>>>> #4  0x001961bc in ompi_request_default_test_some (count=13,
>>>>     requests=0xb1f01d48, outcount=0xb2afb2d0, indices=0xb1f01f60,
>>>>     statuses=0xb1f02178) at ../../trunk/ompi/request/req_test.c:316
>>>> #5  0x001def9a in PMPI_Testsome (incount=13, requests=0xb1f01d48,
>>>>     outcount=0xb2afb2d0, indices=0xb1f01f60, statuses=0xb1f02178)
>>>>     at ptestsome.c:81
>>>>
>>>> - basesmuma overwrites memory. The results_array can't be released because
>>>> the memory is corrupted. I did not have time to investigate too much, but
>>>> it looks like pload_mgmt->data_buffs is either too small or somehow data
>>>> is stored outside its boundaries.
>>>>
>>>> *** glibc detected ***
>>>> /home/bosilca/unstable/dplasma/trunk/build/debug/dplasma/testing/testing_dpotrf:
>>>> free(): invalid next size (fast): 0x081f0798 ***
>>>>
>>>> (gdb) bt
>>>> #0  0x00130424 in __kernel_vsyscall ()
>>>> #1  0x006bfb11 in raise () from /lib/libc.so.6
>>>> #2  0x006c13ea in abort () from /lib/libc.so.6
>>>> #3  0x006ff9d5 in __libc_message () from /lib/libc.so.6
>>>> #4  0x00705e31 in malloc_printerr () from /lib/libc.so.6
>>>> #5  0x00708571 in _int_free () from /lib/libc.so.6
>>>> #6  0x00c02d0e in bcol_basesmuma_bank_init_opti (ml_module=0x81dfe60,
>>>>     bcol_module=0xb30b3008, reg_data=0x81e6698)
>>>>     at ../../../../../trunk/ompi/mca/bcol/basesmuma/bcol_basesmuma_buf_mgmt.c:472
>>>> #7  0x00b7035f in mca_coll_ml_register_bcols (ml_module=0x81dfe60)
>>>>     at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:513
>>>> #8  0x00b70651 in ml_module_memory_initialization (ml_module=0x81dfe60)
>>>>     at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:560
>>>> #9  0x00b733fd in ml_discover_hierarchy (ml_module=0x81dfe60)
>>>>     at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:1585
>>>> #10 0x00b7786e in mca_coll_ml_comm_query (comm=0x8127da0,
>>>>     priority=0xbfffe558)
>>>>     at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:2998
>>>> #11 0x00202ea6 in query_2_0_0 (component=0xbc6500, comm=0x8127da0,
>>>>     priority=0xbfffe558, module=0xbfffe580)
>>>>     at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:375
>>>> #12 0x00202e7f in query (component=0xbc6500, comm=0x8127da0,
>>>>     priority=0xbfffe558, module=0xbfffe580)
>>>>     at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:358
>>>> #13 0x00202d9e in check_one_component (comm=0x8127da0, component=0xbc6500,
>>>>     module=0xbfffe580)
>>>>     at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:320
>>>> #14 0x00202bce in check_components (components=0x253d70, comm=0x8127da0)
>>>>     at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:284
>>>> #15 0x001fbbe1 in mca_coll_base_comm_select (comm=0x8127da0)
>>>>     at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:117
>>>> #16 0x0019872f in ompi_mpi_init (argc=7, argv=0xbfffee74, requested=0,
>>>>     provided=0xbfffe970) at ../../trunk/ompi/runtime/ompi_mpi_init.c:894
>>>> #17 0x001c9509 in PMPI_Init (argc=0xbfffe9c0, argv=0xbfffe9c4) at pinit.c:84