Nope. Vader will not work on non-xpmem systems in 1.7.4. The CMR is still open for 1.7.5 (#4053). Issues like the one George reported are why I chose to hold off on the new vader until 1.7.5.
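(Interim workaround, untested here and inferred from George's note quoted below: both problem areas can be excluded at run time with MCA parameters, for example

    mpirun --mca btl ^vader --mca coll ^ml -np 2 ./your_app

where ./your_app and -np 2 are placeholders. Excluding coll/ml should also keep bcol/basesmuma from being initialized, since in the backtrace below it is only reached through the ML collective module's comm_query path.)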
The fix is complete. At this point I am waiting on some feedback on changes to OMPI_CHECK_PACKAGE before committing.

-Nathan

On Mon, Jan 27, 2014 at 12:55:27PM -0800, Ralph Castain wrote:
> Just FWIW: I believe that problem did indeed make it over to 1.7.4, and that
> release is on "hold" pending your fix. So while I'm happy to hear about xpmem
> on SGI, please do let us release 1.7.4!
>
>
> On Jan 27, 2014, at 8:19 AM, Nathan Hjelm <hje...@lanl.gov> wrote:
>
> > Yup. It has to do with not having 64-bit atomic math. The fix is complete,
> > but I am working on another update to enable using xpmem on SGI
> > systems. I will push the changes once that is complete.
> >
> > -Nathan
> >
> > On Mon, Jan 27, 2014 at 04:00:08PM +0000, Jeff Squyres (jsquyres) wrote:
> >> Is this the same issue Absoft is seeing in 32-bit builds on the trunk?
> >> (i.e., 100% failure rate)
> >>
> >> http://mtt.open-mpi.org/index.php?do_redir=2142
> >>
> >>
> >> On Jan 27, 2014, at 10:38 AM, Nathan Hjelm <hje...@lanl.gov> wrote:
> >>
> >>> This shouldn't be affecting 1.7.4, since neither the vader nor the coll/ml
> >>> updates have been moved yet. As for the trunk, I am working on a 32-bit fix
> >>> for vader and it should be in later today. I will have to track down
> >>> what is going wrong in the basesmuma initialization.
> >>>
> >>> -Nathan
> >>>
> >>> On Sun, Jan 26, 2014 at 04:10:29PM +0100, George Bosilca wrote:
> >>>> I noticed two major issues on 32-bit machines. The first one is with
> >>>> the vader BTL and the second with the selection logic in basesmuma
> >>>> (bcol_basesmuma_bank_init_opti). Both versions are impacted: trunk and
> >>>> 1.7.
> >>>>
> >>>> If I turn off vader and bcol via the MCA parameters, everything runs just
> >>>> fine.
> >>>>
> >>>> George.
> >>>>
> >>>> ../trunk/configure --enable-debug --disable-mpi-cxx
> >>>> --disable-mpi-fortran --disable-io-romio
> >>>> --enable-contrib-no-build=vt,libtrace --enable-mpirun-prefix-by-default
> >>>>
> >>>>
> >>>> - Vader generates a segfault for any application, even with only 2
> >>>> processes, so this should be pretty easy to track.
> >>>>
> >>>> Program received signal SIGSEGV, Segmentation fault.
> >>>> (gdb) bt
> >>>> #0  0x00000000 in ?? ()
> >>>> #1  0x00ae43b3 in mca_btl_vader_poll_fifo ()
> >>>>     at ../../../../../trunk/ompi/mca/btl/vader/btl_vader_component.c:394
> >>>> #2  0x00ae444a in mca_btl_vader_component_progress ()
> >>>>     at ../../../../../trunk/ompi/mca/btl/vader/btl_vader_component.c:421
> >>>> #3  0x008fdb95 in opal_progress ()
> >>>>     at ../../trunk/opal/runtime/opal_progress.c:186
> >>>> #4  0x001961bc in ompi_request_default_test_some (count=13,
> >>>>     requests=0xb1f01d48, outcount=0xb2afb2d0, indices=0xb1f01f60,
> >>>>     statuses=0xb1f02178) at ../../trunk/ompi/request/req_test.c:316
> >>>> #5  0x001def9a in PMPI_Testsome (incount=13, requests=0xb1f01d48,
> >>>>     outcount=0xb2afb2d0, indices=0xb1f01f60, statuses=0xb1f02178)
> >>>>     at ptestsome.c:81
> >>>>
> >>>>
> >>>> - basesmuma overwrites memory. The results_array can't be released because
> >>>> the memory is corrupted. I did not have time to investigate too much, but
> >>>> it looks like pload_mgmt->data_bffs is either too small or somehow data
> >>>> is stored outside its boundaries.
> >>>>
> >>>> *** glibc detected ***
> >>>> /home/bosilca/unstable/dplasma/trunk/build/debug/dplasma/testing/testing_dpotrf:
> >>>> free(): invalid next size (fast): 0x081f0798 ***
> >>>>
> >>>> (gdb) bt
> >>>> #0  0x00130424 in __kernel_vsyscall ()
> >>>> #1  0x006bfb11 in raise () from /lib/libc.so.6
> >>>> #2  0x006c13ea in abort () from /lib/libc.so.6
> >>>> #3  0x006ff9d5 in __libc_message () from /lib/libc.so.6
> >>>> #4  0x00705e31 in malloc_printerr () from /lib/libc.so.6
> >>>> #5  0x00708571 in _int_free () from /lib/libc.so.6
> >>>> #6  0x00c02d0e in bcol_basesmuma_bank_init_opti (ml_module=0x81dfe60,
> >>>>     bcol_module=0xb30b3008, reg_data=0x81e6698)
> >>>>     at ../../../../../trunk/ompi/mca/bcol/basesmuma/bcol_basesmuma_buf_mgmt.c:472
> >>>> #7  0x00b7035f in mca_coll_ml_register_bcols (ml_module=0x81dfe60)
> >>>>     at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:513
> >>>> #8  0x00b70651 in ml_module_memory_initialization (ml_module=0x81dfe60)
> >>>>     at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:560
> >>>> #9  0x00b733fd in ml_discover_hierarchy (ml_module=0x81dfe60)
> >>>>     at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:1585
> >>>> #10 0x00b7786e in mca_coll_ml_comm_query (comm=0x8127da0, priority=0xbfffe558)
> >>>>     at ../../../../../trunk/ompi/mca/coll/ml/coll_ml_module.c:2998
> >>>> #11 0x00202ea6 in query_2_0_0 (component=0xbc6500, comm=0x8127da0,
> >>>>     priority=0xbfffe558, module=0xbfffe580)
> >>>>     at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:375
> >>>> #12 0x00202e7f in query (component=0xbc6500, comm=0x8127da0,
> >>>>     priority=0xbfffe558, module=0xbfffe580)
> >>>>     at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:358
> >>>> #13 0x00202d9e in check_one_component (comm=0x8127da0, component=0xbc6500,
> >>>>     module=0xbfffe580)
> >>>>     at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:320
> >>>> #14 0x00202bce in check_components (components=0x253d70, comm=0x8127da0)
> >>>>     at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:284
> >>>> #15 0x001fbbe1 in mca_coll_base_comm_select (comm=0x8127da0)
> >>>>     at ../../../../trunk/ompi/mca/coll/base/coll_base_comm_select.c:117
> >>>> #16 0x0019872f in ompi_mpi_init (argc=7, argv=0xbfffee74, requested=0,
> >>>>     provided=0xbfffe970) at ../../trunk/ompi/runtime/ompi_mpi_init.c:894
> >>>> #17 0x001c9509 in PMPI_Init (argc=0xbfffe9c0, argv=0xbfffe9c4) at pinit.c:84
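P.S. To make the "64-bit atomic math" point above concrete, here is a minimal, stand-alone sketch. It is not Open MPI's atomics layer; it only uses GCC's __sync builtin and the capability macro GCC defines when the target has a native 8-byte compare-and-swap, which is the kind of operation the 32-bit vader failure comes down to:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Illustration only: GCC defines __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8
     * only when a native 8-byte compare-and-swap exists on the target. */
#if defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_8)
    uint64_t v = 0;
    __sync_bool_compare_and_swap(&v, UINT64_C(0), UINT64_C(1));
    printf("native 64-bit CAS: yes (v = %llu)\n", (unsigned long long) v);
#else
    printf("native 64-bit CAS: no; 64-bit atomic math needs a fallback here\n");
#endif
    return 0;
}

Whether the macro is defined on a 32-bit build depends on the exact -m32/-march settings of the toolchain, so treat this purely as an illustration of the missing capability, not as a statement about any particular MTT host.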