I opened a GitHub issue to track this: https://github.com/open-mpi/ompi/issues/383
--Nysal

On Fri, Feb 6, 2015 at 11:36 AM, Nysal Jan K A <jny...@gmail.com> wrote:

> It seems the ompi_free_list_init() in libnbc_open() failed for some
> reason. That would explain why mca_coll_libnbc_component.active_requests is
> not initialized, and hence the crash in libnbc_close().
>
> This might help, but still doesn't explain why the free list
> initialization failed:
>
> diff --git a/ompi/mca/coll/libnbc/coll_libnbc_component.c b/ompi/mca/coll/libnbc/coll_libnbc_component.c
> index 1a2a81a..2d7b82c 100644
> --- a/ompi/mca/coll/libnbc/coll_libnbc_component.c
> +++ b/ompi/mca/coll/libnbc/coll_libnbc_component.c
> @@ -88,6 +88,7 @@ libnbc_open(void)
>      int ret;
>
>      OBJ_CONSTRUCT(&mca_coll_libnbc_component.requests, ompi_free_list_t);
> +    OBJ_CONSTRUCT(&mca_coll_libnbc_component.active_requests, opal_list_t);
>      ret = ompi_free_list_init(&mca_coll_libnbc_component.requests,
>                                sizeof(ompi_coll_libnbc_request_t),
>                                OBJ_CLASS(ompi_coll_libnbc_request_t),
> @@ -97,7 +98,6 @@ libnbc_open(void)
>                                NULL);
>      if (OMPI_SUCCESS != ret) return ret;
>
> -    OBJ_CONSTRUCT(&mca_coll_libnbc_component.active_requests, opal_list_t);
>      /* note: active comms is the number of communicators who have had
>         a non-blocking collective started */
>      mca_coll_libnbc_component.active_comms = 0;
>
> It looks like an issue with detecting the proper L2 cache line size on
> power.
> I'll take a look over the weekend.
>
> Regards
> --Nysal
>
> On Tue, Feb 3, 2015 at 8:58 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>> On a Linux/PPC64 system I see the failure below from a build of the
>> current master tarball.
>> This build was configured with
>>     --prefix=... --enable-debug \
>>     CFLAGS=-m64 --with-wrapper-cflags=-m64 \
>>     CXXFLAGS=-m64 --with-wrapper-cxxflags=-m64 \
>>     FCFLAGS=-m64 --with-wrapper-fcflags=-m64
>>
>> I am not sure if putting "-m64" in both the *FLAGS and wrapper flags is
>> required, but am confident the error is unrelated.
>>
>> -Paul
>>
>> $ mpirun -mca btl sm,self -np 2 examples/ring_c
>> [pcp-k-422:08534] mca: base: components_open: component coll / libnbc open function failed
>> ring_c: /home/phargrov/OMPI/openmpi-master-linux-ppc64/openmpi-dev-803-g5919b63/ompi/mca/coll/libnbc/coll_libnbc_component.c:118: libnbc_close: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (&mca_coll_libnbc_component.active_requests))->obj_magic_id' failed.
>> [pcp-k-422:08534] *** Process received signal ***
>> [pcp-k-422:08534] Signal: Aborted (6)
>> [pcp-k-422:08534] Signal code: (-6)
>> [pcp-k-422:08534] [ 0] [0x3fff8bd90478]
>> [pcp-k-422:08534] [ 1] /lib64/libc.so.6(gsignal-0x155030)[0x3fff8b9fc510]
>> [pcp-k-422:08534] [ 2] /lib64/libc.so.6(abort-0x150094)[0x3fff8ba01be4]
>> [pcp-k-422:08534] [ 3] /lib64/libc.so.6(+0x572ac)[0x3fff8b9f22ac]
>> [pcp-k-422:08534] [ 4] /lib64/libc.so.6(__assert_fail-0x15ddac)[0x3fff8b9f239c]
>> [pcp-k-422:08534] [ 5] /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/openmpi/mca_coll_libnbc.so(+0x9088)[0x3fff8a190088]
>> [pcp-k-422:08534] [ 6] /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/libopen-pal.so.0(mca_base_component_close-0xed5e8)[0x3fff8b758308]
>> [pcp-k-422:08534] [ 7] /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/libopen-pal.so.0(+0xa9c5c)[0x3fff8b757c5c]
>> [pcp-k-422:08534] [ 8] /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/libopen-pal.so.0(mca_base_framework_components_open-0xee088)[0x3fff8b757778]
>> [pcp-k-422:08534] [ 9] /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/libopen-pal.so.0(mca_base_framework_open-0xdc3f8)[0x3fff8b76a620]
>> [pcp-k-422:08534] [10] /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/libmpi.so.0(ompi_mpi_init-0x12d5fc)[0x3fff8bc33d14]
>> [pcp-k-422:08534] [11] /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/libmpi.so.0(MPI_Init-0xe4734)[0x3fff8bc821bc]
>> [pcp-k-422:08534] [12] examples/ring_c[0x10000a20]
>> [pcp-k-422:08534] [13] /lib64/libc.so.6(+0x47b6c)[0x3fff8b9e2b6c]
>> [pcp-k-422:08534] [14] /lib64/libc.so.6(__libc_start_main-0x16caf8)[0x3fff8b9e2d98]
>> [pcp-k-422:08534] *** End of error message ***
>> [pcp-k-422:08535] mca: base: components_open: component coll / libnbc open function failed
>> ring_c: /home/phargrov/OMPI/openmpi-master-linux-ppc64/openmpi-dev-803-g5919b63/ompi/mca/coll/libnbc/coll_libnbc_component.c:118: libnbc_close: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (&mca_coll_libnbc_component.active_requests))->obj_magic_id' failed.
>> [pcp-k-422:08535] *** Process received signal ***
>> [pcp-k-422:08535] Signal: Aborted (6)
>> [pcp-k-422:08535] Signal code: (-6)
>> [pcp-k-422:08535] [ 0] [0x3fff99e30478]
>> [pcp-k-422:08535] [ 1] /lib64/libc.so.6(gsignal-0x155030)[0x3fff99a9c510]
>> [pcp-k-422:08535] [ 2] /lib64/libc.so.6(abort-0x150094)[0x3fff99aa1be4]
>> [pcp-k-422:08535] [ 3] /lib64/libc.so.6(+0x572ac)[0x3fff99a922ac]
>> [pcp-k-422:08535] [ 4] /lib64/libc.so.6(__assert_fail-0x15ddac)[0x3fff99a9239c]
>> [pcp-k-422:08535] [ 5] /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/openmpi/mca_coll_libnbc.so(+0x9088)[0x3fff98230088]
>> [pcp-k-422:08535] [ 6] /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/libopen-pal.so.0(mca_base_component_close-0xed5e8)[0x3fff997f8308]
>> [pcp-k-422:08535] [ 7] /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/libopen-pal.so.0(+0xa9c5c)[0x3fff997f7c5c]
>> [pcp-k-422:08535] [ 8] /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/libopen-pal.so.0(mca_base_framework_components_open-0xee088)[0x3fff997f7778]
>> [pcp-k-422:08535] [ 9] /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/libopen-pal.so.0(mca_base_framework_open-0xdc3f8)[0x3fff9980a620]
>> [pcp-k-422:08535] [10] /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/libmpi.so.0(ompi_mpi_init-0x12d5fc)[0x3fff99cd3d14]
>> [pcp-k-422:08535] [11] /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/libmpi.so.0(MPI_Init-0xe4734)[0x3fff99d221bc]
>> [pcp-k-422:08535] [12] examples/ring_c[0x10000a20]
>> [pcp-k-422:08535] [13] /lib64/libc.so.6(+0x47b6c)[0x3fff99a82b6c]
>> [pcp-k-422:08535] [14] /lib64/libc.so.6(__libc_start_main-0x16caf8)[0x3fff99a82d98]
>> [pcp-k-422:08535] *** End of error message ***
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 1 with PID 0 on node pcp-k-422 exited on signal 6 (Aborted).
>> --------------------------------------------------------------------------
>>
>> --
>> Paul H. Hargrove                          phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department               Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2015/02/16902.php
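For anyone reading this thread later, the reordering in the quoted patch can be illustrated with a small standalone sketch. It is not Open MPI code: every name below (sketch_object_t, sketch_free_list_init, open_before_patch, and so on) is hypothetical, and the magic value is simply copied from the assertion message in the backtrace. The assumption, taken from the backtrace, is that the framework calls the component's close function even when its open function fails, so close must find every member already constructed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a debug-build object header.
 * The value mirrors the one printed in the assertion above. */
#define SKETCH_MAGIC ((UINT64_C(0xdeafbeed) << 32) + UINT64_C(0xdeafbeed))

typedef struct {
    uint64_t magic;
} sketch_object_t;

typedef struct {
    sketch_object_t requests;
    sketch_object_t active_requests;
} sketch_component_t;

static void sketch_construct(sketch_object_t *obj)
{
    obj->magic = SKETCH_MAGIC;
}

static int sketch_is_constructed(const sketch_object_t *obj)
{
    return SKETCH_MAGIC == obj->magic;
}

/* Destructing an object that was never constructed trips the assert,
 * which is the failure mode reported in this thread. */
static void sketch_destruct(sketch_object_t *obj)
{
    assert(sketch_is_constructed(obj));
    obj->magic = 0;
}

/* Hypothetical fallible initialization step; returns 0 on success. */
static int sketch_free_list_init(int should_fail)
{
    return should_fail ? -1 : 0;
}

/* Ordering before the patch: an early return on failure leaves
 * active_requests unconstructed. */
static int open_before_patch(sketch_component_t *c, int init_fails)
{
    sketch_construct(&c->requests);
    if (0 != sketch_free_list_init(init_fails)) {
        return -1;
    }
    sketch_construct(&c->active_requests);
    return 0;
}

/* Ordering after the patch: construct everything first, then run the
 * step that can fail. */
static int open_after_patch(sketch_component_t *c, int init_fails)
{
    sketch_construct(&c->requests);
    sketch_construct(&c->active_requests);
    return sketch_free_list_init(init_fails);
}

/* Reports whether close could destruct both members safely. */
static int close_is_safe(const sketch_component_t *c)
{
    return sketch_is_constructed(&c->requests) &&
           sketch_is_constructed(&c->active_requests);
}

static void sketch_close(sketch_component_t *c)
{
    sketch_destruct(&c->active_requests);
    sketch_destruct(&c->requests);
}
```

With a zero-initialized component and a forced init failure, open_before_patch() leaves close_is_safe() false, while open_after_patch() leaves it true and sketch_close() runs without asserting. As noted in the quoted message, this only avoids the assertion in close; it does not explain why the free list initialization failed in the first place.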