I opened a GitHub issue to track this:
https://github.com/open-mpi/ompi/issues/383
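
To illustrate the ordering problem described in the quoted message below, here is a minimal, self-contained C sketch. It is not Open MPI code: the toy_* names and the init_ret parameter are hypothetical stand-ins for opal_object_t, OBJ_CONSTRUCT and the return value of ompi_free_list_init(). Only the magic constant is taken from the assertion text in the log, and the sketch assumes that debug builds check that constant before destructing an object.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Constant copied from the assertion message in the quoted log. */
#define TOY_MAGIC_ID ((0xdeafbeedULL << 32) + 0xdeafbeedULL)

/* Hypothetical stand-in for an object that is tagged when constructed. */
typedef struct {
    uint64_t magic_id;
} toy_object_t;

/* Hypothetical stand-in for the component with its two members. */
typedef struct {
    toy_object_t requests;
    toy_object_t active_requests;
} toy_component_t;

void toy_construct(toy_object_t *obj) { obj->magic_id = TOY_MAGIC_ID; }

bool toy_is_constructed(const toy_object_t *obj)
{
    return obj->magic_id == TOY_MAGIC_ID;
}

/* Mirrors the debug-build check: destructing an unconstructed object aborts. */
void toy_destruct(toy_object_t *obj)
{
    assert(toy_is_constructed(obj));
    obj->magic_id = 0;
}

/* Original ordering: an early return skips the second constructor. */
int toy_open_original(toy_component_t *c, int init_ret)
{
    toy_construct(&c->requests);
    if (0 != init_ret) return init_ret;   /* init_ret models a failed init */
    toy_construct(&c->active_requests);
    return 0;
}

/* Patched ordering: both members are constructed before anything can fail. */
int toy_open_patched(toy_component_t *c, int init_ret)
{
    toy_construct(&c->requests);
    toy_construct(&c->active_requests);
    if (0 != init_ret) return init_ret;
    return 0;
}

/* True when a close function could destruct both members without aborting. */
bool toy_close_is_safe(const toy_component_t *c)
{
    return toy_is_constructed(&c->requests) &&
           toy_is_constructed(&c->active_requests);
}
```

With a nonzero init_ret, toy_open_original() leaves active_requests untagged, so a later toy_destruct() on it would hit the assertion, while toy_open_patched() leaves both members safe to destruct. As in the real patch, this only makes the close path safe; it does not explain why the initialization failed.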

--Nysal

On Fri, Feb 6, 2015 at 11:36 AM, Nysal Jan K A <jny...@gmail.com> wrote:

> It seems that ompi_free_list_init() failed in libnbc_open() for some
> reason. That would explain why mca_coll_libnbc_component.active_requests
> was never constructed, and hence the assertion failure in libnbc_close().
>
> This might help, but still doesn't explain why the free list
> initialization failed:
> diff --git a/ompi/mca/coll/libnbc/coll_libnbc_component.c b/ompi/mca/coll/libnbc/coll_libnbc_component.c
> index 1a2a81a..2d7b82c 100644
> --- a/ompi/mca/coll/libnbc/coll_libnbc_component.c
> +++ b/ompi/mca/coll/libnbc/coll_libnbc_component.c
> @@ -88,6 +88,7 @@ libnbc_open(void)
>      int ret;
>
>      OBJ_CONSTRUCT(&mca_coll_libnbc_component.requests, ompi_free_list_t);
> +    OBJ_CONSTRUCT(&mca_coll_libnbc_component.active_requests, opal_list_t);
>      ret = ompi_free_list_init(&mca_coll_libnbc_component.requests,
>                                sizeof(ompi_coll_libnbc_request_t),
>                                OBJ_CLASS(ompi_coll_libnbc_request_t),
> @@ -97,7 +98,6 @@ libnbc_open(void)
>                                NULL);
>      if (OMPI_SUCCESS != ret) return ret;
>
> -    OBJ_CONSTRUCT(&mca_coll_libnbc_component.active_requests, opal_list_t);
>      /* note: active comms is the number of communicators who have had
>         a non-blocking collective started */
>      mca_coll_libnbc_component.active_comms = 0;
>
> The underlying failure looks like a problem with detecting the proper L2
> cache line size on POWER.
> I'll take a look over the weekend.
>
> Regards
> --Nysal
>
> On Tue, Feb 3, 2015 at 8:58 AM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>> On a Linux/PPC64 system I see the failure below from a build of the
>> current master tarball.
>> This build was configured with
>>    --prefix=... --enable-debug \
>>   CFLAGS=-m64 --with-wrapper-cflags=-m64 \
>>   CXXFLAGS=-m64 --with-wrapper-cxxflags=-m64 \
>>   FCFLAGS=-m64 --with-wrapper-fcflags=-m64
>>
>> I am not sure if putting "-m64" in both the *FLAGS and wrapper flags is
>> required, but am confident the error is unrelated.
>>
>> -Paul
>>
>> $ mpirun -mca btl sm,self -np 2 examples/ring_c
>> [pcp-k-422:08534] mca: base: components_open: component coll / libnbc
>> open function failed
>> ring_c:
>> /home/phargrov/OMPI/openmpi-master-linux-ppc64/openmpi-dev-803-g5919b63/ompi/mca/coll/libnbc/coll_libnbc_component.c:118:
>> libnbc_close: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) ==
>> ((opal_object_t *)
>> (&mca_coll_libnbc_component.active_requests))->obj_magic_id' failed.
>> [pcp-k-422:08534] *** Process received signal ***
>> [pcp-k-422:08534] Signal: Aborted (6)
>> [pcp-k-422:08534] Signal code:  (-6)
>> [pcp-k-422:08534] [ 0] [0x3fff8bd90478]
>> [pcp-k-422:08534] [ 1] /lib64/libc.so.6(gsignal-0x155030)[0x3fff8b9fc510]
>> [pcp-k-422:08534] [ 2] /lib64/libc.so.6(abort-0x150094)[0x3fff8ba01be4]
>> [pcp-k-422:08534] [ 3] /lib64/libc.so.6(+0x572ac)[0x3fff8b9f22ac]
>> [pcp-k-422:08534] [ 4]
>> /lib64/libc.so.6(__assert_fail-0x15ddac)[0x3fff8b9f239c]
>> [pcp-k-422:08534] [ 5]
>> /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/openmpi/mca_coll_libnbc.so(+0x9088)[0x3fff8a190088]
>> [pcp-k-422:08534] [ 6]
>> /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/libopen-pal.so.0(mca_base_component_close-0xed5e8)[0x3fff8b758308]
>> [pcp-k-422:08534] [ 7]
>> /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/libopen-pal.so.0(+0xa9c5c)[0x3fff8b757c5c]
>> [pcp-k-422:08534] [ 8]
>> /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/libopen-pal.so.0(mca_base_framework_components_open-0xee088)[0x3fff8b757778]
>> [pcp-k-422:08534] [ 9]
>> /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/libopen-pal.so.0(mca_base_framework_open-0xdc3f8)[0x3fff8b76a620]
>> [pcp-k-422:08534] [10]
>> /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/libmpi.so.0(ompi_mpi_init-0x12d5fc)[0x3fff8bc33d14]
>> [pcp-k-422:08534] [11]
>> /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/libmpi.so.0(MPI_Init-0xe4734)[0x3fff8bc821bc]
>> [pcp-k-422:08534] [12] examples/ring_c[0x10000a20]
>> [pcp-k-422:08534] [13] /lib64/libc.so.6(+0x47b6c)[0x3fff8b9e2b6c]
>> [pcp-k-422:08534] [14]
>> /lib64/libc.so.6(__libc_start_main-0x16caf8)[0x3fff8b9e2d98]
>> [pcp-k-422:08534] *** End of error message ***
>> [pcp-k-422:08535] mca: base: components_open: component coll / libnbc
>> open function failed
>> ring_c:
>> /home/phargrov/OMPI/openmpi-master-linux-ppc64/openmpi-dev-803-g5919b63/ompi/mca/coll/libnbc/coll_libnbc_component.c:118:
>> libnbc_close: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) ==
>> ((opal_object_t *)
>> (&mca_coll_libnbc_component.active_requests))->obj_magic_id' failed.
>> [pcp-k-422:08535] *** Process received signal ***
>> [pcp-k-422:08535] Signal: Aborted (6)
>> [pcp-k-422:08535] Signal code:  (-6)
>> [pcp-k-422:08535] [ 0] [0x3fff99e30478]
>> [pcp-k-422:08535] [ 1] /lib64/libc.so.6(gsignal-0x155030)[0x3fff99a9c510]
>> [pcp-k-422:08535] [ 2] /lib64/libc.so.6(abort-0x150094)[0x3fff99aa1be4]
>> [pcp-k-422:08535] [ 3] /lib64/libc.so.6(+0x572ac)[0x3fff99a922ac]
>> [pcp-k-422:08535] [ 4]
>> /lib64/libc.so.6(__assert_fail-0x15ddac)[0x3fff99a9239c]
>> [pcp-k-422:08535] [ 5]
>> /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/openmpi/mca_coll_libnbc.so(+0x9088)[0x3fff98230088]
>> [pcp-k-422:08535] [ 6]
>> /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/libopen-pal.so.0(mca_base_component_close-0xed5e8)[0x3fff997f8308]
>> [pcp-k-422:08535] [ 7]
>> /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/libopen-pal.so.0(+0xa9c5c)[0x3fff997f7c5c]
>> [pcp-k-422:08535] [ 8]
>> /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/libopen-pal.so.0(mca_base_framework_components_open-0xee088)[0x3fff997f7778]
>> [pcp-k-422:08535] [ 9]
>> /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/libopen-pal.so.0(mca_base_framework_open-0xdc3f8)[0x3fff9980a620]
>> [pcp-k-422:08535] [10]
>> /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/libmpi.so.0(ompi_mpi_init-0x12d5fc)[0x3fff99cd3d14]
>> [pcp-k-422:08535] [11]
>> /home/phargrov/OMPI/openmpi-master-linux-ppc64/INST/lib/libmpi.so.0(MPI_Init-0xe4734)[0x3fff99d221bc]
>> [pcp-k-422:08535] [12] examples/ring_c[0x10000a20]
>> [pcp-k-422:08535] [13] /lib64/libc.so.6(+0x47b6c)[0x3fff99a82b6c]
>> [pcp-k-422:08535] [14]
>> /lib64/libc.so.6(__libc_start_main-0x16caf8)[0x3fff99a82d98]
>> [pcp-k-422:08535] *** End of error message ***
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 1 with PID 0 on node pcp-k-422 exited on
>> signal 6 (Aborted).
>> --------------------------------------------------------------------------
>>
>>
>>
>> --
>> Paul H. Hargrove                          phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department               Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2015/02/16902.php
>>
>
>
