https://github.com/open-mpi/ompi/issues/473 filed.


> On Mar 13, 2015, at 4:28 PM, Orion Poplawski <or...@cora.nwra.com> wrote:
> 
> That does appear to make it work.  So I guess the issue is in the vader BTL
> somewhere.  FWIW, I don't see any warnings when compiling the vader BTL code.
> 
> On 03/13/2015 01:08 PM, George Bosilca wrote:
>> Do you have the same behavior when you disable the vader BTL (--mca btl ^vader)?
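>> 
>> For example, with the test binary from the backtrace below and the same 4 ranks,
>> something like:
>> 
>>     mpirun -np 4 --mca btl ^vader ./tst_nc4perf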
>> 
>>  George.
>> 
>> 
>> On Fri, Mar 13, 2015 at 2:20 PM, Orion Poplawski <or...@cora.nwra.com> wrote:
>> 
>>    We currently have openmpi-1.8.4-99-20150228 built in Fedora Rawhide.  I'm now
>>    seeing crashes/hangs when running the netcdf test suite on i686.  Crashes
>>    include:
>> 
>> 
>>    [mock1:23702] *** An error occurred in MPI_Allreduce
>>    [mock1:23702] *** reported by process [3653173249,1]
>>    [mock1:23702] *** on communicator MPI COMMUNICATOR 7 DUP FROM 6
>>    [mock1:23702] *** MPI_ERR_IN_STATUS: error code in status
>>    [mock1:23702] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>    [mock1:23702] ***    and potentially your MPI job)
>> 
>>    and a similar one in MPI_Bcast.
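>> 
>>    FWIW, the failing call is just a collective on a DUP'd communicator, so a
>>    minimal sketch along these lines should exercise the same path (an
>>    illustration, not the actual netcdf test):
>> 
>>        #include <mpi.h>
>> 
>>        int main(int argc, char **argv)
>>        {
>>            int rank, sum;
>>            MPI_Comm dupcomm;
>> 
>>            MPI_Init(&argc, &argv);
>>            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>            /* the error above was reported on a DUP of another communicator */
>>            MPI_Comm_dup(MPI_COMM_WORLD, &dupcomm);
>>            MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, dupcomm);
>>            MPI_Comm_free(&dupcomm);
>>            MPI_Finalize();
>>            return 0;
>>        }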
>> 
>>    Hangs (100% CPU) seem to be in opal_condition_wait() -> opal_progress(),
>>    which calls both mca_pml_ob1_progress and mca_btl_vader_component_progress.
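>> 
>>    (For context, when Open MPI is built without progress threads,
>>    opal_condition_wait() essentially busy-polls the registered progress
>>    callbacks, a rough sketch of the pattern rather than the exact 1.8 source:
>> 
>>        while (c->c_signaled == 0) {
>>            opal_progress();   /* polls each registered component, e.g. ob1 and vader */
>>        }
>> 
>>    which is why a request that never completes shows up as 100% CPU inside
>>    opal_progress().)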
>> 
>>    #0  mca_btl_vader_check_fboxes () at btl_vader_fbox.h:192
>>    #1  mca_btl_vader_component_progress () at btl_vader_component.c:694
>>    #2  0xf3971b69 in opal_progress () at runtime/opal_progress.c:187
>>    #3  0xf40b4695 in opal_condition_wait (c=<optimized out>, m=<optimized out>)
>>        at ../opal/threads/condition.h:78
>>    #4  ompi_request_default_wait_all (count=6, requests=<optimized out>, statuses=0x0)
>>        at request/req_wait.c:281
>>    #5  0xf28bb5e7 in ompi_coll_tuned_alltoall_intra_basic_linear (sbuf=sbuf@entry=0xf7a2d328,
>>        scount=scount@entry=1, sdtype=sdtype@entry=0xf4148240 <ompi_mpi_int>,
>>        rbuf=rbuf@entry=0xf7af1920, rcount=rcount@entry=1,
>>        rdtype=rdtype@entry=0xf4148240 <ompi_mpi_int>, comm=comm@entry=0xf7b051d8,
>>        module=module@entry=0xf7a2b4d0) at coll_tuned_alltoall.c:700
>>    #6  0xf28b4d08 in ompi_coll_tuned_alltoall_intra_dec_fixed (sbuf=0xf7a2d328, scount=1,
>>        sdtype=0xf4148240 <ompi_mpi_int>, rbuf=0xf7af1920, rcount=1,
>>        rdtype=0xf4148240 <ompi_mpi_int>, comm=0xf7b051d8, module=0xf7a2b4d0)
>>        at coll_tuned_decision_fixed.c:130
>>    #7  0xf40c7899 in PMPI_Alltoall (sendbuf=sendbuf@entry=0xf7a2d328,
>>        sendcount=sendcount@entry=1, sendtype=sendtype@entry=0xf4148240 <ompi_mpi_int>,
>>        recvbuf=recvbuf@entry=0xf7af1920, recvcount=recvcount@entry=1,
>>        recvtype=recvtype@entry=0xf4148240 <ompi_mpi_int>, comm=0xf7b051d8) at palltoall.c:111
>>    #8  0xe9780da0 in ADIOI_Calc_others_req (fd=fd@entry=0xf7b12640, count_my_req_procs=1,
>>        count_my_req_per_proc=0xf7a2d328, my_req=0xf7b00750, nprocs=4, myrank=0,
>>        count_others_req_procs_ptr=count_others_req_procs_ptr@entry=0xffbea6e8,
>>        others_req_ptr=others_req_ptr@entry=0xffbea6cc) at adio/common/ad_aggregate.c:453
>>    #9  0xe9796a14 in ADIOI_GEN_WriteStridedColl (fd=0xf7b12640, buf=0xf7aa0148, count=2440,
>>        datatype=0xf4148840 <ompi_mpi_byte>, file_ptr_type=100, offset=0, status=0xffbea8b8,
>>        error_code=0xffbea790) at adio/common/ad_write_coll.c:192
>>    #10 0xe97779e0 in MPIOI_File_write_all (fh=fh@entry=0xf7b12640, offset=offset@entry=0,
>>        file_ptr_type=file_ptr_type@entry=100, buf=buf@entry=0xf7aa0148,
>>        count=count@entry=2440, datatype=datatype@entry=0xf4148840 <ompi_mpi_byte>,
>>        myname=myname@entry=0xe97a9a1c <myname.9354> "MPI_FILE_WRITE_AT_ALL",
>>        status=status@entry=0xffbea8b8) at mpi-io/write_all.c:116
>>    #11 0xe9778176 in mca_io_romio_dist_MPI_File_write_at_all (fh=0xf7b12640,
>>        offset=offset@entry=0, buf=buf@entry=0xf7aa0148, count=count@entry=2440,
>>        datatype=datatype@entry=0xf4148840 <ompi_mpi_byte>, status=status@entry=0xffbea8b8)
>>        at mpi-io/write_atall.c:55
>>    #12 0xe9770bcc in mca_io_romio_file_write_at_all (fh=0xf7aa27c8, offset=0, buf=0xf7aa0148,
>>        count=2440, datatype=0xf4148840 <ompi_mpi_byte>, status=0xffbea8b8)
>>        at src/io_romio_file_write.c:61
>>    #13 0xf40ff3ce in PMPI_File_write_at_all (fh=0xf7aa27c8, offset=0, buf=buf@entry=0xf7aa0148,
>>        count=count@entry=2440, e=0xf4148840 <ompi_mpi_byte>, status=status@entry=0xffbea8b8)
>>        at pfile_write_at_all.c:75
>>    #14 0xf437a43c in H5FD_mpio_write (_file=_file@entry=0xf7b074a8,
>>        type=type@entry=H5FD_MEM_DRAW, dxpl_id=167772177, addr=31780, size=size@entry=2440,
>>        buf=buf@entry=0xf7aa0148) at ../../src/H5FDmpio.c:1840
>>    #15 0xf4375cd5 in H5FD_write (file=0xf7b074a8, dxpl=0xf7a47d20, type=H5FD_MEM_DRAW,
>>        addr=31780, size=size@entry=2440, buf=buf@entry=0xf7aa0148) at ../../src/H5FDint.c:245
>>    #16 0xf4360932 in H5F__accum_write (fio_info=fio_info@entry=0xffbea9d4,
>>        type=type@entry=H5FD_MEM_DRAW, addr=31780, size=size@entry=2440,
>>        buf=buf@entry=0xf7aa0148) at ../../src/H5Faccum.c:824
>>    #17 0xf436430c in H5F_block_write (f=0xf7a31860, type=type@entry=H5FD_MEM_DRAW,
>>        addr=31780, size=size@entry=2440, dxpl_id=167772177, buf=0xf7aa0148)
>>        at ../../src/H5Fio.c:170
>>    #18 0xf43413ee in H5D__mpio_select_write (io_info=0xffbeab60, type_info=0xffbeab1c,
>>        mpi_buf_count=2440, file_space=0x0, mem_space=0x0) at ../../src/H5Dmpio.c:296
>>    #19 0xf4341f33 in H5D__final_collective_io (mpi_buf_type=0xffbeaa7c,
>>        mpi_file_type=0xffbeaa78, mpi_buf_count=<optimized out>, type_info=0xffbeab1c,
>>        io_info=0xffbeab60) at ../../src/H5Dmpio.c:1444
>>    #20 H5D__inter_collective_io (mem_space=0xf7a38120, file_space=0xf7a55590,
>>        type_info=0xffbeab1c, io_info=0xffbeab60) at ../../src/H5Dmpio.c:1400
>>    #21 H5D__contig_collective_write (io_info=0xffbeab60, type_info=0xffbeab1c, nelmts=610,
>>        file_space=0xf7a55590, mem_space=0xf7a38120, fm=0xffbeace0) at ../../src/H5Dmpio.c:528
>>    #22 0xf433ae8d in H5D__write (buf=0xf7aa0148, dxpl_id=167772177, file_space=0xf7a55590,
>>        mem_space=0xf7a38120, mem_type_id=-140159600, dataset=0xf7a3eb40)
>>        at ../../src/H5Dio.c:787
>>    #23 H5D__pre_write (dset=dset@entry=0xf7a3eb40, direct_write=<optimized out>,
>>        mem_type_id=mem_type_id@entry=50331747, mem_space=mem_space@entry=0xf7a38120,
>>        file_space=0xf7a55590, dxpl_id=dxpl_id@entry=167772177, buf=buf@entry=0xf7aa0148)
>>        at ../../src/H5Dio.c:351
>>    #24 0xf433b74c in H5Dwrite (dset_id=83886085, mem_type_id=50331747,
>>        mem_space_id=mem_space_id@entry=67108867, file_space_id=file_space_id@entry=67108866,
>>        dxpl_id=dxpl_id@entry=167772177, buf=buf@entry=0xf7aa0148) at ../../src/H5Dio.c:270
>>    #25 0xf466b603 in nc4_put_vara (nc=0xf7a05c58, ncid=ncid@entry=65536, varid=varid@entry=3,
>>        startp=startp@entry=0xffbf6a08, countp=countp@entry=0xffbf6a10,
>>        mem_nc_type=mem_nc_type@entry=5, is_long=is_long@entry=0, data=data@entry=0xf7a07c40)
>>        at ../../libsrc4/nc4hdf.c:788
>>    #26 0xf4673c55 in nc4_put_vara_tc (mem_type_is_long=0, op=0xf7a07c40, countp=0xffbf6a10,
>>        startp=0xffbf6a08, mem_type=5, varid=3, ncid=65536) at ../../libsrc4/nc4var.c:1429
>>    #27 NC4_put_vara (ncid=65536, varid=3, startp=0xffbf6a08, countp=0xffbf6a10, op=0xf7a07c40,
>>        memtype=5) at ../../libsrc4/nc4var.c:1565
>>    #28 0xf460a377 in NC_put_vara (ncid=ncid@entry=65536, varid=varid@entry=3,
>>        start=start@entry=0xffbf6a08, edges=edges@entry=0xffbf6a10,
>>        value=value@entry=0xf7a07c40, memtype=memtype@entry=5) at ../../libdispatch/dvarput.c:79
>>    #29 0xf460b541 in nc_put_vara_float (ncid=65536, varid=3, startp=0xffbf6a08,
>>        countp=0xffbf6a10, op=0xf7a07c40) at ../../libdispatch/dvarput.c:655
>>    #30 0xf77d06ed in test_pio_2d (cache_size=67108864, facc_type=8192, access_flag=1,
>>        comm=0xf414d800 <ompi_mpi_comm_world>, info=0xf4154240 <ompi_mpi_info_null>,
>>        mpi_size=4, mpi_rank=0, chunk_size=0xffbf76f4) at ../../nc_test4/tst_nc4perf.c:96
>>    #31 0xf77cfdb1 in main (argc=1, argv=0xffbf7804) at ../../nc_test4/tst_nc4perf.c:299
>> 
>> 
>>    Any suggestions as to where to look next would be greatly appreciated.
>> 
>>    --
>>    Orion Poplawski
>>    Technical Manager                     303-415-9701 x222
>>    NWRA, Boulder/CoRA Office             FAX: 303-415-9702
>>    3380 Mitchell Lane                       or...@nwra.com
>>    Boulder, CO 80301                   http://www.nwra.com
>>    _______________________________________________
>>    devel mailing list
>>    de...@open-mpi.org
>>    Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>    Link to this post: http://www.open-mpi.org/community/lists/devel/2015/03/17131.php
>> 
>> 
>> 
>> 
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/03/17132.php
>> 
> 
> 
> -- 
> Orion Poplawski
> Technical Manager                     303-415-9701 x222
> NWRA, Boulder/CoRA Office             FAX: 303-415-9702
> 3380 Mitchell Lane                       or...@nwra.com
> Boulder, CO 80301                   http://www.nwra.com
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/03/17133.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
