That does appear to make it work.  So I guess the issue is somewhere in the
vader btl.  FWIW, I don't see any warnings when compiling the vader btl code.
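
For reference, re-running the test with vader disabled looks roughly like
this (binary name and rank count are taken from the backtrace below; the
actual invocation in the netcdf test harness may differ):

  mpirun --mca btl ^vader -np 4 ./tst_nc4perf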

On 03/13/2015 01:08 PM, George Bosilca wrote:
> Do you have the same behavior when you disable the vader BTL? (--mca btl ^vader).
> 
>   George.
> 
> 
> On Fri, Mar 13, 2015 at 2:20 PM, Orion Poplawski <or...@cora.nwra.com> wrote:
> 
>     We currently have openmpi-1.8.4-99-20150228 built in Fedora Rawhide.  I'm now
>     seeing crashes/hangs when running the netcdf test suite on i686.  Crashes
>     include:
> 
> 
>     [mock1:23702] *** An error occurred in MPI_Allreduce
>     [mock1:23702] *** reported by process [3653173249,1]
>     [mock1:23702] *** on communicator MPI COMMUNICATOR 7 DUP FROM 6
>     [mock1:23702] *** MPI_ERR_IN_STATUS: error code in status
>     [mock1:23702] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>     [mock1:23702] ***    and potentially your MPI job)
> 
>     and a similar one in MPI_Bcast.
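> 
>     In case it's useful, here is a minimal sketch I plan to try in order to
>     reproduce this outside of netcdf (untested; it just hammers MPI_Allreduce
>     on a duplicated communicator, since the error above is on a DUP'd
>     communicator):
> 
>     #include <mpi.h>
> 
>     int main(int argc, char **argv)
>     {
>         MPI_Comm dup;
>         int rank, sum, i;
> 
>         MPI_Init(&argc, &argv);
>         /* The failing communicator above is a DUP, so exercise a dup here too. */
>         MPI_Comm_dup(MPI_COMM_WORLD, &dup);
>         MPI_Comm_rank(dup, &rank);
>         for (i = 0; i < 10000; i++)
>             MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, dup);
>         MPI_Comm_free(&dup);
>         MPI_Finalize();
>         return 0;
>     }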
> 
>     Hangs (100% CPU) seem to be in opal_condition_wait() -> opal_progress(),
>     calling both mca_pml_ob1_progress and mca_btl_vader_component_progress.
> 
>     #0  mca_btl_vader_check_fboxes () at btl_vader_fbox.h:192
>     #1  mca_btl_vader_component_progress () at btl_vader_component.c:694
>     #2  0xf3971b69 in opal_progress () at runtime/opal_progress.c:187
>     #3  0xf40b4695 in opal_condition_wait (c=<optimized out>, m=<optimized out>)
>         at ../opal/threads/condition.h:78
>     #4  ompi_request_default_wait_all (count=6, requests=<optimized out>, statuses=0x0)
>         at request/req_wait.c:281
>     #5  0xf28bb5e7 in ompi_coll_tuned_alltoall_intra_basic_linear (sbuf=sbuf@entry=0xf7a2d328,
>         scount=scount@entry=1, sdtype=sdtype@entry=0xf4148240 <ompi_mpi_int>,
>         rbuf=rbuf@entry=0xf7af1920, rcount=rcount@entry=1,
>         rdtype=rdtype@entry=0xf4148240 <ompi_mpi_int>, comm=comm@entry=0xf7b051d8,
>         module=module@entry=0xf7a2b4d0) at coll_tuned_alltoall.c:700
>     #6  0xf28b4d08 in ompi_coll_tuned_alltoall_intra_dec_fixed (sbuf=0xf7a2d328, scount=1,
>         sdtype=0xf4148240 <ompi_mpi_int>, rbuf=0xf7af1920, rcount=1,
>         rdtype=0xf4148240 <ompi_mpi_int>, comm=0xf7b051d8, module=0xf7a2b4d0)
>         at coll_tuned_decision_fixed.c:130
>     #7  0xf40c7899 in PMPI_Alltoall (sendbuf=sendbuf@entry=0xf7a2d328,
>         sendcount=sendcount@entry=1, sendtype=sendtype@entry=0xf4148240 <ompi_mpi_int>,
>         recvbuf=recvbuf@entry=0xf7af1920, recvcount=recvcount@entry=1,
>         recvtype=recvtype@entry=0xf4148240 <ompi_mpi_int>, comm=0xf7b051d8) at palltoall.c:111
>     #8  0xe9780da0 in ADIOI_Calc_others_req (fd=fd@entry=0xf7b12640, count_my_req_procs=1,
>         count_my_req_per_proc=0xf7a2d328, my_req=0xf7b00750, nprocs=4, myrank=0,
>         count_others_req_procs_ptr=count_others_req_procs_ptr@entry=0xffbea6e8,
>         others_req_ptr=others_req_ptr@entry=0xffbea6cc) at adio/common/ad_aggregate.c:453
>     #9  0xe9796a14 in ADIOI_GEN_WriteStridedColl (fd=0xf7b12640, buf=0xf7aa0148, count=2440,
>         datatype=0xf4148840 <ompi_mpi_byte>, file_ptr_type=100, offset=0, status=0xffbea8b8,
>         error_code=0xffbea790) at adio/common/ad_write_coll.c:192
>     #10 0xe97779e0 in MPIOI_File_write_all (fh=fh@entry=0xf7b12640, offset=offset@entry=0,
>         file_ptr_type=file_ptr_type@entry=100, buf=buf@entry=0xf7aa0148, count=count@entry=2440,
>         datatype=datatype@entry=0xf4148840 <ompi_mpi_byte>,
>         myname=myname@entry=0xe97a9a1c <myname.9354> "MPI_FILE_WRITE_AT_ALL",
>         status=status@entry=0xffbea8b8) at mpi-io/write_all.c:116
>     #11 0xe9778176 in mca_io_romio_dist_MPI_File_write_at_all (fh=0xf7b12640,
>         offset=offset@entry=0, buf=buf@entry=0xf7aa0148, count=count@entry=2440,
>         datatype=datatype@entry=0xf4148840 <ompi_mpi_byte>, status=status@entry=0xffbea8b8)
>         at mpi-io/write_atall.c:55
>     #12 0xe9770bcc in mca_io_romio_file_write_at_all (fh=0xf7aa27c8, offset=0, buf=0xf7aa0148,
>         count=2440, datatype=0xf4148840 <ompi_mpi_byte>, status=0xffbea8b8)
>         at src/io_romio_file_write.c:61
>     #13 0xf40ff3ce in PMPI_File_write_at_all (fh=0xf7aa27c8, offset=0, buf=buf@entry=0xf7aa0148,
>         count=count@entry=2440, e=0xf4148840 <ompi_mpi_byte>, status=status@entry=0xffbea8b8)
>         at pfile_write_at_all.c:75
>     #14 0xf437a43c in H5FD_mpio_write (_file=_file@entry=0xf7b074a8,
>         type=type@entry=H5FD_MEM_DRAW, dxpl_id=167772177, addr=31780, size=size@entry=2440,
>         buf=buf@entry=0xf7aa0148) at ../../src/H5FDmpio.c:1840
>     #15 0xf4375cd5 in H5FD_write (file=0xf7b074a8, dxpl=0xf7a47d20, type=H5FD_MEM_DRAW,
>         addr=31780, size=size@entry=2440, buf=buf@entry=0xf7aa0148) at ../../src/H5FDint.c:245
>     #16 0xf4360932 in H5F__accum_write (fio_info=fio_info@entry=0xffbea9d4,
>         type=type@entry=H5FD_MEM_DRAW, addr=31780, size=size@entry=2440, buf=buf@entry=0xf7aa0148)
>         at ../../src/H5Faccum.c:824
>     #17 0xf436430c in H5F_block_write (f=0xf7a31860, type=type@entry=H5FD_MEM_DRAW, addr=31780,
>         size=size@entry=2440, dxpl_id=167772177, buf=0xf7aa0148) at ../../src/H5Fio.c:170
>     #18 0xf43413ee in H5D__mpio_select_write (io_info=0xffbeab60, type_info=0xffbeab1c,
>         mpi_buf_count=2440, file_space=0x0, mem_space=0x0) at ../../src/H5Dmpio.c:296
>     #19 0xf4341f33 in H5D__final_collective_io (mpi_buf_type=0xffbeaa7c, mpi_file_type=0xffbeaa78,
>         mpi_buf_count=<optimized out>, type_info=0xffbeab1c, io_info=0xffbeab60)
>         at ../../src/H5Dmpio.c:1444
>     #20 H5D__inter_collective_io (mem_space=0xf7a38120, file_space=0xf7a55590,
>         type_info=0xffbeab1c, io_info=0xffbeab60) at ../../src/H5Dmpio.c:1400
>     #21 H5D__contig_collective_write (io_info=0xffbeab60, type_info=0xffbeab1c, nelmts=610,
>         file_space=0xf7a55590, mem_space=0xf7a38120, fm=0xffbeace0) at ../../src/H5Dmpio.c:528
>     #22 0xf433ae8d in H5D__write (buf=0xf7aa0148, dxpl_id=167772177, file_space=0xf7a55590,
>         mem_space=0xf7a38120, mem_type_id=-140159600, dataset=0xf7a3eb40) at ../../src/H5Dio.c:787
>     #23 H5D__pre_write (dset=dset@entry=0xf7a3eb40, direct_write=<optimized out>,
>         mem_type_id=mem_type_id@entry=50331747, mem_space=mem_space@entry=0xf7a38120,
>         file_space=0xf7a55590, dxpl_id=dxpl_id@entry=167772177, buf=buf@entry=0xf7aa0148)
>         at ../../src/H5Dio.c:351
>     #24 0xf433b74c in H5Dwrite (dset_id=83886085, mem_type_id=50331747,
>         mem_space_id=mem_space_id@entry=67108867, file_space_id=file_space_id@entry=67108866,
>         dxpl_id=dxpl_id@entry=167772177, buf=buf@entry=0xf7aa0148) at ../../src/H5Dio.c:270
>     #25 0xf466b603 in nc4_put_vara (nc=0xf7a05c58, ncid=ncid@entry=65536, varid=varid@entry=3,
>         startp=startp@entry=0xffbf6a08, countp=countp@entry=0xffbf6a10,
>         mem_nc_type=mem_nc_type@entry=5, is_long=is_long@entry=0, data=data@entry=0xf7a07c40)
>         at ../../libsrc4/nc4hdf.c:788
>     #26 0xf4673c55 in nc4_put_vara_tc (mem_type_is_long=0, op=0xf7a07c40, countp=0xffbf6a10,
>         startp=0xffbf6a08, mem_type=5, varid=3, ncid=65536) at ../../libsrc4/nc4var.c:1429
>     #27 NC4_put_vara (ncid=65536, varid=3, startp=0xffbf6a08, countp=0xffbf6a10, op=0xf7a07c40,
>         memtype=5) at ../../libsrc4/nc4var.c:1565
>     #28 0xf460a377 in NC_put_vara (ncid=ncid@entry=65536, varid=varid@entry=3,
>         start=start@entry=0xffbf6a08, edges=edges@entry=0xffbf6a10, value=value@entry=0xf7a07c40,
>         memtype=memtype@entry=5) at ../../libdispatch/dvarput.c:79
>     #29 0xf460b541 in nc_put_vara_float (ncid=65536, varid=3, startp=0xffbf6a08,
>         countp=0xffbf6a10, op=0xf7a07c40) at ../../libdispatch/dvarput.c:655
>     #30 0xf77d06ed in test_pio_2d (cache_size=67108864, facc_type=8192, access_flag=1,
>         comm=0xf414d800 <ompi_mpi_comm_world>, info=0xf4154240 <ompi_mpi_info_null>, mpi_size=4,
>         mpi_rank=0, chunk_size=0xffbf76f4) at ../../nc_test4/tst_nc4perf.c:96
>     #31 0xf77cfdb1 in main (argc=1, argv=0xffbf7804) at ../../nc_test4/tst_nc4perf.c:299
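> 
>     If it would help, I can also attach gdb to the other stuck ranks and
>     collect their stacks, along the lines of:
> 
>     gdb -p <pid-of-stuck-rank>
>     (gdb) thread apply all bt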
> 
> 
>     Any suggestions as to where to look next would be greatly appreciated.
> 
>     --
>     Orion Poplawski
>     Technical Manager                     303-415-9701 x222
>     NWRA, Boulder/CoRA Office             FAX: 303-415-9702
>     3380 Mitchell Lane                       or...@nwra.com
>     Boulder, CO 80301                   http://www.nwra.com
>     _______________________________________________
>     devel mailing list
>     de...@open-mpi.org
>     Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>     Link to this post:
>     http://www.open-mpi.org/community/lists/devel/2015/03/17131.php
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/03/17132.php
> 


-- 
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                       or...@nwra.com
Boulder, CO 80301                   http://www.nwra.com
