Also, I should add that the HDF5 files appear to be written properly when run under "mpiexec -n 1", and valgrind doesn't report any bogus malloc/free calls or wild pointers. So I don't think it's a problem with how I've massaged the H5Z plugins or the PETSc code.
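For what it's worth, the plugin "massaging" is essentially just an allocation swap inside the filter callbacks, so that buffers handed back to the library get released with the matching allocator. A schematic sketch only: my_filter is a placeholder, the memcpy stands in for the real ZFP/SZIP (de)compression, and, as mentioned further down in the quoted thread, H5MMprivate.h has to be made visible to the plugin build:

#include <string.h>
#include "H5MMprivate.h"   /* H5MM_malloc / H5MM_xfree; normally not visible to plugins */

/* Sketch of a filter callback using the library allocator.  A real filter
 * branches on H5Z_FLAG_REVERSE and actually compresses or decompresses
 * instead of copying. */
static size_t
my_filter(unsigned int flags, size_t cd_nelmts, const unsigned int cd_values[],
          size_t nbytes, size_t *buf_size, void **buf)
{
    void *outbuf;

    (void)flags; (void)cd_nelmts; (void)cd_values;

    /* Allocate the replacement buffer with H5MM_malloc() rather than malloc() */
    if (NULL == (outbuf = H5MM_malloc(nbytes)))
        return 0;                      /* returning 0 signals filter failure */

    memcpy(outbuf, *buf, nbytes);      /* real (de)compression goes here */

    H5MM_xfree(*buf);                  /* release the old buffer the same way */
    *buf      = outbuf;
    *buf_size = nbytes;

    return nbytes;                     /* number of valid bytes now in *buf */
}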
On Wed, Nov 8, 2017 at 12:22 PM, Michael K. Edwards <m.k.edwa...@gmail.com> wrote:
> It's not even clear to me yet whether this is the same dataset that
> triggered the assert. Working on getting complete details. But FWIW
> the PETSc code does not call H5Sselect_none(). It calls
> H5Sselect_hyperslab() in all ranks, and that's why the ranks in which
> the slice is zero columns wide hit the "empty sel_chunks" pathway I
> added to H5D__create_chunk_mem_map_hyper().
>
>
> On Wed, Nov 8, 2017 at 12:02 PM, Michael K. Edwards <m.k.edwa...@gmail.com> wrote:
>> Thanks, Jordan. I recognize that this is very recent feature work and
>> my goal is to help push it forward.
>>
>> My current use case is relatively straightforward, though there are a
>> couple of layers on top of HDF5 itself. The problem can be reproduced
>> by building PETSc 3.8.1 against libraries built from the develop
>> branch of HDF5, adding in the H5Dset_filter() calls, and running an
>> example that exercises them. (I'm using
>> src/snes/examples/tutorials/ex12.c with the -dm_view_hierarchy flag to
>> induce HDF5 writes.) If you want, I can supply full details for you
>> to reproduce it locally, or I can do any experiments you'd like me to
>> within this setup. (It also involves patches to the out-of-tree H5Z
>> plugins to make them use H5MM_malloc/H5MM_xfree rather than raw
>> malloc/free, which in turn involves exposing H5MMprivate.h to the
>> plugins. Is this something you've solved in a different way?)
>>
>>
>> On Wed, Nov 8, 2017 at 11:44 AM, Jordan Henderson <jhender...@hdfgroup.org> wrote:
>>> Hi Michael,
>>>
>>> During the design phase of this feature I tried to both account for
>>> and test the case where some of the writers do not have any data to
>>> contribute. However, it seems like your use case falls outside of
>>> what I have tested (perhaps I have not used enough ranks?). In
>>> particular, my test cases were small and simply had some of the ranks
>>> call H5Sselect_none(), which doesn't seem to trigger this particular
>>> assertion failure. Is this how you're approaching these particular
>>> ranks in your code, or is there a different way you are having them
>>> participate in the write operation?
>>>
>>> As for the hanging issue, it looks as though rank 0 is waiting to
>>> receive some modification data from another rank for a particular
>>> chunk. Whether or not there is actually valid data that rank 0 should
>>> be waiting for, I cannot easily tell without being able to trace it
>>> through. As the other ranks have finished modifying their particular
>>> sets of chunks, they have moved on and are waiting for everyone to
>>> get together and broadcast their new chunk sizes so that free space
>>> in the file can be collectively re-allocated, but of course rank 0 is
>>> not proceeding forward.
>>> My best guess is that either:
>>>
>>> The "num_writers" field for the chunk struct corresponding to the
>>> particular chunk that rank 0 is working on has been incorrectly set,
>>> causing rank 0 to think that more ranks are writing to the chunk than
>>> actually are, and consequently to wait forever for a non-existent MPI
>>> message,
>>>
>>> or
>>>
>>> The "new_owner" field of the chunk struct for this chunk was
>>> incorrectly set on the other ranks, causing them to never issue an
>>> MPI_Isend to rank 0, again leaving rank 0 waiting for a non-existent
>>> MPI message.
>>>
>>> This feature should still be regarded as beta, and its complexity can
>>> lead to difficult-to-track-down bugs such as the ones you are
>>> currently encountering. That being said, your feedback is very useful
>>> and will help to push this feature towards a production-ready level
>>> of quality. Also, if it is feasible to come up with a minimal example
>>> that reproduces this issue, it would be very helpful and would make
>>> it much easier to diagnose why exactly these failures are occurring.
>>>
>>> Thanks,
>>> Jordan
>>>
>>> ________________________________
>>> From: Hdf-forum <hdf-forum-boun...@lists.hdfgroup.org> on behalf of Michael K. Edwards <m.k.edwa...@gmail.com>
>>> Sent: Wednesday, November 8, 2017 11:23 AM
>>> To: Miller, Mark C.
>>> Cc: HDF Users Discussion List
>>> Subject: Re: [Hdf-forum] Collective IO and filters
>>>
>>> Closer to 1000 ranks initially. There's a bug in handling the case
>>> where some of the writers don't have any data to contribute (because
>>> there's a dimension smaller than the number of ranks), which I have
>>> worked around like this:
>>>
>>> diff --git a/src/H5Dchunk.c b/src/H5Dchunk.c
>>> index af6599a..9522478 100644
>>> --- a/src/H5Dchunk.c
>>> +++ b/src/H5Dchunk.c
>>> @@ -1836,6 +1836,9 @@ H5D__create_chunk_mem_map_hyper(const H5D_chunk_map_t *fm)
>>>             /* Indicate that the chunk's memory space is shared */
>>>             chunk_info->mspace_shared = TRUE;
>>>         } /* end if */
>>> +       else if(H5SL_count(fm->sel_chunks)==0) {
>>> +           /* No chunks, because no local data; avoid HDassert(fm->m_ndims==fm->f_ndims) on null mem_space */
>>> +       } /* end else if */
>>>         else {
>>>             /* Get bounding box for file selection */
>>>             if(H5S_SELECT_BOUNDS(fm->file_space, file_sel_start, file_sel_end) < 0)
>>>
>>> That makes the assert go away.
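To make the zero-width case concrete, this is roughly what the selection looks like on a rank that owns no rows. It is a paraphrase for illustration only, not the actual PETSc code; select_my_rows, my_first_row, my_num_rows, and ncols are made-up names:

#include "hdf5.h"

/* Sketch of the selection pattern described above. */
static herr_t
select_my_rows(hid_t filespace, hsize_t my_first_row, hsize_t my_num_rows, hsize_t ncols)
{
    hsize_t offset[2] = { my_first_row, 0 };
    hsize_t count[2]  = { my_num_rows, ncols };   /* my_num_rows == 0 on some ranks */

    /* Every rank makes this call before the collective H5Dwrite().  Jordan's
     * tests had the empty ranks call H5Sselect_none(filespace) instead; the
     * zero-sized hyperslab below is what leaves fm->sel_chunks empty and,
     * without the patch above, trips the m_ndims==f_ndims assert. */
    return H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
}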
>>> Now I'm investigating a hang in the chunk redistribution logic in
>>> rank 0, with a backtrace that looks like this:
>>>
>>> #0  0x00007f4bd456a6c6 in psm2_mq_ipeek2 () from /lib64/libpsm2.so.2
>>> #1  0x00007f4bd5d3b341 in psm_progress_wait () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #2  0x00007f4bd5d3012d in MPID_Mprobe () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #3  0x00007f4bd5cbeeb4 in PMPI_Mprobe () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #4  0x00007f4bd81aadf6 in H5D__chunk_redistribute_shared_chunks (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0, local_chunk_array=0x17f0f80, local_chunk_array_num_entries=0x7ffdfb83d9f8) at H5Dmpio.c:3041
>>> #5  0x00007f4bd81a9696 in H5D__construct_filtered_io_info_list (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0, chunk_list=0x7ffdfb83daf0, num_entries=0x7ffdfb83db00) at H5Dmpio.c:2794
>>> #6  0x00007f4bd81a2d58 in H5D__link_chunk_filtered_collective_io (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0, dx_plist=0x16f7230) at H5Dmpio.c:1447
>>> #7  0x00007f4bd81a027d in H5D__chunk_collective_io (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0) at H5Dmpio.c:933
>>> #8  0x00007f4bd81a0968 in H5D__chunk_collective_write (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, nelmts=104, file_space=0x17e2dc0, mem_space=0x17dc770, fm=0x17eeec0) at H5Dmpio.c:1018
>>> #9  0x00007f4bd7ce3d63 in H5D__write (dataset=0x17e0010, mem_type_id=216172782113783851, mem_space=0x17dc770, file_space=0x17e2dc0, dxpl_id=720575940379279384, buf=0x17d6240) at H5Dio.c:835
>>> #10 0x00007f4bd7ce181c in H5D__pre_write (dset=0x17e0010, direct_write=false, mem_type_id=216172782113783851, mem_space=0x17dc770, file_space=0x17e2dc0, dxpl_id=720575940379279384, buf=0x17d6240) at H5Dio.c:394
>>> #11 0x00007f4bd7ce0fd1 in H5Dwrite (dset_id=360287970189639680, mem_type_id=216172782113783851, mem_space_id=288230376151711749, file_space_id=288230376151711750, dxpl_id=720575940379279384, buf=0x17d6240) at H5Dio.c:318
>>>
>>> The other ranks have moved past this and are hanging here:
>>>
>>> #0  0x00007feb6e6546c6 in psm2_mq_ipeek2 () from /lib64/libpsm2.so.2
>>> #1  0x00007feb6fe25341 in psm_progress_wait () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #2  0x00007feb6fdd8975 in MPIC_Wait () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #3  0x00007feb6fdd918b in MPIC_Sendrecv () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #4  0x00007feb6fcf0fda in MPIR_Allreduce_pt2pt_rd_MV2 () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #5  0x00007feb6fcf48ef in MPIR_Allreduce_index_tuned_intra_MV2 () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #6  0x00007feb6fca1534 in MPIR_Allreduce_impl () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #7  0x00007feb6fca1b93 in PMPI_Allreduce () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #8  0x00007feb72287c2a in H5D__mpio_array_gatherv (local_array=0x125f2d0, local_array_num_entries=0, array_entry_size=368, _gathered_array=0x7ffff083f1d8, _gathered_array_num_entries=0x7ffff083f1e8, nprocs=4, allgather=true, root=0, comm=-1006632952, sort_func=0x0) at H5Dmpio.c:479
>>> #9  0x00007feb7228cfb8 in H5D__link_chunk_filtered_collective_io (io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, fm=0x125d280, dx_plist=0x11cf240) at H5Dmpio.c:1479
>>> #10 0x00007feb7228a27d in H5D__chunk_collective_io (io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, fm=0x125d280) at H5Dmpio.c:933
>>> #11 0x00007feb7228a968 in H5D__chunk_collective_write (io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, nelmts=74, file_space=0x12514e0, mem_space=0x124b450, fm=0x125d280) at H5Dmpio.c:1018
>>> #12 0x00007feb71dcdd63 in H5D__write (dataset=0x124e7d0, mem_type_id=216172782113783851, mem_space=0x124b450, file_space=0x12514e0, dxpl_id=720575940379279384, buf=0x1244e80) at H5Dio.c:835
>>> #13 0x00007feb71dcb81c in H5D__pre_write (dset=0x124e7d0, direct_write=false, mem_type_id=216172782113783851, mem_space=0x124b450, file_space=0x12514e0, dxpl_id=720575940379279384, buf=0x1244e80) at H5Dio.c:394
>>> #14 0x00007feb71dcafd1 in H5Dwrite (dset_id=360287970189639680, mem_type_id=216172782113783851, mem_space_id=288230376151711749, file_space_id=288230376151711750, dxpl_id=720575940379279384, buf=0x1244e80) at H5Dio.c:318
>>>
>>> (I'm currently running with this patch atop commit bf570b1, on an
>>> earlier theory that the crashing bug may have crept in after Jordan's
>>> big merge. I'll rebase on current develop, but I doubt that'll change
>>> much.)
>>>
>>> The hang may or may not be directly related to the workaround being a
>>> bit of a hack. I can set you up with full reproduction details if you
>>> like; I seem to be getting some traction on it, but more eyeballs are
>>> always good, especially if they're better set up for MPI tracing than
>>> I am right now.
>>>
>>>
>>> On Wed, Nov 8, 2017 at 8:48 AM, Miller, Mark C. <mille...@llnl.gov> wrote:
>>>> Hi Michael,
>>>>
>>>> I have not tried this in parallel yet. That said, what scale are you
>>>> trying to do this at? 1000 ranks or 1,000,000 ranks? Something in
>>>> between?
>>>>
>>>> My understanding is that there are some known scaling issues out
>>>> past maybe 10,000 ranks. Not heard of outright assertion failures
>>>> there though.
>>>>
>>>> Mark
>>>>
>>>> "Hdf-forum on behalf of Michael K. Edwards" wrote:
>>>>
>>>> I'm trying to write an HDF5 file with dataset compression from an MPI
>>>> job. (Using PETSc 3.8 compiled against MVAPICH2, if that matters.)
>>>> After running into the "Parallel I/O does not support filters yet"
>>>> error message in release versions of HDF5, I have turned to the
>>>> develop branch. Clearly there has been much work towards collective
>>>> filtered IO in the run-up to a 1.11 (1.12?) release; equally clearly
>>>> it is not quite ready for prime time yet. So far I've encountered a
>>>> livelock scenario with ZFP, reproduced it with SZIP, and, with no
>>>> filters at all, obtained this nifty error message:
>>>>
>>>> ex12: H5Dchunk.c:1849: H5D__create_chunk_mem_map_hyper: Assertion
>>>> `fm->m_ndims==fm->f_ndims' failed.
>>>>
>>>> Has anyone on this list been able to write parallel HDF5 using a
>>>> recent state of the develop branch, with or without filters
>>>> configured?
>>>>
>>>> Thanks,
>>>> - Michael
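For anyone who wants to poke at this without PETSc in the way, the pattern being exercised boils down to a chunked, filtered dataset written through a collective transfer property list. A minimal sketch, with names and sizes invented for illustration and deflate standing in for the ZFP/SZIP plugins:

/* minimal-collective-filtered.c -- sketch only; build with an MPI compiler
 * wrapper against a parallel HDF5 build. */
#include <stdlib.h>
#include <mpi.h>
#include "hdf5.h"

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Open the file for parallel access */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("filtered.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One chunked, filtered 1-D dataset, 100 elements per rank */
    hsize_t per_rank = 100, dims = per_rank * (hsize_t)nprocs, chunk = per_rank;
    hid_t filespace = H5Screate_simple(1, &dims, NULL);
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, &chunk);
    H5Pset_deflate(dcpl, 6);                       /* any filter will do */
    hid_t dset = H5Dcreate2(file, "x", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Each rank writes its own contiguous slice, collectively */
    hsize_t start = per_rank * (hsize_t)rank;
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL, &per_rank, NULL);
    hid_t memspace = H5Screate_simple(1, &per_rank, NULL);
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    double *buf = malloc(per_rank * sizeof(double));
    for (hsize_t i = 0; i < per_rank; i++)
        buf[i] = (double)(start + i);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    free(buf);
    H5Pclose(dxpl); H5Sclose(memspace); H5Dclose(dset);
    H5Pclose(dcpl); H5Sclose(filespace); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}

A release build errors out of this with the "Parallel I/O does not support filters yet" message quoted earlier in the thread; the develop branch instead goes down the new filtered collective path.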
_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5