Also, I should add that the HDF5 files appear to be written properly when run under "mpiexec -n 1", and valgrind doesn't report any bogus malloc/free calls or wild pointers. So I don't think it's a problem with how I've massaged the H5Z plugins or the PETSc code.
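For what it's worth, the plugin "massaging" is essentially just an allocation swap inside the filter callbacks, so that buffers handed back to the library get released with the matching allocator. A schematic sketch only: my_filter is a placeholder, the memcpy stands in for the real ZFP/SZIP (de)compression, and, as mentioned further down in the quoted thread, H5MMprivate.h has to be made visible to the plugin build:

#include <string.h>
#include "H5MMprivate.h"   /* H5MM_malloc / H5MM_xfree; normally not visible to plugins */

/* Sketch of a filter callback using the library allocator.  A real filter
 * branches on H5Z_FLAG_REVERSE and actually compresses or decompresses
 * instead of copying. */
static size_t
my_filter(unsigned int flags, size_t cd_nelmts, const unsigned int cd_values[],
          size_t nbytes, size_t *buf_size, void **buf)
{
    void *outbuf;

    (void)flags; (void)cd_nelmts; (void)cd_values;

    /* Allocate the replacement buffer with H5MM_malloc() rather than malloc() */
    if (NULL == (outbuf = H5MM_malloc(nbytes)))
        return 0;                      /* returning 0 signals filter failure */

    memcpy(outbuf, *buf, nbytes);      /* real (de)compression goes here */

    H5MM_xfree(*buf);                  /* release the old buffer the same way */
    *buf      = outbuf;
    *buf_size = nbytes;

    return nbytes;                     /* number of valid bytes now in *buf */
}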
On Wed, Nov 8, 2017 at 12:22 PM, Michael K. Edwards <m.k.edwa...@gmail.com> wrote:
> It's not even clear to me yet whether this is the same dataset that
> triggered the assert. Working on getting complete details. But FWIW
> the PETSc code does not call H5Sselect_none(). It calls
> H5Sselect_hyperslab() in all ranks, and that's why the ranks in which
> the slice is zero columns wide hit the "empty sel_chunks" pathway I
> added to H5D__create_chunk_mem_map_hyper().
>
>
> On Wed, Nov 8, 2017 at 12:02 PM, Michael K. Edwards <m.k.edwa...@gmail.com> wrote:
>> Thanks, Jordan. I recognize that this is very recent feature work and
>> my goal is to help push it forward.
>>
>> My current use case is relatively straightforward, though there are a
>> couple of layers on top of HDF5 itself. The problem can be reproduced
>> by building PETSc 3.8.1 against libraries built from the develop
>> branch of HDF5, adding in the H5Dset_filter() calls, and running an
>> example that exercises them. (I'm using
>> src/snes/examples/tutorials/ex12.c with the -dm_view_hierarchy flag to
>> induce HDF5 writes.) If you want, I can supply full details for you
>> to reproduce it locally, or I can do any experiments you'd like me to
>> within this setup. (It also involves patches to the out-of-tree H5Z
>> plugins to make them use H5MM_malloc/H5MM_xfree rather than raw
>> malloc/free, which in turn involves exposing H5MMprivate.h to the
>> plugins. Is this something you've solved in a different way?)
>>
>>
>> On Wed, Nov 8, 2017 at 11:44 AM, Jordan Henderson <jhender...@hdfgroup.org> wrote:
>>> Hi Michael,
>>>
>>> During the design phase of this feature I tried to both account for
>>> and test the case where some of the writers do not have any data to
>>> contribute. However, it seems like your use case falls outside of
>>> what I have tested (perhaps I have not used enough ranks?). In
>>> particular, my test cases were small and simply had some of the ranks
>>> call H5Sselect_none(), which doesn't seem to trigger this particular
>>> assertion failure. Is this how you're approaching these particular
>>> ranks in your code, or is there a different way you are having them
>>> participate in the write operation?
>>>
>>> As for the hanging issue, it looks as though rank 0 is waiting to
>>> receive some modification data from another rank for a particular
>>> chunk. Whether or not there is actually valid data that rank 0 should
>>> be waiting for, I cannot easily tell without being able to trace it
>>> through. As the other ranks have finished modifying their particular
>>> sets of chunks, they have moved on and are waiting for everyone to
>>> get together and broadcast their new chunk sizes so that free space
>>> in the file can be collectively re-allocated, but of course rank 0 is
>>> not proceeding forward.
>>> My best guess is that either:
>>>
>>> The "num_writers" field for the chunk struct corresponding to the
>>> particular chunk that rank 0 is working on has been incorrectly set,
>>> causing rank 0 to think that more ranks are writing to the chunk than
>>> actually are, and consequently to wait forever for a non-existent MPI
>>> message,
>>>
>>> or
>>>
>>> The "new_owner" field of the chunk struct for this chunk was
>>> incorrectly set on the other ranks, causing them to never issue an
>>> MPI_Isend to rank 0, again leaving rank 0 waiting for a non-existent
>>> MPI message.
>>>
>>> This feature should still be regarded as beta, and its complexity can
>>> lead to difficult-to-track-down bugs such as the ones you are
>>> currently encountering. That being said, your feedback is very useful
>>> and will help to push this feature towards a production-ready level
>>> of quality. Also, if it is feasible to come up with a minimal example
>>> that reproduces this issue, it would be very helpful and would make
>>> it much easier to diagnose why exactly these failures are occurring.
>>>
>>> Thanks,
>>> Jordan
>>>
>>> ________________________________
>>> From: Hdf-forum <hdf-forum-boun...@lists.hdfgroup.org> on behalf of Michael K. Edwards <m.k.edwa...@gmail.com>
>>> Sent: Wednesday, November 8, 2017 11:23 AM
>>> To: Miller, Mark C.
>>> Cc: HDF Users Discussion List
>>> Subject: Re: [Hdf-forum] Collective IO and filters
>>>
>>> Closer to 1000 ranks initially. There's a bug in handling the case
>>> where some of the writers don't have any data to contribute (because
>>> there's a dimension smaller than the number of ranks), which I have
>>> worked around like this:
>>>
>>> diff --git a/src/H5Dchunk.c b/src/H5Dchunk.c
>>> index af6599a..9522478 100644
>>> --- a/src/H5Dchunk.c
>>> +++ b/src/H5Dchunk.c
>>> @@ -1836,6 +1836,9 @@ H5D__create_chunk_mem_map_hyper(const H5D_chunk_map_t *fm)
>>>             /* Indicate that the chunk's memory space is shared */
>>>             chunk_info->mspace_shared = TRUE;
>>>         } /* end if */
>>> +       else if(H5SL_count(fm->sel_chunks)==0) {
>>> +           /* No chunks, because no local data; avoid HDassert(fm->m_ndims==fm->f_ndims) on null mem_space */
>>> +       } /* end else if */
>>>         else {
>>>             /* Get bounding box for file selection */
>>>             if(H5S_SELECT_BOUNDS(fm->file_space, file_sel_start, file_sel_end) < 0)
>>>
>>> That makes the assert go away.
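To make the zero-width case concrete, this is roughly what the selection looks like on a rank that owns no rows. It is a paraphrase for illustration only, not the actual PETSc code; select_my_rows, my_first_row, my_num_rows, and ncols are made-up names:

#include "hdf5.h"

/* Sketch of the selection pattern described above. */
static herr_t
select_my_rows(hid_t filespace, hsize_t my_first_row, hsize_t my_num_rows, hsize_t ncols)
{
    hsize_t offset[2] = { my_first_row, 0 };
    hsize_t count[2]  = { my_num_rows, ncols };   /* my_num_rows == 0 on some ranks */

    /* Every rank makes this call before the collective H5Dwrite().  Jordan's
     * tests had the empty ranks call H5Sselect_none(filespace) instead; the
     * zero-sized hyperslab below is what leaves fm->sel_chunks empty and,
     * without the patch above, trips the m_ndims==f_ndims assert. */
    return H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
}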
>>> Now I'm investigating a hang in the chunk redistribution logic in
>>> rank 0, with a backtrace that looks like this:
>>>
>>> #0  0x00007f4bd456a6c6 in psm2_mq_ipeek2 () from /lib64/libpsm2.so.2
>>> #1  0x00007f4bd5d3b341 in psm_progress_wait () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #2  0x00007f4bd5d3012d in MPID_Mprobe () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #3  0x00007f4bd5cbeeb4 in PMPI_Mprobe () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #4  0x00007f4bd81aadf6 in H5D__chunk_redistribute_shared_chunks (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0, local_chunk_array=0x17f0f80, local_chunk_array_num_entries=0x7ffdfb83d9f8) at H5Dmpio.c:3041
>>> #5  0x00007f4bd81a9696 in H5D__construct_filtered_io_info_list (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0, chunk_list=0x7ffdfb83daf0, num_entries=0x7ffdfb83db00) at H5Dmpio.c:2794
>>> #6  0x00007f4bd81a2d58 in H5D__link_chunk_filtered_collective_io (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0, dx_plist=0x16f7230) at H5Dmpio.c:1447
>>> #7  0x00007f4bd81a027d in H5D__chunk_collective_io (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, fm=0x17eeec0) at H5Dmpio.c:933
>>> #8  0x00007f4bd81a0968 in H5D__chunk_collective_write (io_info=0x7ffdfb83de60, type_info=0x7ffdfb83dde0, nelmts=104, file_space=0x17e2dc0, mem_space=0x17dc770, fm=0x17eeec0) at H5Dmpio.c:1018
>>> #9  0x00007f4bd7ce3d63 in H5D__write (dataset=0x17e0010, mem_type_id=216172782113783851, mem_space=0x17dc770, file_space=0x17e2dc0, dxpl_id=720575940379279384, buf=0x17d6240) at H5Dio.c:835
>>> #10 0x00007f4bd7ce181c in H5D__pre_write (dset=0x17e0010, direct_write=false, mem_type_id=216172782113783851, mem_space=0x17dc770, file_space=0x17e2dc0, dxpl_id=720575940379279384, buf=0x17d6240) at H5Dio.c:394
>>> #11 0x00007f4bd7ce0fd1 in H5Dwrite (dset_id=360287970189639680, mem_type_id=216172782113783851, mem_space_id=288230376151711749, file_space_id=288230376151711750, dxpl_id=720575940379279384, buf=0x17d6240) at H5Dio.c:318
>>>
>>> The other ranks have moved past this and are hanging here:
>>>
>>> #0  0x00007feb6e6546c6 in psm2_mq_ipeek2 () from /lib64/libpsm2.so.2
>>> #1  0x00007feb6fe25341 in psm_progress_wait () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #2  0x00007feb6fdd8975 in MPIC_Wait () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #3  0x00007feb6fdd918b in MPIC_Sendrecv () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #4  0x00007feb6fcf0fda in MPIR_Allreduce_pt2pt_rd_MV2 () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #5  0x00007feb6fcf48ef in MPIR_Allreduce_index_tuned_intra_MV2 () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #6  0x00007feb6fca1534 in MPIR_Allreduce_impl () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #7  0x00007feb6fca1b93 in PMPI_Allreduce () from /usr/mpi/gcc/mvapich2-2.2-hfi/lib/libmpi.so.12
>>> #8  0x00007feb72287c2a in H5D__mpio_array_gatherv (local_array=0x125f2d0, local_array_num_entries=0, array_entry_size=368, _gathered_array=0x7ffff083f1d8, _gathered_array_num_entries=0x7ffff083f1e8, nprocs=4, allgather=true, root=0, comm=-1006632952, sort_func=0x0) at H5Dmpio.c:479
>>> #9  0x00007feb7228cfb8 in H5D__link_chunk_filtered_collective_io (io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, fm=0x125d280, dx_plist=0x11cf240) at H5Dmpio.c:1479
>>> #10 0x00007feb7228a27d in H5D__chunk_collective_io (io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, fm=0x125d280) at H5Dmpio.c:933
>>> #11 0x00007feb7228a968 in H5D__chunk_collective_write (io_info=0x7ffff083f540, type_info=0x7ffff083f4c0, nelmts=74, file_space=0x12514e0, mem_space=0x124b450, fm=0x125d280) at H5Dmpio.c:1018
>>> #12 0x00007feb71dcdd63 in H5D__write (dataset=0x124e7d0, mem_type_id=216172782113783851, mem_space=0x124b450, file_space=0x12514e0, dxpl_id=720575940379279384, buf=0x1244e80) at H5Dio.c:835
>>> #13 0x00007feb71dcb81c in H5D__pre_write (dset=0x124e7d0, direct_write=false, mem_type_id=216172782113783851, mem_space=0x124b450, file_space=0x12514e0, dxpl_id=720575940379279384, buf=0x1244e80) at H5Dio.c:394
>>> #14 0x00007feb71dcafd1 in H5Dwrite (dset_id=360287970189639680, mem_type_id=216172782113783851, mem_space_id=288230376151711749, file_space_id=288230376151711750, dxpl_id=720575940379279384, buf=0x1244e80) at H5Dio.c:318
>>>
>>> (I'm currently running with this patch atop commit bf570b1, on an
>>> earlier theory that the crashing bug may have crept in after Jordan's
>>> big merge. I'll rebase on current develop, but I doubt that'll change
>>> much.)
>>>
>>> The hang may or may not be directly related to the workaround being a
>>> bit of a hack. I can set you up with full reproduction details if you
>>> like; I seem to be getting some traction on it, but more eyeballs are
>>> always good, especially if they're better set up for MPI tracing than
>>> I am right now.
>>>
>>>
>>> On Wed, Nov 8, 2017 at 8:48 AM, Miller, Mark C. <mille...@llnl.gov> wrote:
>>>> Hi Michael,
>>>>
>>>> I have not tried this in parallel yet. That said, what scale are you
>>>> trying to do this at? 1000 ranks or 1,000,000 ranks? Something in
>>>> between?
>>>>
>>>> My understanding is that there are some known scaling issues out
>>>> past maybe 10,000 ranks. Not heard of outright assertion failures
>>>> there though.
>>>>
>>>> Mark
>>>>
>>>> "Hdf-forum on behalf of Michael K. Edwards" wrote:
>>>>
>>>> I'm trying to write an HDF5 file with dataset compression from an MPI
>>>> job. (Using PETSc 3.8 compiled against MVAPICH2, if that matters.)
>>>> After running into the "Parallel I/O does not support filters yet"
>>>> error message in release versions of HDF5, I have turned to the
>>>> develop branch. Clearly there has been much work towards collective
>>>> filtered IO in the run-up to a 1.11 (1.12?) release; equally clearly
>>>> it is not quite ready for prime time yet. So far I've encountered a
>>>> livelock scenario with ZFP, reproduced it with SZIP, and, with no
>>>> filters at all, obtained this nifty error message:
>>>>
>>>> ex12: H5Dchunk.c:1849: H5D__create_chunk_mem_map_hyper: Assertion
>>>> `fm->m_ndims==fm->f_ndims' failed.
>>>>
>>>> Has anyone on this list been able to write parallel HDF5 using a
>>>> recent state of the develop branch, with or without filters
>>>> configured?
>>>>
>>>> Thanks,
>>>> - Michael
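For anyone who wants to poke at this without PETSc in the way, the pattern being exercised boils down to a chunked, filtered dataset written through a collective transfer property list. A minimal sketch, with names and sizes invented for illustration and deflate standing in for the ZFP/SZIP plugins:

/* minimal-collective-filtered.c -- sketch only; build with an MPI compiler
 * wrapper against a parallel HDF5 build. */
#include <stdlib.h>
#include <mpi.h>
#include "hdf5.h"

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Open the file for parallel access */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("filtered.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* One chunked, filtered 1-D dataset, 100 elements per rank */
    hsize_t per_rank = 100, dims = per_rank * (hsize_t)nprocs, chunk = per_rank;
    hid_t filespace = H5Screate_simple(1, &dims, NULL);
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, &chunk);
    H5Pset_deflate(dcpl, 6);                       /* any filter will do */
    hid_t dset = H5Dcreate2(file, "x", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Each rank writes its own contiguous slice, collectively */
    hsize_t start = per_rank * (hsize_t)rank;
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL, &per_rank, NULL);
    hid_t memspace = H5Screate_simple(1, &per_rank, NULL);
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    double *buf = malloc(per_rank * sizeof(double));
    for (hsize_t i = 0; i < per_rank; i++)
        buf[i] = (double)(start + i);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    free(buf);
    H5Pclose(dxpl); H5Sclose(memspace); H5Dclose(dset);
    H5Pclose(dcpl); H5Sclose(filespace); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}

A release build errors out of this with the "Parallel I/O does not support filters yet" message quoted earlier in the thread; the develop branch instead goes down the new filtered collective path.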
_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5