Thank you for the explanation. That's consistent with what I see when I add a debug printf into H5D__construct_filtered_io_info_list(). So I'm now looking into the filter situation. It's possible that the H5Z-blosc glue is mishandling the case where the compressed data is larger than the uncompressed data.
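To make sure I'm reading the compress path correctly, here is roughly what I expect that branch of the filter callback to do when the data does not shrink (a sketch only, using the standard H5Z_func_t signature; the cd_values layout and the names are mine for illustration, not the actual H5Z-blosc code):

#include <stdlib.h>
#include <hdf5.h>
#include <blosc.h>

/* Compress-side sketch of an H5Z_func_t callback.  The cd_values layout
 * (typesize, compression level, shuffle flag) is illustrative; the real
 * glue code may order these differently. */
static size_t
blosc_compress_sketch(unsigned flags, size_t cd_nelmts, const unsigned cd_values[],
                      size_t nbytes, size_t *buf_size, void **buf)
{
    if (flags & H5Z_FLAG_REVERSE)
        return 0;                        /* decompress path omitted here */

    size_t typesize = (cd_nelmts > 2) ? cd_values[2] : 1;
    int    clevel   = (cd_nelmts > 4) ? (int)cd_values[4] : 5;
    int    shuffle  = (cd_nelmts > 5) ? (int)cd_values[5] : 1;

    /* Cap the output at the input size: if blosc cannot shrink the chunk,
     * blosc_compress() returns 0 instead of handing back a bigger buffer. */
    void *outbuf = malloc(nbytes);
    if (outbuf == NULL)
        return 0;

    int status = blosc_compress(clevel, shuffle, typesize,
                                nbytes, *buf, outbuf, nbytes);
    if (status <= 0) {
        /* 0: the compressed result would be larger than the input;
         * <0: internal blosc error.  Returning 0 makes the filter "fail";
         * for an optional filter HDF5 then stores the chunk unfiltered and
         * records in the chunk's filter mask that the filter was skipped. */
        free(outbuf);
        return 0;
    }

    free(*buf);
    *buf      = outbuf;
    *buf_size = nbytes;
    return (size_t)status;               /* number of bytes actually stored */
}

If the glue instead hands back an oversized buffer, or fails without the filter having been registered as optional, I could see the later "unfilter chunk for modifying" step tripping over it, which would match the stack below.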
About to write 12 of 20
About to write 0 of 20
About to write 0 of 20
About to write 8 of 20
Rank 0 selected 12 of 20
Rank 1 selected 8 of 20
HDF5-DIAG: Error detected in HDF5 (1.11.0) MPI-process 0:
  #000: H5Dio.c line 319 in H5Dwrite(): can't prepare for writing data
    major: Dataset
    minor: Write failed
  #001: H5Dio.c line 395 in H5D__pre_write(): can't write data
    major: Dataset
    minor: Write failed
  #002: H5Dio.c line 836 in H5D__write(): can't write data
    major: Dataset
    minor: Write failed
  #003: H5Dmpio.c line 1019 in H5D__chunk_collective_write(): write error
    major: Dataspace
    minor: Write failed
  #004: H5Dmpio.c line 934 in H5D__chunk_collective_io(): couldn't finish filtered linked chunk MPI-IO
    major: Low-level I/O
    minor: Can't get value
  #005: H5Dmpio.c line 1474 in H5D__link_chunk_filtered_collective_io(): couldn't process chunk entry
    major: Dataset
    minor: Write failed
  #006: H5Dmpio.c line 3278 in H5D__filtered_collective_chunk_entry_io(): couldn't unfilter chunk for modifying
    major: Data filters
    minor: Filter operation failed
  #007: H5Z.c line 1256 in H5Z_pipeline(): filter returned failure during read
    major: Data filters
    minor: Read failed

On Thu, Nov 9, 2017 at 1:02 PM, Jordan Henderson <jhender...@hdfgroup.org> wrote:

> For the purpose of collective I/O it is true that all ranks must call
> H5Dwrite() so that they can participate in those collective operations that
> are necessary (the file space re-allocation and so on). However, even though
> they called H5Dwrite() with a valid memspace, the fact that they have a NONE
> selection in the given file space should cause their chunk-file mapping
> struct (see lines 357-385 of H5Dpkg.h for the struct's definition and the
> code for H5D__link_chunk_filtered_collective_io() to see how it uses this
> built up list of chunks selected in the file) to contain no entries in the
> "fm->sel_chunks" field. That alone should mean that during the chunk
> redistribution, they will not actually send anything at all to any of the
> ranks. They only participate there for the sake that, were the method of
> redistribution modified, ranks which previously had no chunks selected could
> potentially be given some chunks to work on.
>
> For all practical purposes, every single chunk_entry seen in the list from
> rank 0's perspective should be a valid I/O caused by some rank writing some
> positive amount of bytes to the chunk. On rank 0's side, you should be able
> to check the io_size field of each of the chunk_entry entries and see how
> big the I/O is from the "original_owner" to that chunk. If any of these are
> 0, something is likely very wrong. If that is indeed the case, you could
> likely pull a hacky workaround by manually removing them from the list, but
> I'd be more concerned about the root of the problem if there are zero-size
> I/O chunk_entry entries being added to the list.
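For reference, the write pattern that produces the output above is essentially the following (a minimal sketch, not my actual test: the file and dataset names and sizes are made up, H5Pset_deflate stands in for the registered blosc filter, and which ranks write 12 and 8 elements is illustrative; the relevant part is that the unselected ranks still call H5Dwrite() collectively with a NONE selection in the file space):

#include <stdlib.h>
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Open the file with the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("filtered.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Chunked, filtered dataset of 20 elements (deflate as a stand-in). */
    hsize_t dims[1] = {20}, chunk[1] = {5};
    hid_t filespace = H5Screate_simple(1, dims, NULL);
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    H5Pset_deflate(dcpl, 6);
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_INT, filespace,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Rank 0 writes 12 elements, rank 3 writes 8, everyone else writes none. */
    hsize_t start[1] = {0}, count[1] = {0};
    if (rank == 0)      { start[0] = 0;  count[0] = 12; }
    else if (rank == 3) { start[0] = 12; count[0] = 8;  }

    /* Every rank passes a valid memspace; ranks with nothing to write
     * select NONE in both the memory and the file dataspace. */
    hsize_t memdims[1] = { count[0] > 0 ? count[0] : 1 };
    hid_t memspace = H5Screate_simple(1, memdims, NULL);
    if (count[0] > 0) {
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    } else {
        H5Sselect_none(memspace);
        H5Sselect_none(filespace);
    }

    int *wbuf = malloc(memdims[0] * sizeof(int));
    for (hsize_t i = 0; i < count[0]; i++)
        wbuf[i] = rank;

    /* Collective transfer: all ranks call H5Dwrite(), selected or not. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_INT, memspace, filespace, dxpl, wbuf);

    free(wbuf);
    H5Pclose(dxpl); H5Sclose(memspace); H5Dclose(dset);
    H5Pclose(dcpl); H5Sclose(filespace); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}

Run with something like "mpiexec -n 4 ./a.out" against a build that has the parallel filtered-write support (the 1.11.0 develop branch in my case).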