Hi all,

Chris Calderon, a user at NERSC, is receiving the errors at the bottom
of the email during the following scenario:

- a subset of 40 MPI tasks are each opening their own HDF5 file with
MPI-IO in collective mode with the MPI_COMM_SELF communicator
- each task writes about 20,000 small datasets totaling 10GB per file

It's worth noting that we don't intend to use MPI-IO in collective mode, so
we don't really need to fix this error to make the code operational,
but we'd like to understand why the error occurred. At the lowest
level, the error is "can't convert from size to size_i". Looking up
the relevant code, I found:

   size_i = (int)size;
   if((hsize_t)size_i != size)
       HGOTO_ERROR...

So my guess is that the offsets at some point become large enough to
cause an int32 overflow. (Each file is about 10GB total, so the
overflow probably occurs around the 8GB mark since 2 billion elements
times 4 bytes per float = 8GB.) Is this a known bug in the MPI-IO VFD?
If my guess is right, the bug will also affect independent mode, but
another workaround is for us to use the MPI-POSIX VFD, which should
bypass this problem.
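
As a sanity check on the 8GB estimate, the arithmetic can be written out
as below. This is only a sketch under my assumption that the quantity
overflowing is a count of 4-byte floats held in a signed 32-bit int; the
function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: number of bytes covered by the first element count
 * that no longer fits in a signed 32-bit int (INT32_MAX + 1 = 2^31),
 * given the size of one element in bytes. */
uint64_t int32_overflow_mark_bytes(uint64_t bytes_per_element)
{
    uint64_t first_bad_count = (uint64_t)INT32_MAX + 1u; /* 2^31 elements */
    return first_bad_count * bytes_per_element;
}
```

For 4-byte floats this gives 8,589,934,592 bytes, i.e. 8 GiB, which is
below the roughly 10GB size of each file.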

I looked into using the CORE VFD per Mark Miller's suggestion in an earlier
thread, but the problem is that the 10GB of data will not fit into memory,
and I didn't see any API calls for requesting a "dump to file" before the
file close.

Thanks
Mark

----

HDF5-DIAG: Error detected in HDF5 (1.8.4) MPI-process 16:
 #000: H5Dio.c line 266 in H5Dwrite(): can't write data
  major: Dataset
  minor: Write failed
 #001: H5Dio.c line 578 in H5D_write(): can't write data
  major: Dataset
  minor: Write failed
 #002: H5Dmpio.c line 552 in H5D_contig_collective_write(): couldn't finish shared collective MPI-IO
  major: Low-level I/O
  minor: Write failed
 #003: H5Dmpio.c line 1586 in H5D_inter_collective_io(): couldn't finish collective MPI-IO
  major: Low-level I/O
  minor: Can't get value
 #004: H5Dmpio.c line 1632 in H5D_final_collective_io(): optimized write failed
  major: Dataset
  minor: Write failed
 #005: H5Dmpio.c line 334 in H5D_mpio_select_write(): can't finish collective parallel write
  major: Low-level I/O
  minor: Write failed
 #006: H5Fio.c line 167 in H5F_block_write(): file write failed
  major: Low-level I/O
  minor: Write failed
 #007: H5FDint.c line 185 in H5FD_write(): driver write request failed
  major: Virtual File Layer
  minor: Write failed
 #008: H5FDmpio.c line 1726 in H5FD_mpio_write(): can't convert from size to size_i
  major: Internal error (too specific to document in detail)
  minor: Out of range
