Hi Mark,
Sorry for the delay in replying...
On May 18, 2010, at 11:31 AM, Mark Howison wrote:
> Hi all,
>
> Chris Calderon, a user at NERSC, is receiving the errors at the bottom
> of the email during the following scenario:
>
> - a subset of 40 MPI tasks are each opening their own HDF5 file with
> MPI-IO in collective mode with the MPI_COMM_SELF communicator
> - each task writes about 20,000 small datasets totaling 10GB per file
>
> It's worth noting that we don't intend to use MPI-IO in collective mode, so
> we don't really need to fix this error to make the code operational,
> but we'd like to understand why the error occurred. At the lowest
> level, the error is "can't convert from size to size_i" and looking up
> the relevant code, I found:
>
> size_i = (int)size;
> if((hsize_t)size_i != size)
> HGOTO_ERROR...
>
> So my guess is that the offsets at some point become large enough to
> cause an int32 overflow. (Each file is about 10GB total, so the
> overflow probably occurs around the 8GB mark since 2 billion elements
> times 4 bytes per float = 8GB.) Is this a known bug in the MPI-IO VFD?
> This suggests that the bug will also affect independent mode, but
> another workaround is for us to use the MPI-POSIX VFD, which should
> bypass this problem.
There is a limitation in the MPI standard which specifies that an 'int'
type must be used for certain file operations, but we may be able to relax that
for the MPI-POSIX driver. Could you give me the line number for the code
snippet above? I'll take a look and see if it really needs to be there.
Thanks,
Quincey
> I looked into using the CORE VFD per Mark Miller's suggestion in an earlier
> thread, but the problem is that the 10GB of data will not fit into memory,
> and I didn't see any API calls for requesting a "dump to file" before the
> file close.
>
> Thanks,
> Mark
>
> ----
>
> HDF5-DIAG: Error detected in HDF5 (1.8.4) MPI-process 16:
> #000: H5Dio.c line 266 in H5Dwrite(): can't write data
> major: Dataset
> minor: Write failed
> #001: H5Dio.c line 578 in H5D_write(): can't write data
> major: Dataset
> minor: Write failed
> #002: H5Dmpio.c line 552 in H5D_contig_collective_write(): couldn't
> finish shared collective MPI-IO
> major: Low-level I/O
> minor: Write failed
> #003: H5Dmpio.c line 1586 in H5D_inter_collective_io(): couldn't
> finish collective MPI-IO
> major: Low-level I/O
> minor: Can't get value
> #004: H5Dmpio.c line 1632 in H5D_final_collective_io(): optimized write failed
> major: Dataset
> minor: Write failed
> #005: H5Dmpio.c line 334 in H5D_mpio_select_write(): can't finish
> collective parallel write
> major: Low-level I/O
> minor: Write failed
> #006: H5Fio.c line 167 in H5F_block_write(): file write failed
> major: Low-level I/O
> minor: Write failed
> #007: H5FDint.c line 185 in H5FD_write(): driver write request failed
> major: Virtual File Layer
> minor: Write failed
> #008: H5FDmpio.c line 1726 in H5FD_mpio_write(): can't convert from
> size to size_i
> major: Internal error (too specific to document in detail)
> minor: Out of range
>
> _______________________________________________
> Hdf-forum is for HDF software users discussion.
> [email protected]
> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org