Hi Stefan,
On Mar 24, 2010, at 11:06 AM, Frijters, S.C.J. wrote:
> Hi Quincey,
>
> I can double one dimension on my chunk size (at the cost of really slow IO),
> but if I double them all I get errors like these:
>
> HDF5-DIAG: Error detected in HDF5 (1.8.4) MPI-process 4:
>   #000: H5D.c line 171 in H5Dcreate2(): unable to create dataset
>     major: Dataset
>     minor: Unable to initialize object
>   #001: H5Dint.c line 428 in H5D_create_named(): unable to create and link to dataset
>     major: Dataset
>     minor: Unable to initialize object
>   #002: H5L.c line 1639 in H5L_link_object(): unable to create new link to object
>     major: Links
>     minor: Unable to initialize object
>   #003: H5L.c line 1862 in H5L_create_real(): can't insert link
>     major: Symbol table
>     minor: Unable to insert object
>   #004: H5Gtraverse.c line 877 in H5G_traverse(): internal path traversal failed
>     major: Symbol table
>     minor: Object not found
>   #005: H5Gtraverse.c line 703 in H5G_traverse_real(): traversal operator failed
>     major: Symbol table
>     minor: Callback failed
>   #006: H5L.c line 1685 in H5L_link_cb(): unable to create object
>     major: Object header
>     minor: Unable to initialize object
>   #007: H5O.c line 2677 in H5O_obj_create(): unable to open object
>     major: Object header
>     minor: Can't open object
>   #008: H5Doh.c line 296 in H5O_dset_create(): unable to create dataset
>     major: Dataset
>     minor: Unable to initialize object
>   #009: H5Dint.c line 1030 in H5D_create(): unable to construct layout information
>     major: Dataset
>     minor: Unable to initialize object
>   #010: H5Dchunk.c line 420 in H5D_chunk_construct(): chunk size must be <= maximum dimension size for fixed-sized dimensions
>     major: Dataset
>     minor: Unable to initialize object
>
> I am currently doing test runs on my local machine on 16 cores, because the
> large machine I run jobs on is unavailable at the moment and has a queueing
> system rather unsuited to quick test runs, so maybe this is an artefact of
> running on such a small number of cores? Although I *think* I tried this
> before on several thousand cores and got the same type of error there too.
You seem to have increased the chunk dimensions beyond the dataset
dimensions. What chunk size and dataspace size are you using?
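
For example, a minimal C-API sketch (file_id and the sizes here are
illustrative, not taken from your code): the create fails unless every
chunk dimension stays within the corresponding fixed dataset dimension:

  hsize_t dset_dims[3]  = {40, 40, 40};   /* fixed dataset extent             */
  hsize_t chunk_dims[3] = {40, 40, 40};   /* e.g. local 20^3 doubled per dim  */
  for (int i = 0; i < 3; i++)             /* clamp chunk dims to dataset dims */
      if (chunk_dims[i] > dset_dims[i])
          chunk_dims[i] = dset_dims[i];
  hid_t space = H5Screate_simple(3, dset_dims, NULL);  /* NULL maxdims => fixed-size dims */
  hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
  H5Pset_chunk(dcpl, 3, chunk_dims);
  hid_t dset  = H5Dcreate2(file_id, "data", H5T_NATIVE_DOUBLE, space,
                           H5P_DEFAULT, dcpl, H5P_DEFAULT);
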
Quincey
>
> Kind regards,
>
> Stefan Frijters
>
> ________________________________________
> From: [email protected] [[email protected]] On
> Behalf Of Quincey Koziol [[email protected]]
> Sent: 24 March 2010 16:28
> To: HDF Users Discussion List
> Subject: Re: [Hdf-forum] HDF5 causes Fatal error in MPI_Gather
>
> Hi Stefan,
>
> On Mar 24, 2010, at 10:10 AM, Stefan Frijters wrote:
>
>> Hi Quincey,
>>
>> Thanks for the quick response. Currently, each core is handling its
>> datasets with a chunk size equal to the size of the local data (the dims
> parameter in h5pset_chunk_f is equal to the dims parameter in
>> h5dwrite_f) because the local arrays are not that large anyway (in the
>> order of 20x20x20 reals), so if I understand things correctly I'm
>> already using maximum chunk size.
>
> No, you don't have to make them the same size, since the collective
> I/O should stitch them back together anyway. Try doubling the dimensions on
> your chunks.
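>
> For illustration, a minimal C-API sketch (handle names like dcpl, dxpl,
> filespace, memspace, start, and buf are placeholders, and your code is
> Fortran, but the idea is the same): the chunk dimensions and each
> process's selection are independent:
>
>   hsize_t local[3] = {20, 20, 20};  /* block each rank writes           */
>   hsize_t chunk[3] = {40, 40, 40};  /* 2x per dim: one chunk spans the  */
>   H5Pset_chunk(dcpl, 3, chunk);     /* blocks of 8 ranks                */
>   H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, local, NULL);
>   H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);  /* collective transfer */
>   H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);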
>
>> Do you have an idea why it doesn't crash the first time I try to do it,
>> though? It's a different array, but of the same size and datatype as the
>> second. As far as I can see, I'm closing all the handles I use at the end
>> of my function.
>
> Hmm, I'm not certain...
>
> Quincey
>
>> Kind regards,
>>
>> Stefan Frijters
>>
>>> Hi Stefan,
>>>
>>> On Mar 24, 2010, at 3:11 AM, Stefan Frijters wrote:
>>>
>>>> Dear all,
>>>>
>>>> Recently, I've run into a problem with my parallel HDF5 writes. My
>>>> program works fine on 8k cores, but when I run it on 16k cores it
>>>> crashes when writing a data file through h5dwrite_f(...).
>>>> All file writes go through a single function, so the same code path is
>>>> used every time; yet, for some reason I don't understand, it writes one
>>>> file without problems while the second one throws the following error
>>>> message:
>>>>
>>>> Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error in MPI_Gather: Invalid buffer pointer, error stack:
>>>> MPI_Gather(758): MPI_Gather(sbuf=0xa356f400, scount=16000, MPI_BYTE, rbuf=(nil), rcount=16000, MPI_BYTE, root=0, comm=0x84000003) failed
>>>> MPI_Gather(675): Null buffer pointer
>>>>
>>>> I've been looking through the HDF5 source code, and it only seems to call
>>>> MPI_Gather in one place, in the function H5D_obtain_mpio_mode. In that
>>>> function, HDF5 tries to allocate a receive buffer using
>>>>
>>>> recv_io_mode_info = (uint8_t *)H5MM_malloc(total_chunks * mpi_size);
>>>>
>>>> which then returns the null pointer seen in rbuf=(nil) instead of a
>>>> valid pointer. Thus, it seems to me that HDF5 is causing the problem
>>>> and not MPI.
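>>>>
>>>> For a rough sense of scale (assuming roughly one chunk per rank, as
>>>> with chunk size == local size, so that total_chunks ~= mpi_size), the
>>>> receive buffer on rank 0 is about mpi_size * mpi_size bytes:
>>>>
>>>>   16384 cores: 16384 * 16384 bytes ~= 256 MB
>>>>    8192 cores:  8192 *  8192 bytes ~=  64 MB
>>>>
>>>> which might explain why the 8k-core runs survive while the 16k-core
>>>> runs fail.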
>>>>
>>>> This problem occurs in both collective and independent IO mode.
>>>>
>>>> Do you have any idea what might be causing this problem, or how to
>>>> resolve it? I'm not sure what other information you might need, but
>>>> I'll do my best to supply whatever would help.
>>>
>>> This is a scalability problem we are aware of and are working to address,
>>> but in the meantime, can you increase the size of your chunks for your
>>> dataset(s)? (That will reduce the number of chunks and the size of the
>>> buffer being allocated.)
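>>> For example, with 3-D data, doubling all three chunk dimensions cuts
>>> total_chunks, and hence the total_chunks * mpi_size buffer, by a
>>> factor of 8.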
>>>
>>> Quincey
>>>
>>
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org