Hi Quincey,
I can double one dimension of my chunk size (at the cost of really slow I/O),
but if I double all of them I get errors like these:
HDF5-DIAG: Error detected in HDF5 (1.8.4) MPI-process 4:
  #000: H5D.c line 171 in H5Dcreate2(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #001: H5Dint.c line 428 in H5D_create_named(): unable to create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #002: H5L.c line 1639 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #003: H5L.c line 1862 in H5L_create_real(): can't insert link
    major: Symbol table
    minor: Unable to insert object
  #004: H5Gtraverse.c line 877 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #005: H5Gtraverse.c line 703 in H5G_traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #006: H5L.c line 1685 in H5L_link_cb(): unable to create object
    major: Object header
    minor: Unable to initialize object
  #007: H5O.c line 2677 in H5O_obj_create(): unable to open object
    major: Object header
    minor: Can't open object
  #008: H5Doh.c line 296 in H5O_dset_create(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #009: H5Dint.c line 1030 in H5D_create(): unable to construct layout information
    major: Dataset
    minor: Unable to initialize object
  #010: H5Dchunk.c line 420 in H5D_chunk_construct(): chunk size must be <= maximum dimension size for fixed-sized dimensions
    major: Dataset
    minor: Unable to initialize object
I am currently doing test runs on 16 cores on my local machine, because the
large machine I normally run jobs on is unavailable at the moment and has a
queueing system rather unsuited to quick test runs. Could this be an artefact
of running on such a small number of cores? Although I *think* I tried this
before and got the same type of error on several thousand cores as well.
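For what it's worth, the setup I'm experimenting with looks roughly like the
simplified serial sketch below (all names and dimensions are placeholders, not
my actual code; the real runs of course create the file through the MPI-IO
file access property list). Clamping the doubled chunk dimensions to the
dataset extent with min() at least avoids the "chunk size must be <= maximum
dimension size" failure from the error stack above:

program chunk_clamp_sketch
  ! Sketch only: names and sizes are placeholders, and this is serial just to
  ! show the chunk-size clamping.
  use hdf5
  implicit none

  integer(hsize_t), dimension(3) :: dims_local  = (/20, 20, 20/)  ! local block per rank
  integer(hsize_t), dimension(3) :: dims_global = (/40, 40, 80/)  ! full dataset extent
  integer(hsize_t), dimension(3) :: chunk_dims
  integer(hid_t) :: file_id, filespace, dcpl_id, dset_id
  integer :: hdferr

  call h5open_f(hdferr)
  call h5fcreate_f("chunk_test.h5", H5F_ACC_TRUNC_F, file_id, hdferr)

  ! Double the chunk in every direction, but clamp it to the dataset extent so
  ! a chunk can never exceed a fixed dataset dimension (the condition
  ! H5D_chunk_construct() complains about above).
  chunk_dims = min(2_hsize_t * dims_local, dims_global)

  call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, hdferr)
  call h5pset_chunk_f(dcpl_id, 3, chunk_dims, hdferr)

  call h5screate_simple_f(3, dims_global, filespace, hdferr)
  call h5dcreate_f(file_id, "OutArray", H5T_NATIVE_DOUBLE, filespace, &
                   dset_id, hdferr, dcpl_id)

  call h5dclose_f(dset_id, hdferr)
  call h5sclose_f(filespace, hdferr)
  call h5pclose_f(dcpl_id, hdferr)
  call h5fclose_f(file_id, hdferr)
  call h5close_f(hdferr)
end program chunk_clamp_sketch

Whether a chunk clamped like this still reduces the number of chunks enough to
shrink the MPI_Gather buffer is of course another matter.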
Kind regards,
Stefan Frijters
________________________________________
From: [email protected] [[email protected]] On Behalf
Of Quincey Koziol [[email protected]]
Sent: 24 March 2010 16:28
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] HDF5 causes Fatal error in MPI_Gather
Hi Stefan,
On Mar 24, 2010, at 10:10 AM, Stefan Frijters wrote:
> Hi Quincey,
>
> Thanks for the quick response. Currently, each core is handling its
> datasets with a chunk size equal to the size of the local data (the dims
> parameter in h5pset_chunk_f is equal to the dims parameter in
> h5dwrite_f) because the local arrays are not that large anyway (in the
> order of 20x20x20 reals), so if I understand things correctly I'm
> already using maximum chunk size.
No, you don't have to make them the same size, since the collective I/O
should stitch them back together anyway. Try doubling the dimensions on your
chunks.
> Do you have an idea why it doesn't crash the first time I try to do it
> though? It's a different array, but of the same size and datatype as the
> second. As far as I can see I'm closing all used handles at the end of
> my function at least.
Hmm, I'm not certain...
Quincey
> Kind regards,
>
> Stefan Frijters
>
>> Hi Stefan,
>>
>> On Mar 24, 2010, at 3:11 AM, Stefan Frijters wrote:
>>
>>> Dear all,
>>>
>>> Recently, I've run into a problem with my parallel HDF5 writes. My
>>> program works fine on 8k cores, but when I run it on 16k cores it
>>> crashes when writing a data file through h5dwrite_f(...).
>>> File writes go through one function in the code, so it always uses the
>>> same code, but for some reason I don't understand it writes one file
>>> without problems, but the second one throws the following error message:
>>>
>>> Abort(1) on node 0 (rank 0 in comm 1140850688): Fatal error in
>>> MPI_Gather: Invalid buffer pointer, error stack:
>>> MPI_Gather(758): MPI_Gather(sbuf=0xa356f400, scount=16000, MPI_BYTE,
>>> rbuf=(nil), rcount=16000, MPI_BYTE, root=0, comm=0x84000003) failed
>>> MPI_Gather(675): Null buffer pointer
>>>
>>> I've been looking through the HDF5 source code and it only seems to call
>>> MPI_Gather in one place, in the function H5D_obtain_mpio_mode. In that
>>> function HDF tries to allocate a receive buffer using
>>>
>>> recv_io_mode_info = (uint8_t *)H5MM_malloc(total_chunks * mpi_size);
>>>
>>> This call returns the NULL pointer seen as rbuf=(nil) instead of a valid
>>> pointer, so to me it seems it's HDF5 causing the problem rather than MPI.
>>>
>>> This problem occurs in both collective and independent IO mode.
>>>
>>> Do you have any idea what might be causing this problem, or how to
>>> resolve it? I'm not sure what kind of other information you might need,
>>> but I'll do my best to supply it, if you need any.
>>
>> This is a scalability problem we are aware of and are working to address,
>> but in the meanwhile, can you increase the size of your chunks for your
>> dataset(s)? (That will reduce the number of chunks and the size of the
>> buffer being allocated)
>>
>> Quincey
>>
>
>
>
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org