Hi Paul,
On May 28, 2010, at 1:58 AM, Paul Hilscher wrote:
> Dear Elena and John,
>
> thanks a lot for your reply and for your explanation of the problem. With
> your help I was able to find the problem. I wrote an attribute using
>
> H5LTset_attribute_string(infoGroup, ".", "Description", description.c_str())
>
> but description (a std::string) was only set by the master process and was
> an empty string on all other processes. Once I removed this line, or set
> the string to the same value on all processes, pHDF5 worked perfectly fine.
> Unfortunately this error broke the whole HDF5 subsystem: not only could
> pHDF5 not close the file, but the saved data itself was corrupted (sometimes
> I was still able to open & read the file).
Ah, this would be expected, given the current restrictions on metadata
modifications in parallel. Basically, as John describes below, all changes to
metadata must be identical on all processes.
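For what it's worth, here is a minimal, untested sketch of one way to keep
that attribute write consistent: broadcast the string from rank 0 so every
process passes the identical value. The names infoGroup and description are
taken from your snippet; everything else is just illustration.

    #include <mpi.h>
    #include <hdf5.h>
    #include <hdf5_hl.h>
    #include <string>
    #include <vector>

    // Ensure every rank writes the identical "Description" attribute value.
    void write_description(hid_t infoGroup, std::string description,
                           MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);

        // Broadcast the string length, then its characters, from rank 0.
        int len = (rank == 0) ? (int)description.size() : 0;
        MPI_Bcast(&len, 1, MPI_INT, 0, comm);

        std::vector<char> buf(len + 1, '\0');
        if (rank == 0)
            description.copy(buf.data(), len);
        MPI_Bcast(buf.data(), len + 1, MPI_CHAR, 0, comm);

        // All processes now perform the same metadata modification.
        H5LTset_attribute_string(infoGroup, ".", "Description", buf.data());
    }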
> I had a similar problem with H5TBappend_records (for an HDF5 table) when
> only the master process wrote to the table (e.g. via if (rank == 0) {});
> I got a similar error message about dirty caches. So when I want to write,
> e.g., the total number of particles to the table, I have to use
> MPI_Allreduce(... MPI_SUM ...) instead of the more efficient
> MPI_Reduce(... 0, MPI_SUM ...), and then let all processes call
> H5TBappend_records.
Similar issue - appending records changes the metadata about the
dataset.
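The Allreduce-then-collective-append pattern you describe would look roughly
like the untested sketch below; the table name "Totals" and its single
long long field are made up for illustration, and the table is assumed to
already exist in the file:

    #include <mpi.h>
    #include <hdf5.h>
    #include <hdf5_hl.h>

    // Reduce onto all ranks, then let every rank append the identical record.
    void append_total(hid_t file_id, long long local_particles)
    {
        long long total = 0;
        MPI_Allreduce(&local_particles, &total, 1,
                      MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);

        const size_t offsets[1] = { 0 };
        const size_t sizes[1]   = { sizeof(long long) };

        // All processes append the same data, so the metadata update
        // (extending the table) is identical everywhere.
        H5TBappend_records(file_id, "Totals", 1, sizeof(long long),
                           offsets, sizes, &total);
    }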
> Do you know of a way to tell pHDF5 that only the master process may write
> to the table (e.g. something similar to H5Sselect_none)?
For this case, I think you should be able to call H5Dset_extent() in
the other processes, to extend the table dataset.
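Something along the lines of the sketch below, that is: rank 0 appends the
record while the other ranks make the matching metadata change by extending
the underlying dataset. This is untested, and I have not verified that it
yields identical metadata streams on all processes; the name "Totals" and the
single long long field are again only for illustration.

    // (uses the same <mpi.h>/<hdf5.h>/<hdf5_hl.h> headers as above)
    // Untested sketch: master appends, the others only extend the extent.
    void append_on_master_only(hid_t file_id, long long total, int rank)
    {
        if (rank == 0) {
            const size_t offsets[1] = { 0 };
            const size_t sizes[1]   = { sizeof(long long) };
            H5TBappend_records(file_id, "Totals", 1, sizeof(long long),
                               offsets, sizes, &total);
        } else {
            // Mirror the dataset extension that H5TBappend_records
            // performs on rank 0.
            hid_t   dset  = H5Dopen2(file_id, "Totals", H5P_DEFAULT);
            hid_t   space = H5Dget_space(dset);
            hsize_t dims[1];
            H5Sget_simple_extent_dims(space, dims, NULL);
            dims[0] += 1;
            H5Dset_extent(dset, dims);
            H5Sclose(space);
            H5Dclose(dset);
        }
    }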
Quincey
> thanks again & best wishes,
>
> Paul Hilscher
>
>
> On Wed, Apr 21, 2010 at 5:51 AM, John R. Mainzer <[email protected]> wrote:
> Hi Paul,
>
> From the error message you provided, I think I can tell you
> the proximate cause of the failure.
>
> Briefly, HDF5 maintains metadata consistency across all processes
> by requiring all processes to perform all operations that modify
> metadata collectively so that all processes see the same stream of
> dirty metadata.
>
> This in turn allows us to use only the process zero metadata
> cache to write dirty metadata to file -- all other processes are
> required to hold dirty metadata in cache until informed by the
> process 0 metadata cache that the piece of dirty metadata in question
> has been written to file and is now clean.
>
> As a sanity check, the non process 0 metadata caches verify that
> all the entries listed in a "these entries are now clean" message
> are both in cache and marked dirty upon receipt of the message.
>
> It is this sanity check that is failing and causing your crash
> on shutdown. It implies that process 0 thinks that some piece of
> metadata is dirty, but at least one other process thinks the entry
> is clean.
>
> I can think of two ways for this to happen:
>
> 1) a bug in the HDF5 library.
>
> 2) a user program that either:
>
> a) makes a library call that modifies metadata on
> some but not all processes, or
>
> b) makes library calls that modify metadata on all processes
> but in different order on different processes.
>
> For a list of library calls that must be called collectively, please
> see:
>
> http://www.hdfgroup.org/HDF5/faq/parallel-apis.html#coll
>
> Unless the above points to an obvious solution, please send us
> the sample code that Elena mentioned. If there is a bug here, I'd
> like to squash it.
>
> Best regards,
>
> John Mainzer
>
> >From [email protected] Tue Apr 20 08:50:50 2010
> >From: Elena Pourmal <[email protected]>
> >Date: Tue, 20 Apr 2010 08:52:59 -0500
> >To: HDF Users Discussion List <[email protected]>
> >Subject: Re: [Hdf-forum] Infinite closing loop with (parallel) HDF-1.8.4-1
> >Reply-To: HDF Users Discussion List <[email protected]>
> >
> >Paul,
> >
> >Any chance you can provide us with the example code that demonstrates the
> >problem? If so, could you please mail it to [email protected]? We will
> >enter a bug report and will take a look. It will also help if you can
> >indicate OS, compiler version and MPI I/O version.
> >
> >Thank you!
> >
> >Elena
> >
> >
> >On Apr 20, 2010, at 8:29 AM, Paul Hilscher wrote:
> >
> >> Dear all,
> >>
> >> I have been trying to fix the following problem for more than 3 months
> >> but still have not succeeded; I hope some of you gurus can help me out.
> >>
> >> I am using HDF5 to store the results of a plasma turbulence code
> >> (basically 6-D and 3-D data, plus a table to store several scalars). In
> >> a single-CPU run, HDF5 (and parallel HDF5) works fine, but for larger
> >> CPU counts (and a large number of data output steps) I get the following
> >> error message at the end of the simulation, when I want to close the
> >> HDF5 file:
> >>
> >>
> >> ********* snip ****
> >>
> >> HDF5-DIAG: Error detected in HDF5 (1.8.4-patch1) MPI-process 24:
> >> #000: H5F.c line 1956 in H5Fclose(): decrementing file ID failed
> >> major: Object atom
> >> minor: Unable to close file
> >> #001: H5F.c line 1756 in H5F_close(): can't close file
> >> major: File accessability
> >> minor: Unable to close file
> >> #002: H5F.c line 1902 in H5F_try_close(): unable to flush cache
> >> major: Object cache
> >> minor: Unable to flush data from cache
> >> #003: H5F.c line 1681 in H5F_flush(): unable to flush metadata cache
> >> major: Object cache
> >> minor: Unable to flush data from cache
> >> #004: H5AC.c line 950 in H5AC_flush(): Can't flush.
> >> major: Object cache
> >> minor: Unable to flush data from cache
> >> #005: H5AC.c line 4695 in H5AC_flush_entries(): Can't propagate clean
> >> entries list.
> >> major: Object cache
> >> minor: Unable to flush data from cache
> >> #006: H5AC.c line 4450 in
> >> H5AC_propagate_flushed_and_still_clean_entries_list(): Can't receive
> >> and/or process clean slist broadcast.
> >> major: Object cache
> >> minor: Internal error detected
> >> #007: H5AC.c line 4595 in H5AC_receive_and_apply_clean_list(): Can't
> >> mark entries clean.
> >> major: Object cache
> >> minor: Internal error detected
> >> #008: H5C.c line 5150 in H5C_mark_entries_as_clean(): Listed entry not
> >> in cache?!?!?.
> >> major: Object cache
> >> minor: Internal error detected
> >> HDF5: infinite loop closing library
> >>
> >> D,G,A,S,T,F,F,AC,FD,P,FD,P,FD,P,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD
> >>
> >>
> >> ****** snap ***
> >>
> >> I get this error message deterministically if I increase the data output
> >> frequency (or the CPU count). Afterwards I cannot open the file anymore,
> >> because HDF5 complains it is corrupted (of course, because it was not
> >> properly closed).
> >> I get the same error on different computers (with different
> >> environments, e.g. compiler, OpenMPI library, distribution).
> >> Any idea how to fix this problem is highly appreciated.
> >>
> >>
> >> Thanks for your help & time
> >>
> >> Paul
> >>
> >>
> >>
> >>
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org