Hi Paul,

   From the error message you provided, I think I can tell you 
the proximate cause of the failure.  

   Briefly, HDF5 maintains metadata consistency across all processes
by requiring all processes to perform all operations that modify 
metadata collectively so that all processes see the same stream of 
dirty metadata.
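
   Just to make "collectively" concrete, here is a minimal sketch of 
the correct pattern (purely illustrative -- the file name, dataset 
name, and dimensions are invented, not taken from your code).  Every 
process opens the file with an MPI-IO file access property list and 
then makes the same metadata-modifying calls, in the same order:

      #include <mpi.h>
      #include <hdf5.h>

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);

          /* All processes open the file together via MPI-IO. */
          hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
          H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
          hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC,
                                 H5P_DEFAULT, fapl);

          /* Dataset creation modifies metadata, so ALL processes
           * must make this call, with identical arguments.        */
          hsize_t dims[1] = {100};
          hid_t space = H5Screate_simple(1, dims, NULL);
          hid_t dset  = H5Dcreate2(file, "/data", H5T_NATIVE_DOUBLE,
                                   space, H5P_DEFAULT, H5P_DEFAULT,
                                   H5P_DEFAULT);

          /* All processes close the objects and the file as well. */
          H5Dclose(dset);
          H5Sclose(space);
          H5Fclose(file);
          H5Pclose(fapl);

          MPI_Finalize();
          return 0;
      }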

   This in turn allows us to use only the process 0 metadata 
cache to write dirty metadata to file -- all other processes are 
required to hold dirty metadata in cache until informed by the 
process 0 metadata cache that the piece of dirty metadata in question
has been written to file and is now clean.  

   As a sanity check, the non-process-0 metadata caches verify that 
all the entries listed in a "these entries are now clean" message
are both in cache and marked dirty upon receipt of the message.

   It is this sanity check that is failing and causing your crash 
on shutdown.  It implies that process 0 thinks that some piece of 
metadata is dirty, but at least one other process thinks the entry
is clean.

   I can think of two ways for this to happen:

   1) a bug in the HDF5 library.

   2) a user program that either:

      a) makes a library call that modifies metadata on 
         some but not all processes, or 

      b) makes library calls that modify metadata on all processes 
         but in different order on different processes.

For a list of library calls that must be called collectively, please 
see:

        http://www.hdfgroup.org/HDF5/faq/parallel-apis.html#coll
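
   And to make case 2a concrete, here is a hypothetical fragment 
(reusing the made-up "file" and "space" handles from the sketch above) 
showing the kind of pattern that produces exactly this sort of 
inconsistency:

      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* WRONG: H5Dcreate2 modifies metadata, but only rank 0 calls
       * it.  The metadata caches on the other processes never see
       * the new object, so process 0 and the other processes end up
       * disagreeing about what is dirty.                            */
      if (rank == 0) {
          hid_t dset = H5Dcreate2(file, "/diagnostics",
                                  H5T_NATIVE_DOUBLE, space,
                                  H5P_DEFAULT, H5P_DEFAULT,
                                  H5P_DEFAULT);
          H5Dclose(dset);
      }

      /* Correct: drop the rank test so every process makes the same
       * call, in the same order.  Raw data writes (H5Dwrite) may be
       * independent; it is the metadata-modifying calls that must be
       * collective.                                                  */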

   Unless the above points to an obvious solution, please send us 
the sample code that Elena mentioned.  If there is a bug here, I'd 
like to squash it.

                                     Best regards,

                                     John Mainzer

>From: Elena Pourmal <[email protected]>
>Date: Tue, 20 Apr 2010 08:52:59 -0500
>To: HDF Users Discussion List <[email protected]>
>Subject: Re: [Hdf-forum] Infinite closing loop with (parallel) HDF-1.8.4-1
>Reply-To: HDF Users Discussion List <[email protected]>
>
>Paul,
>
>Any chance you can provide us with the example code that demonstrates the 
>problem? If so, could you please mail it to [email protected]? We will 
>enter a bug report and will take a look. It will also help if you can 
>indicate OS, compiler version and MPI I/O version.
>
>Thank you!
>
>Elena
>
>
>On Apr 20, 2010, at 8:29 AM, Paul Hilscher wrote:
>
>> Dear all, 
>> 
>> I have been trying to fix the following problem for more than 3 months but 
>> have still not succeeded; I hope
>> some of you gurus could help me out.
>> 
>> I am using HDF5 to store the results from a plasma turbulence code (basically 
>> 6-D and 3-D data, and a table storing several scalar values). In a single-CPU 
>> run, HDF5 (and parallel HDF5) works fine, but for a larger CPU count (and a 
>> large number of data output steps) I get the following error message at the 
>> end of the simulation when I close the HDF5 file: 
>> 
>> 
>> *********  snip ****
>> 
>> HDF5-DIAG: Error detected in HDF5 (1.8.4-patch1) MPI-process 24:
>>   #000: H5F.c line 1956 in H5Fclose(): decrementing file ID failed
>>     major: Object atom
>>     minor: Unable to close file
>>   #001: H5F.c line 1756 in H5F_close(): can't close file
>>     major: File accessability
>>     minor: Unable to close file
>>   #002: H5F.c line 1902 in H5F_try_close(): unable to flush cache
>>     major: Object cache
>>     minor: Unable to flush data from cache
>>   #003: H5F.c line 1681 in H5F_flush(): unable to flush metadata cache
>>     major: Object cache
>>     minor: Unable to flush data from cache
>>   #004: H5AC.c line 950 in H5AC_flush(): Can't flush.
>>     major: Object cache
>>     minor: Unable to flush data from cache
>>   #005: H5AC.c line 4695 in H5AC_flush_entries(): Can't propagate clean entries list.
>>     major: Object cache
>>     minor: Unable to flush data from cache
>>   #006: H5AC.c line 4450 in H5AC_propagate_flushed_and_still_clean_entries_list(): Can't receive and/or process clean slist broadcast.
>>     major: Object cache
>>     minor: Internal error detected
>>   #007: H5AC.c line 4595 in H5AC_receive_and_apply_clean_list(): Can't mark entries clean.
>>     major: Object cache
>>     minor: Internal error detected
>>   #008: H5C.c line 5150 in H5C_mark_entries_as_clean(): Listed entry not in cache?!?!?.
>>     major: Object cache
>>     minor: Internal error detected
>> HDF5: infinite loop closing library
>>       
>> D,G,A,S,T,F,F,AC,FD,P,FD,P,FD,P,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD,FD
>> 
>> 
>> ****** snap ***
>> 
>> I get this error message deterministically if I increase the data output 
>> frequency (or CPU number). Afterwards I cannot open the file anymore, because 
>> HDF5 complains it is corrupted (understandably, since it was not properly 
>> closed). I get the same error on different computers (with different 
>> environments, e.g. compiler, OpenMPI library, distribution).
>> Any idea how to fix this problem would be highly appreciated.
>> 
>> 
>> Thanks for your help & time
>> 
>> Paul
>>  
>> 
>> 
>> 
>
>

