Hi Mark,

Thanks for your reply.

I try to be very careful about closing everything I open, so I think I
can answer your first question with a "no". It also seems unlikely to be
the cause, since a forgotten H5Xclose would bite with 8 processes just
as much as with 5. There are no error messages when the program
terminates normally (for np = 4, say), nor when it deadlocks.
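
Following your H5Fget_obj_count suggestion, a check along these lines
just before the H5Fclose call should confirm that (file_id and rank
stand in for my actual variables):

    /* hypothetical check: count HDF5 objects still open in this file */
    ssize_t nopen = H5Fget_obj_count(file_id, H5F_OBJ_ALL);
    if (nopen > 1)  /* the file id itself counts as one open object */
        fprintf(stderr, "rank %d: %ld object(s) still open\n", rank, (long)nopen);

If anything besides the file id itself showed up there, that would point
to a leaked handle.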

The problem occurs both on actual parallel file systems (GPFS, Lustre)
and on a "normal" filesystem (Btrfs, I believe). I am not making any
system calls, nor do I modify or stat the filesystem (outside of the
file creation that happens via HDF5), as can be seen in the demonstrator
that was attached to the original email.

(Did that attachment get scrubbed? If it DID make it through with the
posting, could somebody try to run it with, say, np = 3, and see whether
the error is reproducible on their system?)

I have tried looking at the hang in TotalView, and the deadlock actually
occurs within H5Fclose. The call stack is
H5Fclose -> H5I_dec_app_ref -> H5F_close -> H5F_try_close -> H5F_dest ->
H5FD_truncate -> H5FD_mpio_truncate -> PMPI_Barrier -> ...
so the stuck ranks are sitting in the barrier that the MPI-IO driver
issues when it truncates the file at close; apparently not every rank
reaches it.
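
In case somebody wants to reproduce this without TotalView, Mark's
raise(SIGSTOP) suggestion should work just as well; a minimal sketch
(file_id stands in for the real handle):

    #include <signal.h>
    /* ... create and write the file as usual ... */
    raise(SIGSTOP);     /* pause so that gdb -p <pid> can attach to each rank */
    H5Fclose(file_id);  /* this is where the hang shows up */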

I'm not quite sure how to run valgrind on an MPI-enabled program, but
I'll try to find out.
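
(If I read the valgrind documentation correctly, something along these
lines should do it, with one log file per rank; the exact flags are my
best guess:

mpirun -np 3 valgrind --log-file=valgrind.%p.log ./a.out

where %p expands to each process's PID.)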

Again, most mistakes I can think of would also show up with 2, 4, or 8
processes, not only with 3, 5, 6, 7, 9, ...

The only obvious difference is that the number of elements written by
each process differs when np is not a power of 2. In the cases that
actually WORK, every process writes exactly the same number of elements
to the file. But that shouldn't actually be a problem...
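
Just to make explicit what I mean, here is a sketch of one way the rows
might be split unevenly (purely illustrative, not taken verbatim from
the attached demonstrator; filespace, memspace, dset, xfer_plist and
data are assumed to be set up elsewhere):

    /* illustrative uneven row split: nx rows shared among np ranks,
       the last rank picks up the remainder */
    hsize_t rows   = nx / np;
    hsize_t myrows = (rank == np - 1) ? nx - rows * (np - 1) : rows;
    hsize_t offset[2] = { (hsize_t)rank * rows, 0 };
    hsize_t count[2]  = { myrows, ny };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
    /* collective write: every rank calls this, just with a different count */
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, xfer_plist, data);

Every rank still takes part in the collective H5Dwrite and in H5Fclose,
just with different counts.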

Cheers,
Wolf


> Miller, Mark C. wrote:
>
> Some things to watch out for. . .
>
> Are you by chance accidentally leaving one or more objects in the
> file 'open' (e.g. did you forget some H5Xclose() call somewhere?). I
> cannot attest to that causing actual hangs in H5Fclose, but I know HDF5
> has some logic to detect a possible infinite loop in sym-link/group
> structures, for which it sometimes actually outputs a message along the
> lines of "…infinite loop detected while closing file 'foo.h5' . . .".
> I sometimes wind up using H5Fget_obj_count just prior to H5Fclose to
> try to debug this when it (occasionally) has happened for me.
>
> You say you are running in parallel. Is the file on an actual
> parallel filesystem? Are you by chance mucking with the filesystem's
> metadata via calls to stat or mkdir or chdir at any time before or
> after you create or close the HDF5 file? If so, are you ensuring
> parallel sync via MPI_Barrier before proceeding after such calls?
>
> The core counts you mention are small, so you might be able to
> raise(SIGSTOP) just before H5Fclose and then attach gdb (or TotalView)
> to several of the processes to see what's happening. Likewise, you
> might be able to run valgrind on each process (sending output to
> separate files) to help debug too.
>
> Sorry I don't have any other ideas. Good luck.
>
> Mark


>> Date: Tue, 07 Apr 2015 18:30:14 +0200
>> From: Wolf Dapp <[email protected]>
>> To: [email protected]
>> Subject: [Hdf-forum] parallel HDF5: H5Fclose hangs when not using a
>>   power of 2 number of processes
>> Message-ID: <[email protected]>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Dear hdf-forum members,
>>
>> I have a problem I am hoping someone can help me with. I have a
>> program that outputs a 2D array (contiguous, indexed linearly) using
>> parallel HDF5. When I choose a number of processes that is not a
>> power of 2 (i.e. not 1, 2, 4, 8, ...), H5Fclose() hangs, inexplicably.
>> I'm using HDF5 1.8.14 and OpenMPI 1.7.2, on top of GCC 4.8 with Linux.
>>
>> Can someone help me pinpoint my mistake?
>>
>> I have searched the forum, and the first hit [searching for
>> "h5fclose hangs"] turned out to be a user mistake that I am not making
>> (to the best of my knowledge). The second thread never got beyond the
>> initial problem description and didn't offer a solution.
>>
>> Attached is a (maybe insufficiently bare-boned, apologies)
>> demonstrator program. Strangely, the hang only happens if nx >= 32.
>> The code is adapted from an HDF5 example program.
>>
>> The demonstrator is compiled with h5pcc test.hangs.cpp -DVERBOSE
>> -lstdc++
>>
>> (On my system, for some strange reason, MPI has been compiled with
>> the deprecated C++ bindings, so I also need to link with -lmpi_cxx;
>> that shouldn't be necessary for anyone else. I hope that's not the
>> reason for the hangs.)
>>
>> Thanks in advance for your help!
>>
>> Wolf Dapp
