Yes, Mark is correct. Your program is erroneous.
The current interface for reading from and writing to datasets
(collectively) requires all processes to call the operation for each
read/write. You can correct your program by having each process
participate with a NULL selection in every read/write operation except
the one for the dataset that belongs to that process, or just use
independent I/O.
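A rough sketch of that workaround (the dataset names "dset<i>", the
size N and the surrounding setup are my own assumptions for
illustration, not taken from your code):

#include <stdio.h>
#include <hdf5.h>

#define N 1024

/* Assumes nprocs 1-D datasets "dset0".."dset<nprocs-1>" of length N,
 * already created collectively. */
void write_own_dataset(hid_t file, int rank, int nprocs, const double *data)
{
    hsize_t dims[1] = { N };
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    for (int i = 0; i < nprocs; i++) {
        char name[32];
        snprintf(name, sizeof(name), "dset%d", i);

        hid_t dset   = H5Dopen(file, name, H5P_DEFAULT);
        hid_t fspace = H5Dget_space(dset);
        hid_t mspace = H5Screate_simple(1, dims, NULL);

        if (i == rank) {
            H5Sselect_all(fspace);   /* this rank owns the dataset */
            H5Sselect_all(mspace);
        } else {
            H5Sselect_none(fspace);  /* participate, but write nothing */
            H5Sselect_none(mspace);
        }

        /* every rank calls H5Dwrite for every dataset, so the
         * collective requirement is satisfied and nothing hangs */
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, data);

        H5Sclose(mspace);
        H5Sclose(fspace);
        H5Dclose(dset);
    }
    H5Pclose(dxpl);
}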
We are working on a new interface that would allow collective access to
multiple datasets simultaneously, so stay tuned :-)
Thanks,
Mohamad
On 10/13/2012 10:44 AM, Mark Miller wrote:
I think the problem may be that you are trying to execute a collective
write to separate datasets. That would explain why collective hangs and
independent succeeds.
I am a bit rusty on HDF5's parallel I/O semantics, but AFAIK a
collective write can only be done to the same dataset. That does NOT
mean each processor has to have an identical dsetID (e.g.
memcmp(&proc_1_dsetID, &proc_2_dsetID, sizeof(hid_t)) may be nonzero),
but it does mean the dataset object in the file that each processor's
dsetID references has to be the same. In other words, the name (or
path) of the dataset used in the create/open call needs to have been
the same.
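So the legal collective pattern looks something like the sketch below,
where all ranks create and write the same dataset and only the
selections differ (the name "dset", the 1-D layout and the per-rank
count N are made up for illustration):

#include <hdf5.h>

#define N 1024

/* one shared dataset, one hyperslab per rank */
void collective_write(hid_t file, int rank, int nprocs, const double *data)
{
    hsize_t dims[1]   = { (hsize_t)nprocs * N };
    hsize_t count[1]  = { N };
    hsize_t offset[1] = { (hsize_t)rank * N };

    hid_t fspace = H5Screate_simple(1, dims, NULL);
    hid_t dset   = H5Dcreate(file, "dset", H5T_NATIVE_DOUBLE, fspace,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hid_t mspace = H5Screate_simple(1, count, NULL);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, offset, NULL, count, NULL);

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    /* all ranks write to the SAME dataset; only the selections differ */
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, data);

    H5Pclose(dxpl);
    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Dclose(dset);
}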
To issue writes to different datasets simultaneously in parallel, I
think your only option is independent I/O.
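That is, something along these lines (again just a sketch with made-up
per-rank dataset names):

#include <stdio.h>
#include <hdf5.h>

/* each rank opens and writes only its own dataset, independently */
void independent_write(hid_t file, int rank, const double *data)
{
    char name[32];
    snprintf(name, sizeof(name), "dset%d", rank);

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_INDEPENDENT); /* the default, actually */

    hid_t dset = H5Dopen(file, name, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, dxpl, data);

    H5Dclose(dset);
    H5Pclose(dxpl);
}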
I wonder if you're aiming to do collective I/O to different datasets
because you expect that collective will be more easily 'coordinated' by
the underlying filesystem and therefore has a better chance of
outperforming independent I/O. If so, I don't know whether that very
often turns out to be true/possible in practice.
I hope others with a little more parallel I/O experience might chime
in ;)
Mark
On Sat, 2012-10-13 at 10:48 +0200, Håkon Strandenes wrote:
Hi,
I have (yet) another problem with the HDF5 library. I am trying to
write some data in parallel to a file, where each process writes its
data to its own dataset. The datasets are first created (as collective
operations), and then H5Dwrite hangs when the data are to be written.
No error messages are printed; the processes just hang. I have used GDB
on the hanging processes (all processes) and confirmed that it is
actually H5Dwrite that hangs.
The strange thing is that this does not always happen; sometimes it
works fine. To make it even stranger, it seems that the probability of
failure increases with increased problem size and number of processes
(or is that really strange?). These writes are in a time loop, and
sometimes a few steps finish before one write hangs.
I have also found out that if I set the transfer mode to
H5FD_MPIO_INDEPENDENT, everything seems to work fine.
I have tried this on two computers, one workstation and one cluster.
The workstation uses OpenMPI with HDF5 1.8.4 and the cluster uses SGI's
MPT-MPI with HDF5 1.8.7. Based on the completely different MPI packages
and systems, I think MPI and other system issues can be ruled out. The
remaining sources of error are then my code (probably) and HDF5 (I am
not so sure about that).
I have attached an example code that shows how I am doing the HDF5
calls. Unfortunately it is not runnable, but at least you can see how I
create and write to the datasets.
Thanks in advance for all help.
Best regards,
Håkon Strandenes
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org