I think the problem may be that you are trying to execute a collective write to separate datasets. That would explain why collective hangs and independent succeeds.
I am a bit rusty on HDF5's parallel I/O semantics, but AFAIK a collective write can only be done to the same dataset. That does NOT mean each processor has to have an identical dsetID (e.g. memcmp(&proc_1_dsetID, &proc_2_dsetID, sizeof(hid_t)) may be nonzero), but it does mean that the dataset object in the file to which each processor's dsetID refers has to be the same. In other words, the name (or path) of the dataset used in the create/open call needs to have been the same on every processor.

To issue writes to different datasets simultaneously in parallel, I think your only option is independent.

I wonder if you're aiming to do collective writes to different datasets because you expect that collective I/O will be more easily 'coordinated' by the underlying filesystem and therefore has a better chance of outperforming independent I/O. If so, I don't know if that very often turns out to be true/possible in practice.

I hope others with a little more parallel I/O experience might chime in ;)

Mark

On Sat, 2012-10-13 at 10:48 +0200, Håkon Strandenes wrote:
> Hi,
>
> I have (yet) another problem with the HDF5 library. I am trying to write
> some data in parallel to a file, where each process writes its data to
> its own dataset. The datasets are first created (as collective
> operations), and then H5Dwrite hangs when the data are to be written. No
> error messages are printed, the processes just hang. I have used GDB on
> the hanging processes (all processes), and confirmed that it is actually
> H5Dwrite that hangs.
>
> The strange thing is that this does not always happen, sometimes it
> works fine. To make it even stranger, it seems that the probability of
> failure increases with increased problem size and number of processes
> (or is that really strange?). These writes are in a time-loop, and
> sometimes a few steps finish before one write hangs.
>
> I have also found out that if I set the transfer mode to
> H5FD_MPIO_INDEPENDENT it seems that everything is working fine.
>
> I have tried this on two computers, one workstation and one cluster. The
> workstation uses OpenMPI with HDF5 1.8.4 and the cluster uses SGI's
> MPT-MPI with HDF5 1.8.7. Based on the completely different MPI packages
> and systems, I think MPI and other system issues can be ruled out. The
> remaining sources of error are then my code (probably) and HDF5 (not so
> sure about that).
>
> I have attached an example code that shows how I am doing the
> HDF5-stuff. Unfortunately it is not runnable, but at least you can see
> how I create and write to the dataset.
>
> Thanks in advance for all help.
>
> Best regards,
> Håkon Strandenes
> _______________________________________________
> Hdf-forum is for HDF software users discussion.
> [email protected]
> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org

--
Mark C. Miller, Lawrence Livermore National Laboratory
================!!LLNL BUSINESS ONLY!!================
[email protected]      urgent: [email protected]
T:8-6 (925)-423-5901    M/W/Th:7-12,2-7 (530)-753-8511
