Hi,

I have (yet) another problem with the HDF5 library. I am trying to write data in parallel to a file, where each process writes its data to its own dataset. The datasets are first created (as collective operations), and then H5Dwrite hangs when the data are to be written. No error messages are printed; the processes just hang. I have attached GDB to the hanging processes (all of them) and confirmed that it really is H5Dwrite that hangs.

The strange thing is that this does not always happen; sometimes it works fine. Even stranger, the probability of failure seems to increase with problem size and number of processes (or is that really strange?). These writes are inside a time loop, and sometimes a few steps finish before a write hangs.

I have also found that if I set the transfer mode to H5FD_MPIO_INDEPENDENT, everything seems to work fine.
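For clarity, the workaround I mean is just a different setting on the dataset transfer property list; a minimal sketch (the variable name xferPlist is mine, not from the attached code):

```c
/* Workaround sketch: request independent (non-collective) MPI-IO transfers
 * instead of H5FD_MPIO_COLLECTIVE. Pass this list to H5Dwrite. */
hid_t xferPlist = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(xferPlist, H5FD_MPIO_INDEPENDENT);
/* ... H5Dwrite(dsetID, memType, H5S_ALL, H5S_ALL, xferPlist, buf); ... */
H5Pclose(xferPlist);
```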

I have tried this on two computers, one workstation and one cluster. The workstation uses OpenMPI with HDF5 1.8.4, and the cluster uses SGI's MPT-MPI with HDF5 1.8.7. Given the completely different MPI packages and systems, I think MPI and other system issues can be ruled out. That leaves my code (probably) and HDF5 (less sure about that) as the remaining sources of error.

I have attached example code that shows how I am doing the HDF5 part. Unfortunately it is not runnable as-is, but at least you can see how I create and write to the datasets.

Thanks in advance for any help.

Best regards,
Håkon Strandenes
// Set up file access property list with parallel I/O access
hid_t plistID = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(plistID, MPI_COMM_WORLD, MPI_INFO_NULL);

// Create a new file collectively.
hid_t fileID = H5Fcreate
    (
        dataFile,
        H5F_ACC_TRUNC, 
        H5P_DEFAULT,
        plistID
    );

// Close the property list
H5Pclose(plistID);


// Create the different datasets (needs to be done collectively)
char datasetName[80];
hsize_t dimsf[2];
hid_t fileSpace;
hid_t dsetID;
// (plistID declared above is reused below)

// The forAll is a macro, just looping over all processes, and nPoints is an
// array with the number of points the dataset is going to write
forAll(nPoints, proc)
{
    
    // Create the dataspace for the dataset
    dimsf[0] = nPoints[proc];
    dimsf[1] = 3;
    fileSpace = H5Screate_simple(2, dimsf, NULL);
    
    // Set property to create parent groups as necessary
    plistID = H5Pcreate(H5P_LINK_CREATE);
    H5Pset_create_intermediate_group(plistID, 1);
    
    // Create the dataset for points
    sprintf
        (
            datasetName,
            "MESH/%s/processor%i/POINTS",
            mesh_.time().timeName().c_str(),
            proc
        );
    
    dsetID = H5Dcreate2
        (
            fileID,
            datasetName,
            H5T_NATIVE_DOUBLE, // note: H5T_SCALAR is not a standard HDF5 type
            fileSpace,
            plistID,
            H5P_DEFAULT,
            H5P_DEFAULT
        );
    H5Dclose(dsetID);
    H5Pclose(plistID);
    H5Sclose(fileSpace);
}


// Open correct dataset for this process
sprintf
    (
        datasetName,
        "MESH/%s/processor%i/POINTS",
        mesh_.time().timeName().c_str(),
        Pstream::myProcNo()
    );
dsetID = H5Dopen2(fileID, datasetName, H5P_DEFAULT);


// Create property list for collective dataset write.
plistID = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(plistID, H5FD_MPIO_COLLECTIVE);


// Do the actual write
H5Dwrite
    (
        dsetID, 
        H5T_NATIVE_DOUBLE, // note: H5T_SCALAR is not a standard HDF5 type
        H5S_ALL,
        H5S_ALL,
        plistID,
        pointList
    );

// Close/release resources.
H5Dclose(dsetID);
H5Pclose(plistID);


// Close the file.
H5Fclose(fileID);
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org