Hi Mohamad,

Thanks for your reply. The reason I suspected Lustre of being the
culprit is simply that the error does not appear on my personal
computer. I thought that maybe the files are written/opened too fast, or
too many at the same time, for Lustre's synchronization process to keep
up.

I am inserting various pieces of code that show how I am calling the
HDF5 library. Any comment on proper ways of doing so is much appreciated!

To open the file, I use the following code:


==================================================================

int H5Interface::OpenFile (std::string filename, int flag) {

    bool tried_once = false;

    struct timespec timesp;
    timesp.tv_sec = 0;
    timesp.tv_nsec = 200000000;

    for (int tries = 0; tries < 300; tries++) {
        try {
            H5::Exception::dontPrint();
            if(flag == 0) {
                file = H5::H5File (filename, H5F_ACC_TRUNC);
            } else if (flag == 1) {
                file.openFile(filename, H5F_ACC_RDONLY);
            } else if (flag == 2) {
                file.openFile(filename, H5F_ACC_RDWR);
            }

            if (tried_once) {
                std::cout << "Opening " << filename << " succeeded after "
                          << tries << " tries" << std::endl;
            }
            return 0;

        } catch (const H5::FileIException&) {
            tried_once = true;
        } catch (const H5::DataSetIException&) {
            tried_once = true;
        } catch (const H5::DataSpaceIException&) {
            tried_once = true;
        }
        nanosleep(&timesp, NULL);
    }
    std::cerr << "H5Interface:\tOpening " << filename << " failed" << std::endl;
    return -1;
}

It often happens that opening a file succeeds only after one or two retries.

I write and read strings like this:


==================================================================

int H5Interface::WriteString(std::string path, std::string value) {
    try {
        H5::Exception::dontPrint();
        H5::StrType str_t(H5::PredType::C_S1, H5T_VARIABLE);
        H5std_string str (value);
        hsize_t dims[1] = { 1 };
        H5::DataSpace str_space(uint(1), dims, NULL);
        H5::DataSet str_set;
        if (H5Lexists(file.getId(), path.c_str(), H5P_DEFAULT) > 0) {
            str_set = file.openDataSet(path);
        } else {
            str_set = file.createDataSet(path, str_t, str_space);
        }
        str_set.write (str, str_t);
        str_set.close();
    }
    catch (const H5::FileIException& error) {
        // error.printError();
        return -1;
    }
    catch (const H5::DataSetIException& error) {
        // error.printError();
        return -1;
    }
    catch (const H5::DataSpaceIException& error) {
        // error.printError();
        return -1;
    }
    return 0;
}


==================================================================


int H5Interface::ReadString(std::string path, std::string * data) {
    try {
        H5::Exception::dontPrint();
        if (H5Lexists(file.getId(), path.c_str(), H5P_DEFAULT) > 0) {
            H5::StrType str_t(H5::PredType::C_S1, H5T_VARIABLE);
            H5std_string str;
            H5::DataSet str_set = file.openDataSet(path);
            str_set.read (str, str_t);
            str_set.close();
            *data = std::string(str);
        }
    }
    catch (const H5::FileIException& error) {
        // error.printError();
        return -1;
    }
    catch (const H5::DataSetIException& error) {
        // error.printError();
        return -1;
    }
    catch (const H5::DataSpaceIException& error) {
        // error.printError();
        return -1;
    }
    return 0;
}



And finally for writing and reading boost::multi_arrays, for example:



==================================================================


int H5Interface::Read2IntMultiArray(std::string path,
                                    boost::multi_array<int,2>& data) {
    try {
        H5::DataSet v_set = file.openDataSet(path);
        H5::DataSpace space = v_set.getSpace();
        hsize_t dims[2];

        int rank = space.getSimpleExtentDims( dims );

        H5::DataSpace mspace(rank, dims);
        data.resize(boost::extents[dims[0]][dims[1]]);
        // boost::multi_array stores its elements contiguously in row-major
        // order, so the dataset can be read directly into it; this also
        // avoids the non-standard variable-length array on the stack.
        v_set.read(data.data(), H5::PredType::NATIVE_INT, mspace, space);
        v_set.close();
    }
    [...]


==================================================================


int H5Interface::WriteIntMatrix(std::string path, uint rows,
                                 uint cols, int * data) {
    try {
        H5::Exception::dontPrint();
        hsize_t dims_m[2] = { rows, cols };
        H5::DataSpace v_space(2, dims_m);
        H5::DataSet v_set;
        if (H5Lexists(file.getId(), path.c_str(), H5P_DEFAULT) > 0) {
            v_set = file.openDataSet(path);
        } else {
            v_set = file.createDataSet(path, H5::PredType::NATIVE_INT,
                                       v_space);
        }
        v_set.write(data, H5::PredType::NATIVE_INT);
        v_set.close();
    }
    [...]



As for the workflow: a scheduler provides the basic h5 file with all the
parameters and tells the workers to load this file and then put their
measurements in, so the files are growing as time goes by.

Have a nice day, Peter



On 11/19/2012 03:36 PM, Mohamad Chaarawi wrote:
> Hi Peter,
>
> The problem does sound strange.
> I do not understand why file locking helped reduce errors. I thought
> you said each process writes to its own file anyway, so locking the
> file or having one process manage the reads/writes should not matter
> anyway.
>
> Is it possible you could send me a piece of code from your simulation
> that is performing I/O, that I can look at and diagnose further?
> A program that I can run and replicates the problem (on Lustre) would
> be great. If that is not possible, then please just describe or
> copy-paste how you are calling into the HDF5 library for your I/O.
>
> Thanks,
> Mohamad
>
> On 11/18/2012 10:24 AM, Peter Boertz wrote:
>> Hello everyone,
>>
>> I run simulations on a cluster (using OpenMPI) with a Lustre filesystem
>> and I use HDF5 1.8.9 for data output. Each process has its own file, so
>> I believe there is no need for the parallel HDF5 version, is this
>> correct?
>>
>> When a larger number of processes (> 4) want to dump their data at the
>> same time, I get various errors: paths or objects not found, or some
>> other operation failing. I can't really make out the reason, as the
>> code works fine on my personal workstation and runs for days with writes
>> / reads every 5 minutes without failing.
>>
>> What I have tried so far is having one process manage all the read/write
>> operations so that all other processes have to check whether anyone else
>> is already dumping their data. I also implemented
>> boost::interprocess::file_lock to prevent writes to the same file,
>> which the queuing system already excludes anyway, so this was more of
>> a paranoid move to be absolutely sure. All that reduced the number of
>> fatal errors significantly, but did not completely get rid of them.
>> The biggest problem is that some of the files get corrupted when the
>> program crashes, which is especially inconvenient.
>>
>> My question is whether there is any obvious mistake I am making, and
>> how I would go about solving this issue. My initial guess is that the
>> Lustre filesystem plays some role in this, since it is the only
>> difference from my personal computer, where everything runs smoothly.
>> As I said, neither the error messages nor the traceback show any
>> consistency.
>>
>> bye, Peter
>>
>>
>> _______________________________________________
>> Hdf-forum is for HDF software users discussion.
>> [email protected]
>> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
>
>

