Hi Andy,

thanks for the hint! The cluster I am working on currently runs 1.6.7.1
(from /proc/fs/lustre/version). The current solution of minimizing the
addition of data to existing files seems to be working fine, although it
is a bit inefficient. Maybe after the next update it will work again. I
can only hope they don't upgrade to 1.8.4 :)

bye, Peter


On 11/22/2012 02:06 AM, Salnikov, Andrei A. wrote:
> Hi Peter,
>
> which version of Lustre are you using? We once observed a very strange
> corruption happening when we wrote HDF5 files to Lustre. That was seen
> with Lustre client version 1.8.4. After we switched to 1.8.7 the
> problem disappeared.
>
> Cheers,
> Andy
>
>
>> -----Original Message-----
>> From: Hdf-forum [mailto:[email protected]] On Behalf Of Peter
>> Boertz
>> Sent: Wednesday, November 21, 2012 12:46 AM
>> Subject: Re: [Hdf-forum] Problems with HDF5 on a Lustre filesystem
>>
>> Hi Mohamad,
>>
>> thanks again for your help! Once the program failed, the structure was
>> corrupted and h5dump was unable to read the file. Any other file which
>> was also opened at the time but not by the crashing node survived the
>> crash intact. I have rewritten the dumping part and moved from multiple
>> classes writing to the same file to having one class that stores all
>> variables and dumps on demand, which appears to work better (tested with
>> 192 nodes). In the previous approach, I opened and closed the file many
>> times just to add one variable or an array of variables; maybe that was
>> the problem? Although I still find it weird that it worked flawlessly on
>> my personal computer ...
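>>
>> For reference, the new approach looks roughly like this (a simplified
>> sketch, not the actual code; the class name and members here are made
>> up):
>>
>> ==================================================================
>>
>> // Sketch of the "collect everything, then dump once" pattern.
>> // All values are cached in memory and written in a single
>> // open-write-close cycle instead of one file open per variable.
>> #include <map>
>> #include <string>
>> #include "H5Cpp.h"
>>
>> class ResultBuffer {
>> public:
>>     void AddScalar(const std::string& path, double value) {
>>         scalars_[path] = value;               // only caches in memory
>>     }
>>
>>     int Dump(const std::string& filename) {
>>         try {
>>             H5::Exception::dontPrint();
>>             H5::H5File file(filename, H5F_ACC_RDWR);
>>             hsize_t dims[1] = { 1 };
>>             H5::DataSpace space(1, dims);
>>             std::map<std::string, double>::const_iterator it;
>>             for (it = scalars_.begin(); it != scalars_.end(); ++it) {
>>                 H5::DataSet ds = file.createDataSet(
>>                     it->first, H5::PredType::NATIVE_DOUBLE, space);
>>                 ds.write(&it->second, H5::PredType::NATIVE_DOUBLE);
>>                 ds.close();
>>             }
>>             file.close();
>>         } catch (H5::Exception&) {
>>             return -1;
>>         }
>>         return 0;
>>     }
>>
>> private:
>>     std::map<std::string, double> scalars_;   // path -> value
>> };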
>>
>> Peter
>>
>>
>> On 11/20/2012 03:29 PM, Mohamad Chaarawi wrote:
>>> Hi Peter,
>>>
>>> Yes, nothing seems unusual to me in the case where each process
>>> accesses its own file.
>>> Since you mentioned that this works on your local filesystem, did you
>>> try to check the structure of the files there and make sure that they
>>> are correct (using h5dump or other tools)?
>>> Otherwise I'm not sure what could be wrong. I'm not familiar with C++
>>> either, so someone else may have other comments.
>>>
>>> Mohamad
>>>
>>> On 11/19/2012 11:11 AM, Peter Boertz wrote:
>>>> Hi Mohamad,
>>>>
>>>> thanks for your reply. The reason I suspected Lustre of being the
>>>> culprit is simply that the error does not appear on my personal
>>>> computer. I thought that maybe the files are written/opened too fast,
>>>> or too many at once, for the synchronization process of Lustre to
>>>> handle.
>>>>
>>>> I am inserting various pieces of code that show how I am calling the
>>>> HDF5 library. Any comment on proper ways of doing so is much
>>>> appreciated!
>>>>
>>>> To open the file, I use the following code:
>>>>
>>>>
>>>> ==================================================================
>>>>
>>>> int H5Interface::OpenFile (std::string filename, int flag) {
>>>>
>>>>      bool tried_once = false;
>>>>
>>>>      // Wait 200 ms between attempts, retrying up to 300 times (~1 minute).
>>>>      struct timespec timesp;
>>>>      timesp.tv_sec = 0;
>>>>      timesp.tv_nsec = 200000000;    // 200 ms
>>>>
>>>>      for (int tries = 0; tries < 300; tries++) {
>>>>          try {
>>>>              H5::Exception::dontPrint();
>>>>              if(flag == 0) {
>>>>                  file = H5::H5File (filename, H5F_ACC_TRUNC);
>>>>              } else if (flag == 1) {
>>>>                  file.openFile(filename, H5F_ACC_RDONLY);
>>>>              } else if (flag == 2) {
>>>>                  file.openFile(filename, H5F_ACC_RDWR);
>>>>              }
>>>>
>>>>              if (tried_once) {
>>>>                  std::cout << "Opening " << filename << " succeeded after "
>>>>                            << tries << " tries" << std::endl;
>>>>              }
>>>>              return 0;
>>>>
>>>>          } catch( FileIException error ) {
>>>>              tried_once = true;
>>>>          }
>>>>
>>>>          catch( DataSetIException error ) {
>>>>              tried_once = true;
>>>>          }
>>>>
>>>>          catch( DataSpaceIException error ) {
>>>>              tried_once = true;
>>>>          }
>>>>          nanosleep(&timesp, NULL);
>>>>      }
>>>>      std::cerr << "H5Interface:\tOpening " << filename << " failed" << std::endl;
>>>>      return -1;
>>>> }
>>>>
>>>> It often happens that opening a file succeeds only after one or two
>>>> retries.
>>>>
>>>> I write and read strings like this:
>>>>
>>>>
>>>> ==================================================================
>>>>
>>>> int H5Interface::WriteString(std::string path, std::string value) {
>>>>      try {
>>>>          H5::Exception::dontPrint();
>>>>          H5::StrType str_t(H5::PredType::C_S1, H5T_VARIABLE);
>>>>          H5std_string str (value);
>>>>          hsize_t dims[1] = { 1 };
>>>>          H5::DataSpace str_space(1, dims);
>>>>          H5::DataSet str_set;
>>>>          if (H5Lexists(file.getId(), path.c_str(), H5P_DEFAULT)) {
>>>>              str_set = file.openDataSet(path);
>>>>          } else {
>>>>              str_set = file.createDataSet(path, str_t, str_space);
>>>>          }
>>>>          str_set.write (str, str_t);
>>>>          str_set.close();
>>>>      }
>>>>      catch( FileIException error ) {
>>>>          // error.printError();
>>>>          return -1;
>>>>      }
>>>>
>>>>      catch( DataSetIException error ) {
>>>>          // error.printError();
>>>>          return -1;
>>>>      }
>>>>
>>>>      catch( DataSpaceIException error ) {
>>>>          // error.printError();
>>>>          return -1;
>>>>      }
>>>>      return 0;
>>>> }
>>>>
>>>>
>>>> ==================================================================
>>>>
>>>>
>>>> int H5Interface::ReadString(std::string path, std::string * data) {
>>>>      try {
>>>>          H5::Exception::dontPrint();
>>>>          if (H5Lexists(file.getId(), path.c_str(), H5P_DEFAULT)) {
>>>>              H5::StrType str_t(H5::PredType::C_S1, H5T_VARIABLE);
>>>>              H5std_string str;
>>>>              H5::DataSet str_set = file.openDataSet(path);
>>>>              str_set.read (str, str_t);
>>>>              str_set.close();
>>>>              *data = std::string(str);
>>>>          }
>>>>      }
>>>>      catch( FileIException error ) {
>>>>          // error.printError();
>>>>          return -1;
>>>>      }
>>>>
>>>>      catch( DataSetIException error ) {
>>>>          // error.printError();
>>>>          return -1;
>>>>      }
>>>>
>>>>      catch( DataSpaceIException error ) {
>>>>          // error.printError();
>>>>          return -1;
>>>>      }
>>>>      return 0;
>>>> }
>>>>
>>>>
>>>>
>>>> And finally for writing and reading boost::multi_arrays, for example:
>>>>
>>>>
>>>>
>>>> ==================================================================
>>>>
>>>>
>>>> int H5Interface::Read2IntMultiArray(std::string path,
>>>>                                      boost::multi_array<int,2>& data) {
>>>>      try {
>>>>          H5::DataSet v_set = file.openDataSet(path);
>>>>          H5::DataSpace space = v_set.getSpace();
>>>>          hsize_t dims[2];
>>>>
>>>>          int rank = space.getSimpleExtentDims( dims );
>>>>
>>>>          DataSpace mspace(rank, dims);
>>>>          data.resize(boost::extents[dims[0]][dims[1]]);
>>>>          // boost::multi_array storage is contiguous, so the dataset
>>>>          // can be read directly into it without a temporary array.
>>>>          v_set.read( data.data(), PredType::NATIVE_INT, mspace, space );
>>>>          v_set.close();
>>>>      }
>>>>      [...]
>>>>
>>>>
>>>> ==================================================================
>>>>
>>>>
>>>> int H5Interface::WriteIntMatrix(std::string path, uint rows,
>>>>                                   uint cols, int * data) {
>>>>      try {
>>>>          H5::Exception::dontPrint();
>>>>          hsize_t dims_m[2] = { rows, cols };
>>>>          H5::DataSpace v_space (2, dims_m);
>>>>          H5::DataSet v_set;
>>>>          if (H5Lexists(file.getId(), path.c_str(), H5P_DEFAULT)) {
>>>>              v_set = file.openDataSet(path);
>>>>          } else {
>>>>              v_set = file.createDataSet(path, H5::PredType::NATIVE_INT,
>>>>                                         v_space);
>>>>          }
>>>>          v_set.write(data, H5::PredType::NATIVE_INT);
>>>>          v_set.close();
>>>>      }
>>>>      [...]
>>>>
>>>>
>>>>
>>>> As far as the workflow goes, a scheduler provides the basic h5 file
>>>> with all the parameters and tells the workers to load this file and
>>>> then put their measurements in. So they are enlarging the file as time
>>>> goes by.
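>>>>
>>>> Roughly, a worker's dump step then looks like this (just a sketch; the
>>>> dataset paths, "worker_id", and the surrounding variables are
>>>> placeholders, and the real code checks every return value):
>>>>
>>>> ==================================================================
>>>>
>>>> // Illustrative worker dump step: reopen the scheduler-provided file
>>>> // read-write and add this step's results, so the file grows over time.
>>>> std::vector<int> counts(rows * cols);
>>>> // ... fill counts with the current measurement ...
>>>> H5Interface h5;
>>>> if (h5.OpenFile(filename, 2) == 0) {           // flag 2 -> H5F_ACC_RDWR
>>>>     h5.WriteString("/meta/last_writer", worker_id);
>>>>     h5.WriteIntMatrix("/results/step_0042", rows, cols, &counts[0]);
>>>> }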
>>>>
>>>> Have a nice day, Peter
>>>>
>>>>
>>>>
>>>> On 11/19/2012 03:36 PM, Mohamad Chaarawi wrote:
>>>>> Hi Peter,
>>>>>
>>>>> The problem does sound strange.
>>>>> I do not understand why file locking helped reduce errors. I thought
>>>>> you said each process writes to its own file, so locking the file or
>>>>> having one process manage the reads/writes should not matter anyway.
>>>>>
>>>>> Is it possible you could send me a piece of code from your simulation
>>>>> that performs the I/O, so I can look at it and diagnose further?
>>>>> A program that I can run and that replicates the problem (on Lustre)
>>>>> would be great. If that is not possible, then please just describe or
>>>>> copy-paste how you are calling into the HDF5 library for your I/O.
>>>>>
>>>>> Thanks,
>>>>> Mohamad
>>>>>
>>>>> On 11/18/2012 10:24 AM, Peter Boertz wrote:
>>>>>> Hello everyone,
>>>>>>
>>>>>> I run simulations on a cluster (using OpenMPI) with a Lustre
>>>>>> filesystem
>>>>>> and I use HDF5 1.8.9 for data output. Each process has its own
>>>>>> file, so
>>>>>> I believe there is no need for the parallel HDF5 version; is this
>>>>>> correct?
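>>>>>>
>>>>>> To make the setup concrete, each process derives its own filename from
>>>>>> its MPI rank, roughly like this (a simplified sketch, not the actual
>>>>>> code; the real naming scheme is different):
>>>>>>
>>>>>> ==================================================================
>>>>>>
>>>>>> // Sketch: one HDF5 file per MPI rank, so no two processes ever
>>>>>> // touch the same file and the serial HDF5 library should suffice.
>>>>>> #include <mpi.h>
>>>>>> #include <sstream>
>>>>>> #include "H5Cpp.h"
>>>>>>
>>>>>> int main(int argc, char** argv) {
>>>>>>     MPI_Init(&argc, &argv);
>>>>>>     int rank = 0;
>>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>
>>>>>>     std::ostringstream name;
>>>>>>     name << "output_rank" << rank << ".h5";
>>>>>>     H5::H5File file(name.str(), H5F_ACC_TRUNC);  // this rank's own file
>>>>>>     file.close();
>>>>>>
>>>>>>     MPI_Finalize();
>>>>>>     return 0;
>>>>>> }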
>>>>>>
>>>>>> When a larger number (> 4) of processes want to dump their data at
>>>>>> the same time, I get various errors: paths or objects not found, or
>>>>>> some other operation failing. I can't really make out the reason for
>>>>>> it, as the code works fine on my personal workstation and runs for
>>>>>> days with writes/reads every 5 minutes without failing.
>>>>>>
>>>>>> What I have tried so far is having one process manage all the
>>>>>> read/write operations, so that all other processes have to check
>>>>>> whether anyone else is already dumping their data. I also implemented
>>>>>> boost::interprocess::file_lock to prevent writing to the same file,
>>>>>> which is excluded by the queuing system anyway, so this was more of a
>>>>>> paranoid move to be absolutely sure. All that helped reduce the number
>>>>>> of fatal errors significantly, but did not completely get rid of them.
>>>>>> The biggest problem is that some of the files get corrupted when the
>>>>>> program crashes, which is especially inconvenient.
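>>>>>>
>>>>>> The locking guard is essentially this (a simplified sketch; the
>>>>>> lock-file path and the actual dump calls inside the guarded section
>>>>>> differ in my code):
>>>>>>
>>>>>> ==================================================================
>>>>>>
>>>>>> // Sketch of the paranoid file_lock guard around a dump.
>>>>>> #include <boost/interprocess/sync/file_lock.hpp>
>>>>>> #include <boost/interprocess/sync/scoped_lock.hpp>
>>>>>> #include <fstream>
>>>>>> #include <string>
>>>>>>
>>>>>> void DumpWithLock(const std::string& h5name) {
>>>>>>     const std::string lockname = h5name + ".lock";
>>>>>>     std::ofstream touch(lockname.c_str());   // ensure the lock file exists
>>>>>>     touch.close();
>>>>>>
>>>>>>     boost::interprocess::file_lock flock(lockname.c_str());
>>>>>>     boost::interprocess::scoped_lock<boost::interprocess::file_lock>
>>>>>>         guard(flock);                        // blocks until the lock is held
>>>>>>
>>>>>>     // ... open the HDF5 file and dump the data here ...
>>>>>> }                                            // lock released here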
>>>>>>
>>>>>> My question is whether there is any obvious mistake I am making, and
>>>>>> how I would go about solving this issue. My initial guess is that the
>>>>>> Lustre filesystem plays some role in this, since it is the only
>>>>>> difference from my personal computer, where everything runs smoothly.
>>>>>> As I said, neither the error messages nor the traceback show any
>>>>>> consistency.
>>>>>>
>>>>>> bye, Peter
>>>>>>
>>>>>>

