Hi Peter,

which version of Lustre are you using? We once observed very strange
corruption when we wrote HDF5 files to Lustre. That was seen with
Lustre client version 1.8.4. After we switched to 1.8.7, the
problem disappeared.

Cheers,
Andy


> -----Original Message-----
> From: Hdf-forum [mailto:[email protected]] On Behalf Of Peter
> Boertz
> Sent: Wednesday, November 21, 2012 12:46 AM
> Subject: Re: [Hdf-forum] Problems with HDF5 on a Lustre filesystem
> 
> Hi Mohamad,
> 
> thanks again for your help! Once the program failed, the structure was
> corrupted and h5dump was unable to read the file. Any other file that
> was also open at the time, but not touched by the crashing node,
> survived the crash intact. I have rewritten the dumping part and moved
> from multiple classes writing to the same file to having one class that
> stores all variables and dumps on demand, which appears to work better
> (tested with 192 nodes). In the previous approach, I opened and closed
> the file many times to add just one variable or an array of variables;
> maybe this was the problem? Although I still find it weird that it
> worked flawlessly on my personal computer ...
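For illustration, the "one class that stores all variables and dumps on demand" pattern could look roughly like this (a minimal sketch with invented names; the HDF5 write is stubbed out behind a callback so the idea stands alone):

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch, not the actual simulation code: every variable is
// staged in memory and written in a single open-write-close session,
// instead of one file open per variable.
class DumpBuffer {
public:
    void Stage(const std::string& path, const std::vector<int>& values) {
        staged_[path] = values;
    }

    // In the real code, this would open the HDF5 file once, write every
    // staged dataset, and close the file again. The writer callback
    // stands in for the HDF5 calls here.
    template <typename Writer>
    int DumpAll(Writer&& write_one) {
        for (const auto& entry : staged_) {
            if (write_one(entry.first, entry.second) != 0) return -1;
        }
        staged_.clear();  // everything was flushed in one session
        return 0;
    }

private:
    std::map<std::string, std::vector<int>> staged_;
};
```

The point of the design is that the number of file open/close cycles no longer scales with the number of variables, which matters on a filesystem where each open is expensive.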
> 
> Peter
> 
> 
> On 11/20/2012 03:29 PM, Mohamad Chaarawi wrote:
> > Hi Peter,
> >
> > Yes nothing seems unusual for me in the case that each process
> > accesses its own file.
> > Since you mentioned that this works on your local filesystem, did you
> > check the structure of the files and make sure that they are correct
> > there (using h5dump or other tools)?
> > Otherwise I'm not sure what could be wrong. I'm not familiar with C++
> > either, so someone else may have additional comments.
> >
> > Mohamad
> >
> > On 11/19/2012 11:11 AM, Peter Boertz wrote:
> >> Hi Mohamad,
> >>
> >> thanks for your reply. The reason I suspected Lustre of being the
> >> culprit is simply that the error does not appear on my personal
> >> computer. I thought that maybe the files are written/opened too fast,
> >> or too many at the same time, for the synchronization process of
> >> Lustre to handle.
> >>
> >> I am inserting various pieces of code that show how I am calling the
> >> HDF5 library. Any comment on proper ways of doing so is much
> >> appreciated!
> >>
> >> To open the file, I use the following code:
> >>
> >>
> >> ==================================================================
> >>
> >> int H5Interface::OpenFile (std::string filename, int flag) {
> >>
> >>      bool tried_once = false;
> >>
> >>      struct timespec timesp;
> >>      timesp.tv_sec = 0;
> >>      timesp.tv_nsec = 200000000;
> >>
> >>      for (int tries = 0; tries < 300; tries++) {
> >>          try {
> >>              H5::Exception::dontPrint();
> >>              if(flag == 0) {
> >>                  file = H5::H5File (filename, H5F_ACC_TRUNC);
> >>              } else if (flag == 1) {
> >>                  file.openFile(filename, H5F_ACC_RDONLY);
> >>              } else if (flag == 2) {
> >>                  file.openFile(filename, H5F_ACC_RDWR);
> >>              }
> >>
> >>              if (tried_once) {
> >>                  std::cout << "Opening " << filename
> >>                            << " succeeded after " << tries
> >>                            << " tries" << std::endl;
> >>              }
> >>              return 0;
> >>
> >>          } catch( FileIException error ) {
> >>              tried_once = true;
> >>          }
> >>
> >>          catch( DataSetIException error ) {
> >>              tried_once = true;
> >>          }
> >>
> >>          catch( DataSpaceIException error ) {
> >>              tried_once = true;
> >>          }
> >>          nanosleep(&timesp, NULL);
> >>      }
> >>      std::cerr << "H5Interface:\tOpening " << filename << " failed"
> >>                << std::endl;
> >>      return -1;
> >> }
> >>
> >> It often happens that opening a file succeeds only after 1 or 2 tries.
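As an aside, retrying with a fixed 200 ms sleep means that many ranks retry in lockstep, which can keep hammering the metadata server. Exponential backoff is one common alternative; a hypothetical helper (not part of the posted H5Interface, and written against a plain callable so it compiles without HDF5) might look like:

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: retry an operation with exponential backoff
// instead of a fixed sleep. The callable returns true on success
// (e.g. the H5File constructor did not throw). Returns true if the
// operation eventually succeeded within max_tries attempts.
bool RetryWithBackoff(const std::function<bool()>& op,
                      int max_tries = 10,
                      std::chrono::milliseconds initial_delay =
                          std::chrono::milliseconds(50)) {
    std::chrono::milliseconds delay = initial_delay;
    for (int tries = 0; tries < max_tries; ++tries) {
        if (op()) {
            return true;
        }
        std::this_thread::sleep_for(delay);
        delay *= 2;  // back off so a loaded server gets breathing room
    }
    return false;
}
```

With something like this, the open call in the try/catch block would simply be wrapped in a lambda that returns false when an exception is caught.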
> >>
> >> I write and read strings like this:
> >>
> >>
> >> ==================================================================
> >>
> >> int H5Interface::WriteString(std::string path, std::string value) {
> >>      try {
> >>          H5::Exception::dontPrint();
> >>          H5::StrType str_t(H5::PredType::C_S1, H5T_VARIABLE);
> >>          H5std_string str (value);
> >>          hsize_t dims[1] = { 1 };
> >>          H5::DataSpace str_space(1, dims);
> >>          H5::DataSet str_set;
> >>          if (H5Lexists(file.getId(), path.c_str(), H5P_DEFAULT)) {
> >>              str_set = file.openDataSet(path);
> >>          } else {
> >>              str_set = file.createDataSet(path, str_t, str_space);
> >>          }
> >>          str_set.write (str, str_t);
> >>          str_set.close();
> >>      }
> >>      catch( FileIException error ) {
> >>          // error.printError();
> >>          return -1;
> >>      }
> >>
> >>      catch( DataSetIException error ) {
> >>          // error.printError();
> >>          return -1;
> >>      }
> >>
> >>      catch( DataSpaceIException error ) {
> >>          // error.printError();
> >>          return -1;
> >>      }
> >>      return 0;
> >> }
> >>
> >>
> >> ==================================================================
> >>
> >>
> >> int H5Interface::ReadString(std::string path, std::string * data) {
> >>      try {
> >>          H5::Exception::dontPrint();
> >>          if (H5Lexists(file.getId(), path.c_str(), H5P_DEFAULT)) {
> >>              H5::StrType str_t(H5::PredType::C_S1, H5T_VARIABLE);
> >>              H5std_string str;
> >>              H5::DataSet str_set = file.openDataSet(path);
> >>              str_set.read (str, str_t);
> >>              str_set.close();
> >>              *data = std::string(str);
> >>          }
> >>      }
> >>      catch( FileIException error ) {
> >>          // error.printError();
> >>          return -1;
> >>      }
> >>
> >>      catch( DataSetIException error ) {
> >>          // error.printError();
> >>          return -1;
> >>      }
> >>
> >>      catch( DataSpaceIException error ) {
> >>          // error.printError();
> >>          return -1;
> >>      }
> >>      return 0;
> >> }
> >>
> >>
> >>
> >> And finally for writing and reading boost::multi_arrays, for example:
> >>
> >>
> >>
> >> ==================================================================
> >>
> >>
> >> int H5Interface::Read2IntMultiArray(std::string path,
> >>                                      boost::multi_array<int,2>& data) {
> >>      try {
> >>          H5::DataSet v_set = file.openDataSet(path);
> >>          H5::DataSpace space = v_set.getSpace();
> >>          hsize_t dims[2];
> >>
> >>          int rank = space.getSimpleExtentDims( dims );
> >>
> >>          H5::DataSpace mspace(rank, dims);
> >>          // Resize the target array and read directly into its
> >>          // contiguous storage; the stack-allocated variable-length
> >>          // array used before is not standard C++ and can overflow
> >>          // the stack for large datasets.
> >>          data.resize(boost::extents[dims[0]][dims[1]]);
> >>          v_set.read( data.data(), H5::PredType::NATIVE_INT,
> >>                      mspace, space );
> >>          v_set.close();
> >>      }
> >>      [...]
> >>
> >>
> >> ==================================================================
> >>
> >>
> >> int H5Interface::WriteIntMatrix(std::string path, uint rows,
> >>                                   uint cols, int * data) {
> >>      try {
> >>          H5::Exception::dontPrint();
> >>          hsize_t dims_m[2] = { rows, cols };
> >>          H5::DataSpace v_space (2, dims_m);
> >>          H5::DataSet v_set;
> >>          if (H5Lexists(file.getId(), path.c_str(), H5P_DEFAULT)) {
> >>              v_set = file.openDataSet(path);
> >>          } else {
> >>              v_set = file.createDataSet(path, H5::PredType::NATIVE_INT,
> >> v_space);
> >>          }
> >>          v_set.write(data, H5::PredType::NATIVE_INT);
> >>          v_set.close();
> >>      }
> >>      [...]
> >>
> >>
> >>
> >> As far as the workflow goes, a scheduler provides the basic h5 file
> >> with all the parameters and tells the workers to load this file and
> >> then put their measurements in. So they are enlarging the file as
> >> time goes by.
> >>
> >> Have a nice day, Peter
> >>
> >>
> >>
> >> On 11/19/2012 03:36 PM, Mohamad Chaarawi wrote:
> >>> Hi Peter,
> >>>
> >>> The problem does sound strange.
> >>> I do not understand why file locking helped reduce errors. I thought
> >>> you said each process writes to its own file anyway, so locking the
> >>> file or having one process manage the reads/writes should not matter.
> >>>
> >>> Is it possible you could send me a piece of code from your simulation
> >>> that is performing I/O, that I can look at and diagnose further?
> >>> A program that I can run and replicates the problem (on Lustre) would
> >>> be great. If that is not possible, then please just describe or
> >>> copy-paste how you are calling into the HDF5 library for your I/O.
> >>>
> >>> Thanks,
> >>> Mohamad
> >>>
> >>> On 11/18/2012 10:24 AM, Peter Boertz wrote:
> >>>> Hello everyone,
> >>>>
> >>>> I run simulations on a cluster (using OpenMPI) with a Lustre
> >>>> filesystem
> >>>> and I use HDF5 1.8.9 for data output. Each process has its own
> >>>> file, so
> >>>> I believe there is no need for the parallel HDF5 version, is this
> >>>> correct?
> >>>>
> >>>> When a larger number (> 4) of processes want to dump their data at
> >>>> the same time, I get various errors about paths and objects not
> >>>> being found, or some other operation failing. I can't really make
> >>>> out the reason for it, as the code works fine on my personal
> >>>> workstation and runs for days with writes/reads every 5 minutes
> >>>> without failing.
> >>>>
> >>>> What I have tried so far is having one process manage all the
> >>>> read/write operations, so that all other processes have to check
> >>>> whether anyone else is already dumping their data. I also
> >>>> implemented boost::interprocess::file_lock to prevent writing to
> >>>> the same file, which is excluded by the queuing system anyway, so
> >>>> this was more of a paranoid move to be absolutely sure. All that
> >>>> reduced the number of fatal errors significantly, but did not
> >>>> completely get rid of them. The biggest problem is that some of the
> >>>> files get corrupted when the program crashes, which is especially
> >>>> inconvenient.
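For reference, advisory locks such as boost::interprocess::file_lock are, as far as I know, only coherent across Lustre clients when the filesystem is mounted with the flock option; otherwise the locks may be local to each node and silently protect nothing. The same whole-file locking idea sketched with flock(2) directly (my own illustration with invented names, not the poster's code):

```cpp
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>
#include <string>

// Illustrative sketch: an advisory, whole-file exclusive lock held for
// the lifetime of the object, released in the destructor (RAII). On
// Lustre this only coordinates across nodes if the client mounts use
// the flock option; with localflock the lock is per-node only.
class ScopedFileLock {
public:
    explicit ScopedFileLock(const std::string& path)
        : fd_(open(path.c_str(), O_RDWR | O_CREAT, 0644)) {
        if (fd_ >= 0 && flock(fd_, LOCK_EX) != 0) {  // blocks until held
            close(fd_);
            fd_ = -1;
        }
    }
    bool locked() const { return fd_ >= 0; }
    ~ScopedFileLock() {
        if (fd_ >= 0) {
            flock(fd_, LOCK_UN);
            close(fd_);
        }
    }
private:
    int fd_;
};
```

Note that a lock like this cannot prevent the corruption described above if a process crashes mid-write; it only serializes access between well-behaved processes.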
> >>>>
> >>>> My question is whether there is any obvious mistake I am making,
> >>>> and how I would go about solving this issue. My initial guess is
> >>>> that the Lustre filesystem plays some role in this, since it is the
> >>>> only difference from my personal computer, where everything runs
> >>>> smoothly. As I said, neither the error messages nor the traceback
> >>>> show any consistency.
> >>>>
> >>>> bye, Peter
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> Hdf-forum is for HDF software users discussion.
> >>>> [email protected]
> >>>> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
> >>>
> >>
> >
> >
> 
> 
