We are trying to better understand the relative merits of using the XML or HDF5 file formats for a new project. Does anyone know of papers and/or studies, either qualitative or quantitative, that examined the parameters that might affect such a decision?

The project needs to store equipment sensor data covering specified time periods, along with metadata about the data and equipment. There will be many thousands of files, which may contain binary data and matrices.

XML is the default selection, chiefly because it is ubiquitous and has a rich supporting toolset, which translates directly into lower development and maintenance costs. However, as file sizes grow and the amount of binary data and the number of matrices increase, XML becomes less efficient to work with.

NOTE 1: Because XML can be compressed to yield much smaller files, for the purposes of our investigation we are treating compressed XML as a separate file format, cXML.
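For concreteness, here is a minimal sketch of what we mean by cXML: the same serialized XML, run through gzip. The document content below is fabricated for illustration; repetitive sensor-style markup is the best case for compression, so real ratios will vary.

```python
import gzip

# Hypothetical sample: a small XML document with repetitive,
# sensor-reading-style content (best case for compression).
xml_text = "<samples>" + "".join(
    f"<s t='{i}' v='0.0'/>" for i in range(1000)
) + "</samples>"

raw = xml_text.encode("utf-8")
compressed = gzip.compress(raw)  # "cXML": the identical XML, gzip-compressed

print(f"raw={len(raw)} bytes, cXML={len(compressed)} bytes")
```

Note that cXML trades file size for extra CPU time on both write and read, which is exactly why we are measuring creation and read times separately from size.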

NOTE 2: We plan to use Base64 encoding for binary data stored in XML.
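One consequence worth folding into the size measurements: Base64 expands binary data by a factor of 4/3 (about 33%), since every 3 input bytes become 4 output characters. A quick sketch with a made-up payload:

```python
import base64

# Hypothetical stand-in for 1 KiB of binary sensor data.
payload = bytes(range(256)) * 4  # 1024 bytes

encoded = base64.b64encode(payload)
decoded = base64.b64decode(encoded)

# Encoded length is 4 * ceil(1024 / 3) = 1368 bytes, ~33% larger.
print(f"raw={len(payload)} bytes, base64={len(encoded)} bytes")
assert decoded == payload  # lossless round trip
```

This overhead partly cancels out under compression (cXML), since Base64 text compresses reasonably well, but that interaction is one of the things the empirical study should capture rather than assume.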

Parameters we feel are important include:

1. Time to create the files.
2. File sizes.
3. Time to read the files.

Our plan is to generate fictitious but representative data files of various sizes, with varying amounts of binary data and numbers of matrices, and to record the parameters above. Mapping this information to our use cases should then give us usable empirical data with which to make a better-informed decision about file formats.
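The measurement harness for the three parameters could be as simple as the sketch below. All names here are hypothetical; it shows only the XML side (stdlib, with the Base64 encoding from NOTE 2), but any other format's writer/reader pair, HDF5 included, would plug into the same benchmark function for a like-for-like comparison.

```python
import base64
import os
import tempfile
import time
import xml.etree.ElementTree as ET

def benchmark(write, read, path):
    """Time file creation, record file size, time reading the file back."""
    t0 = time.perf_counter()
    write(path)
    t_create = time.perf_counter() - t0
    size = os.path.getsize(path)
    t0 = time.perf_counter()
    read(path)
    t_read = time.perf_counter() - t0
    return t_create, size, t_read

# Hypothetical writer/reader pair: one binary blob plus a little metadata.
blob = os.urandom(100_000)  # stand-in for binary sensor data

def write_xml(path):
    root = ET.Element("record", sensor="s1", period="2024-01")
    ET.SubElement(root, "data").text = base64.b64encode(blob).decode("ascii")
    ET.ElementTree(root).write(path)

def read_xml(path):
    node = ET.parse(path).getroot().find("data")
    return base64.b64decode(node.text)

with tempfile.TemporaryDirectory() as d:
    t_create, size, t_read = benchmark(write_xml, read_xml, os.path.join(d, "r.xml"))
    print(f"create={t_create:.4f}s  size={size} B  read={t_read:.4f}s")
```

Sweeping the blob size and the number of matrices per file, and repeating each measurement to average out timing noise, would produce the grid of results to map onto our use cases.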

The above study should also give us some insight into the technical issues involved in supporting an HDF5 capability, which will need to be factored in.

Comments/thoughts on the above are appreciated.

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org