We are trying to better understand the relative merits of the XML and
HDF5 file formats for a new project. Does anyone know of papers and/or
studies, either qualitative or quantitative, that looked at parameters
that might affect such a decision?
The project needs to store equipment sensor data covering specified time
periods, along with metadata about the data and the equipment. There will
be many thousands of files, which may contain binary data and matrices.
XML is the default selection, chiefly because it is ubiquitous and there
is a rich toolset supporting it; this translates directly to lower
development and maintenance costs. But as file sizes grow and the amount
of binary data and the number of matrices increase, XML becomes less
efficient to work with.
NOTE 1: Because XML can be compressed, resulting in much smaller file
sizes, for the purposes of our investigation we are treating compressed
XML as a separate file format, cXML.
NOTE 2: We plan to use Base64 encoding for binary data in XML.
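To make the trade-off behind the two notes concrete, here is a small sketch
using only Python's standard library (the 1 MB of random bytes is just a
worst-case stand-in for incompressible sensor data, not our actual data):

```python
import base64
import gzip
import os

# 1 MB of incompressible binary data (worst case for compression)
raw = os.urandom(1024 * 1024)

# Base64 inflates binary payloads by ~33% before they even enter the XML
encoded = base64.b64encode(raw)
print(len(encoded) / len(raw))  # ~1.33

# Gzipping the encoded text (the "cXML" case) recovers most of that
# overhead, but random data cannot shrink below its original entropy
compressed = gzip.compress(encoded)
print(len(compressed) / len(raw))
```

For real sensor data with structure and repetition, the cXML ratio would of
course come out well below what random bytes show here.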
Parameters we feel are important include:
1. Time to create the files.
2. File sizes.
3. Time to read the files.
Our plan is to generate fictitious but representative data files of
various sizes, with varying amounts of binary data and matrices, and
record the above parameters. Mapping this information onto our use cases
should then give us usable empirical data with which to make a
better-informed decision regarding file formats.
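A minimal sketch of the kind of harness we have in mind, covering the XML
and cXML cases with Python's standard library (the HDF5 leg would use h5py
and is omitted here; the element names, equipment IDs, and sizes are
placeholders, not our real schema):

```python
import base64
import gzip
import os
import time
import xml.etree.ElementTree as ET

def build_xml(num_matrices: int, matrix_bytes: int) -> bytes:
    """Build a representative file: metadata plus Base64-encoded matrices."""
    root = ET.Element("sensor_data", {"equipment": "unit-01", "period": "2010-01"})
    for i in range(num_matrices):
        m = ET.SubElement(root, "matrix", {"id": str(i)})
        m.text = base64.b64encode(os.urandom(matrix_bytes)).decode("ascii")
    return ET.tostring(root)

# Parameter 1: time to create the file
t0 = time.perf_counter()
doc = build_xml(num_matrices=10, matrix_bytes=100_000)
create_s = time.perf_counter() - t0

# Parameter 2: file sizes, XML vs cXML
xml_size, cxml_size = len(doc), len(gzip.compress(doc))

# Parameter 3: time to read (parse back and decode one matrix)
t0 = time.perf_counter()
parsed = ET.fromstring(doc)
_ = base64.b64decode(parsed.find("matrix").text)
read_s = time.perf_counter() - t0

print(f"create {create_s:.4f}s  xml {xml_size}B  cxml {cxml_size}B  read {read_s:.4f}s")
```

Sweeping num_matrices and matrix_bytes over our representative ranges, and
writing/reading actual files rather than in-memory buffers, would give the
table of measurements we want to map onto the use cases.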
The above study should also give us some insight into the technical
issues of supporting an HDF5 capability, which will need to be factored in.
Comments/thoughts on the above are appreciated.
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org