Tim,

Happy New Year! I'm not aware of any comparative study. (It'd be comparing apples and oranges: HDF5 is a smart data container; XML is a document/message format.) If you happen to come across one, please add it to the Mendeley HDF group (http://www.mendeley.com/groups/3317921/hdf/papers/).
Have you considered a hybrid approach, e.g., XDMF or SDCubes?

http://www.mendeley.com/catalog/enhancements-extensible-data-model-format-xdmf/
http://www.mendeley.com/catalog/adaptive-informatics-multifactorial-highcontent-biological-data/

My main concern would be that a pure XML approach will force you to reinvent (and maintain!) a lot of infrastructure in XML that's built into HDF5 and that's transparent to end users. Not only will it not perform at the level HDF5 does, it'll also confuse your users. E.g., using base64-encoded, compressed binary values is OK as long as you always want to decompress the entire value and not just subsets of it. Would you really want to mimic chunking/tiling in XML?

Best, G.

-----Original Message-----
From: Hdf-forum [mailto:[email protected]] On Behalf Of Tim
Sent: Tuesday, December 31, 2013 5:06 PM
To: HDF Forum
Subject: [Hdf-forum] HDF5 vs. XML

We are trying to better understand the relative merits of using XML or HDF5 file formats for a new project. Does anyone know of papers and/or studies, either qualitative or quantitative, that looked at parameters that might affect such a decision?

The project needs to store equipment sensor data covering specified time periods, along with metadata about the data and equipment. There will be many thousands of files, which may contain binary data and matrices.

XML is the default selection, chiefly because it is ubiquitous and there is a rich toolset supporting it. This translates directly to lower development and maintenance costs. But as the file size, the amount of binary data, and the number of matrices increase, XML becomes less efficient to work with.

NOTE 1: because XML can be compressed, resulting in much smaller file sizes, for purposes of our investigation we are considering compressed XML as a different file format, cXML.
NOTE 2: we plan to use BASE64 encoding for XML binary data.

Parameters we feel are important include:
1. Time to create the files.
2. File sizes.
3. Time to read the files.
Our plan is to generate fictitious but representative data files of various sizes and amounts of binary data and matrices, and record the above parameters. Mapping this information to our use cases should then give us usable empirical data with which to make a better-informed decision regarding file formats.

The above study also provides us some insight into the technical issues related to supporting an HDF5 capability, which will need to be factored in.

Comments/thoughts on the above are appreciated.

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
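
For what it's worth, the measurement plan in the original message could start from something like the rough Python sketch below. It covers only the XML and cXML branches, using the standard library, and it works in memory, so disk I/O is not included in the timings. The function names (`make_samples`, `write_xml`, `read_xml`, `measure`), the element names (`record`, `samples`), and the `sensor-01` metadata value are all made up for illustration; nothing here reflects the project's actual schema.

```python
import array
import base64
import gzip
import time
import xml.etree.ElementTree as ET


def make_samples(n):
    """Return n fictitious float64 sensor samples as raw bytes."""
    return array.array("d", (0.001 * i for i in range(n))).tobytes()


def write_xml(raw, compress):
    """Serialize the payload as XML with BASE64 text; gzip it for cXML."""
    root = ET.Element("record", {"equipment": "sensor-01"})  # hypothetical metadata
    samples = ET.SubElement(root, "samples", {"dtype": "float64"})
    samples.text = base64.b64encode(raw).decode("ascii")
    doc = ET.tostring(root)  # bytes
    return gzip.compress(doc) if compress else doc


def read_xml(blob, compress):
    """Parse the document and decode the whole payload back to raw bytes."""
    doc = gzip.decompress(blob) if compress else blob
    return base64.b64decode(ET.fromstring(doc).find("samples").text)


def measure(n, compress):
    """Return (create_seconds, size_bytes, read_seconds) for one format."""
    raw = make_samples(n)
    t0 = time.perf_counter()
    blob = write_xml(raw, compress)
    t1 = time.perf_counter()
    restored = read_xml(blob, compress)
    t2 = time.perf_counter()
    assert restored == raw  # the round trip must be lossless
    return t1 - t0, len(blob), t2 - t1


if __name__ == "__main__":
    for n in (1_000, 100_000):
        for label, compress in (("XML", False), ("cXML", True)):
            create_s, size, read_s = measure(n, compress)
            print(f"{label:5s} n={n:>7d} create={create_s:.4f}s "
                  f"size={size}B read={read_s:.4f}s")
```

Note that `read_xml` has no choice but to decode the entire value, which is the subset-access concern raised in the reply. An HDF5 branch (for example via h5py) could be added to `measure` in the same way for a side-by-side comparison.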
