Tim,

I would agree with Gerd that this comparison is a bit of apples and oranges…

I do a lot of XML and, in fact, many people consider me to be an XML zealot, so 
I would agree that there are a lot of tools out there in XML Land. However, I 
am not familiar with any tools for dealing with binary data packed in XML (they 
may be there, but I am not familiar with them). The “available tools” point is, 
therefore, a bit hard to understand in this context…

You mention compression of XML. Gerd is correct that this is whole file 
compression. You need to uncompress the whole file in order to do anything with 
it. The compression approach used in HDF is much more intelligent. It 
compresses different datasets in the file independently and uncompresses only 
what you need. This optimizes file sizes and access speeds.
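To make the contrast concrete, here is a minimal pure-Python sketch of the idea (this is not the HDF5 API, just the principle): when each dataset is compressed independently, reading one dataset decompresses only that dataset, whereas a whole-file gzip/zip must be decompressed in its entirety first.

```python
import zlib

# Two "datasets" stored in one container.
temperature = b"21.5,21.7,21.6," * 1000
pressure = b"101.3,101.2,101.4," * 1000

# Whole-file style (like gzipping an XML document): one opaque blob.
whole = zlib.compress(temperature + pressure)

# HDF-style: compress each dataset independently.
container = {
    "temperature": zlib.compress(temperature),
    "pressure": zlib.compress(pressure),
}

# To read just the pressure data, only its blob is decompressed.
pressure_read = zlib.decompress(container["pressure"])
assert pressure_read == pressure

# With the whole-file approach, everything must be decompressed
# even if only one dataset is wanted.
everything = zlib.decompress(whole)
assert everything == temperature + pressure
```

HDF5 goes further still: it chunks each dataset, so even within one large dataset only the chunks you touch are decompressed.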

You also mention 1000’s of files. HDF would almost certainly give you more 
aggregation options than XML: groups within a file, and potentially virtual 
datasets that provide a single access framework across many files…

XML is really great for metadata and we are doing quite a bit of work with XML 
representations of the metadata in HDF files. This involves an HDF tool for 
extracting the metadata in XML for processing independent of the data. Gerd 
mentioned a couple of similar projects. I would add NeXus, which is doing quite 
a bit with XML and HDF (see 
http://download.nexusformat.org/doc/html/design.html and other related pages)…

Jim Collins has written about the “Tyranny of the Or” where organizations 
decide between X and Y. This contrasts with the “Power of the And”. I would 
encourage you to think about how XML and HDF can most effectively be used 
together rather than trying to choose between them…

Ted

By the way, you mentioned that you are storing sensor data. I worked with many 
sensor projects in NOAA and am curious about whether you are considering 
SensorML (http://www.opengeospatial.org/standards/sensorml) for your metadata.


On Jan 2, 2014, at 8:53 AM, Gerd Heber 
<[email protected]<mailto:[email protected]>> wrote:

Tim, Happy New Year! I'm not aware of any comparative study.
(It'd be comparing apples and oranges: HDF5 is a smart data container.
XML is a document/message format.) Please add it to the Mendeley HDF group
(http://www.mendeley.com/groups/3317921/hdf/papers/) if you happen to come
across something.

Have you considered a hybrid approach, e.g., XDMF or SDCubes?

http://www.mendeley.com/catalog/enhancements-extensible-data-model-format-xdmf/

http://www.mendeley.com/catalog/adaptive-informatics-multifactorial-highcontent-biological-data/

My main concern would be that a pure XML approach will force you to
reinvent (and maintain!) a lot of infrastructure in XML that's built into HDF5
and that's transparent to end users: Not only will it not perform at the level
HDF5 does, it'll also confuse your users. E.g., using base64-encoded,
compressed binary values is OK, as long as you always want to decompress the
entire value and not just subsets of it. Would you really want to mimic
chunking/tiling in XML?
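[The "entire value" point is easy to demonstrate with a few lines of standard-library Python. The element layout below is illustrative: to read even one element of a base64-encoded, compressed matrix, there is no way to seek; the whole value must be decoded and decompressed first.]

```python
import base64
import struct
import zlib

# A matrix of 10,000 little-endian doubles, as it might be packed
# into an XML text node.
values = [float(i) for i in range(10_000)]
raw = struct.pack("<10000d", *values)
encoded = base64.b64encode(zlib.compress(raw))  # what the XML would carry

# To read just element 42 there is no random access: the entire value
# must be base64-decoded and decompressed before any slicing.
raw_again = zlib.decompress(base64.b64decode(encoded))
element_42 = struct.unpack_from("<d", raw_again, 42 * 8)[0]
assert element_42 == 42.0
```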

Best, G.

-----Original Message-----
From: Hdf-forum [mailto:[email protected]] On Behalf Of Tim
Sent: Tuesday, December 31, 2013 5:06 PM
To: HDF Forum
Subject: [Hdf-forum] HDF5 vs. XML

We are trying to better understand the relative merits of using XML or
HDF5 file formats for a new project. Does anyone know of papers and/or studies, 
either qualitatively or quantitatively, that looked at parameters that might 
affect such a decision?

The project needs to store equipment sensor data covering specified time 
periods, along with metadata about the data and equipment. There will be many 
thousands of files, which may contain binary data and matrices.

XML is the default selection, chiefly because it is ubiquitous and there is a 
rich toolset supporting it. This translates directly to lower development and 
maintenance costs. But as file sizes, the amount of binary data, and the number 
of matrices increase, XML becomes less efficient to work with.

NOTE 1: because XML can be compressed, resulting in much smaller file sizes, 
for purposes of our investigation we are considering compressed XML as a 
different file format, cXML.

NOTE 2: we plan to use BASE64 encoding for XML binary data.
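One caveat worth folding into your size measurements: base64 inflates binary payloads by a third (4 output bytes per 3 input bytes), before any compression is considered. A quick standard-library check:

```python
import base64
import os

payload = os.urandom(3 * 100_000)   # 300 kB of binary data
encoded = base64.b64encode(payload)
overhead = len(encoded) / len(payload)
print(overhead)  # 4/3, i.e. ~33% larger
```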

Parameters we feel are important include:

1. Time to create the files.
2. File sizes.
3. Time to read the files.

Our plan is to generate fictitious but representative data files of various 
sizes, with varying amounts of binary data and matrices, and record the above 
parameters. Mapping this information to our use cases should then give us 
usable empirical data with which to make a better-informed decision regarding 
file formats.

The above study also gives us some insight into the technical issues related 
to supporting an HDF5 capability, which will need to be factored in.

Comments/thoughts on the above are appreciated.

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org



