Ted,
Thanks for your thoughts, especially the HDF5 metadata extraction
projects and the point about avoiding a false dichotomy between XML
and HDF5.
Tim.
On 1/2/14 5:07 PM, Ted Habermann wrote:
Tim,
I would agree with Gerd that this comparison is a bit of
apples and oranges…
I do a lot of XML and, in fact, many people consider me to be
an XML zealot, so I would agree that there are a lot of tools
out there in XML Land. However, I am not familiar with any tools
for dealing with binary data packed in XML (they may be there,
but I am not familiar with them). The “available tools” point
is, therefore, a bit hard to understand in this context…
You mention compression of XML. Gerd is correct that this is
whole-file compression: you need to decompress the entire file in
order to do anything with it. The compression approach used in HDF
is much smarter. It compresses each dataset in the file independently
(chunk by chunk, in fact) and decompresses only what you need, which
keeps file sizes small without giving up fast, selective access.
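For instance, with the h5py Python bindings (the file and dataset
names here are just illustrative), a slice read only decompresses
the chunks it touches:

    import numpy as np
    import h5py

    # Write a large dataset with per-chunk gzip compression.
    with h5py.File("sensors.h5", "w") as f:
        f.create_dataset("voltage", data=np.random.rand(1_000_000),
                         chunks=(10_000,), compression="gzip")

    # Reading a small slice decompresses only the chunks that
    # slice overlaps, never the whole file.
    with h5py.File("sensors.h5", "r") as f:
        window = f["voltage"][42_000:43_000]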
You also mention thousands of files. HDF would almost certainly
give you more aggregation options than XML: groups, and potentially
virtual datasets, provide an access framework
for collections of files…
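As a rough sketch of the virtual dataset idea (it requires HDF5
1.10+, and the per-day file layout below is invented purely for
illustration):

    import h5py

    # Assume four daily files, each holding a 1-D dataset
    # "readings" of 100 samples.
    layout = h5py.VirtualLayout(shape=(4, 100), dtype="f8")
    for day in range(4):
        layout[day] = h5py.VirtualSource(f"day_{day}.h5",
                                         "readings", shape=(100,))

    # The aggregate file stores no sample data itself; reads
    # are forwarded to the source files on demand.
    with h5py.File("all_days.h5", "w") as f:
        f.create_virtual_dataset("readings", layout, fillvalue=-1)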
XML is really great for metadata, and we are doing quite a bit
of work with XML representations of the metadata in HDF files.
This involves an HDF tool for extracting the metadata as XML so it
can be processed independently of the data. Gerd mentioned a couple
of similar projects. I would add NeXus, which is doing quite a bit
with XML and HDF (see http://download.nexusformat.org/doc/html/design.html
and other related pages)…
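h5dump can already emit an XML view of a file, and the basic idea
is easy to sketch in Python (the element names below are made up
for illustration):

    import xml.etree.ElementTree as ET
    import h5py

    # Walk an HDF5 file and copy its attributes (the metadata)
    # into an XML tree, leaving the bulky arrays in the file.
    root = ET.Element("hdf5-metadata", file="sensors.h5")

    def dump_attrs(name, obj):
        node = ET.SubElement(root, "object", path=name)
        for key, value in obj.attrs.items():
            ET.SubElement(node, "attribute", name=key).text = str(value)

    with h5py.File("sensors.h5", "r") as f:
        f.visititems(dump_attrs)

    ET.ElementTree(root).write("sensors-metadata.xml")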
Jim Collins has written about the “Tyranny of the Or”, where
organizations feel forced to decide between X and Y. This contrasts
with the “Genius of the And”. I would encourage you to think about
how XML and HDF can most effectively be used together rather than
trying to choose between them…
Ted
Tim, Happy New Year! I'm not aware of any comparative study. (It'd
be comparing apples and oranges: HDF5 is a smart data container;
XML is a document/message format.) Please add it to the Mendeley
HDF group (http://www.mendeley.com/groups/3317921/hdf/papers/) if
you happen to come across something.
Have you considered a hybrid approach, e.g., XDMF or
SDCubes?
http://www.mendeley.com/catalog/enhancements-extensible-data-model-format-xdmf/
http://www.mendeley.com/catalog/adaptive-informatics-multifactorial-highcontent-biological-data/
My main concern would be that a pure XML approach will force you to
reinvent (and maintain!) a lot of infrastructure in XML that's built
into HDF5 and transparent to end users: not only will it not perform
at the level HDF5 does, it'll also confuse your users. E.g., using
base64-encoded, compressed binary values is OK, as long as you
always want to decompress the entire value and not just subsets of
it. Would you really want to mimic chunking/tiling in XML?
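To make that concrete with a small, purely illustrative Python
sketch (sizes and names invented): pulling a single value out of a
compressed, base64-encoded blob still costs a full decode and
decompress.

    import base64, zlib
    import numpy as np

    values = np.arange(1_000_000, dtype="f8")

    # XML side: the whole array becomes one opaque text blob.
    blob = base64.b64encode(zlib.compress(values.tobytes())).decode()

    # Reading back even one element means decoding and
    # decompressing the entire blob first.
    raw = zlib.decompress(base64.b64decode(blob))
    one_value = np.frombuffer(raw, dtype="f8")[123_456]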
Best, G.
-----Original Message-----
From: Hdf-forum
[mailto:[email protected]] On Behalf Of
Tim
Sent: Tuesday, December 31, 2013 5:06 PM
To: HDF Forum
Subject: [Hdf-forum] HDF5 vs. XML
We are trying to better understand the relative merits of using the
XML or HDF5 file formats for a new project. Does anyone know of
papers and/or studies, qualitative or quantitative, that looked at
parameters that might affect such a decision?
The project needs to store equipment sensor data covering
specified time periods, along with metadata about the data
and equipment. There will be many thousands of files, which may
contain binary data and matrices.
XML is the default selection, chiefly because it is ubiquitous and
there is a rich toolset supporting it. This translates directly
into lower development and maintenance costs. But as file sizes,
the amount of binary data, and the number of matrices increase,
XML becomes less efficient to work with.
NOTE 1: because XML can be compressed, resulting in much smaller
file sizes, for the purposes of our investigation we are treating
compressed XML as a distinct file format, cXML.
NOTE 2: we plan to use Base64 encoding for binary data in XML.
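For illustration, a minimal sketch of what NOTE 2 implies, using
only the Python standard library plus NumPy (element names are
invented):

    import base64
    import xml.etree.ElementTree as ET
    import numpy as np

    samples = np.random.rand(1000)

    # The binary payload must become text to live inside XML;
    # base64 inflates it by roughly one third.
    elem = ET.Element("samples", dtype="f8", count="1000")
    elem.text = base64.b64encode(samples.tobytes()).decode("ascii")
    ET.ElementTree(elem).write("samples.xml")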
Parameters we feel are important include:
1. Time to create the files.
2. File sizes.
3. Time to read the files.
Our plan is to generate fictitious but representative data files of
various sizes, with varying amounts of binary data and matrices,
and to record the above parameters. Mapping this information to our
use cases should then give us usable empirical data with which to
make a better-informed decision regarding file formats.
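A minimal harness for those three parameters might look like the
following; the writer/reader callables are placeholders for our own
format-specific code:

    import os
    import time

    def benchmark(label, write, read, path):
        t0 = time.perf_counter()
        write(path)                        # 1. time to create
        t_create = time.perf_counter() - t0
        size = os.path.getsize(path)       # 2. file size
        t0 = time.perf_counter()
        read(path)                         # 3. time to read
        t_read = time.perf_counter() - t0
        print(f"{label}: create={t_create:.3f}s "
              f"size={size} B read={t_read:.3f}s")

    # e.g. benchmark("cXML", write_cxml, read_cxml, "test.xml.gz")
    #      benchmark("HDF5", write_hdf5, read_hdf5, "test.h5")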
The above study should also give us some insight into the
technical issues involved in supporting an HDF5 capability,
which will need to be factored in.
Comments/thoughts on the above are appreciated.
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org