Re: [Hdf-forum] mongodb compared to HDF5 ?

Gerd Heber Mon, 25 Feb 2013 06:17:00 -0800

Guillaume, how are you?

> Well, the life cycle of my data would mainly be archival and data lookup
> on conditions (like "retrieve every row where column B equals 12", much
> like what pyTables and MongoDB can do on queries).
> 1 sample can be from 4 bytes to 256 bytes (strings), it can be int, double
> or strings or else.


It's tempting to have the same storage layout for acquisition, archival, and
retrieval, and in simple cases this might even work. In general, though, it's
not such a great idea.

> If I understand correctly what you are saying, you think that there are
> not really sense of using MongoDB for time series as there would be not
> really sense of using HDF5 for storing documents?

That's one way of putting it. You can obviously mimic storing documents in
HDF5, the same way you can mimic storing time series in MongoDB. And
mimicking is good enough, sometimes. It really depends on what your
expectations for quality are.

> Where can I find examples for FastBit?

Here's a quotation from John Wu's earlier posting:

"Both FastQuery and FastBit are available in source code form

FastQuery http://codeforge.lbl.gov/projects/fastquery
FastBit http://codeforge.lbl.gov/projects/fastbit

Feel free to join FastBit mailing list
<https://hpcrdm.lbl.gov/pipermail/fastbit-users> to post your questions
regarding FastBit and FastQuery."

Best, G.


-----Original Message-----
From: Hdf-forum [mailto:[email protected]] On Behalf Of Gerd Heber
Sent: Sunday, February 24, 2013 18:18
To: 'HDF Users Discussion List'
Subject: Re: [Hdf-forum] mongodb compared to HDF5 ?


Guillaume, how are you? This is an interesting question, but there are
several omissions and assumptions that make it rather ill-posed.

The omissions have to do with what you didn't tell us (and I'll come back to
that in a moment).
The assumptions have to do with an unspecified base on which HDF5 and
MongoDB are comparable.
(I will not spend time to discuss this second point and only state that,
apart from trivial situations, there is no basis for such a comparison. HDF5
and MongoDB are two very different animals, which raises several interesting
possibilities of using them together. More on that soon...)

In any event, I suggest you spend some quality time with both candidates.
Have a look at PyTables, install MongoDB, and kick the tires. For
prototyping, both are fun to play with. For a production solution, you need
to ask and answer many more questions.

My first question for you would be, 'What's the life cycle of your data?'
You told us something about the acquisition, then what? (cleaning,
transformation, products, distribution, (re-)use, archival, any of those?)
What about the underlying model and the metadata that go with that?

At the indicated rate (1 sample every 10 ms, i.e., 100 samples per second),
you'll acquire about 3.6 million samples in 10 hours.
What's the size of an individual sample? How similar are individual samples?
By 'similar' I mean structure and value, i.e., how compressible are they?
Are they strings, or numbers disguised as strings?

How many JSON/BSON documents were you thinking about?
(MongoDB's current BSON document size limit is 16MB.) 
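
To put rough numbers on these questions, here is a back-of-the-envelope
sketch. It assumes the rate from the original question (1 sample every 10 ms
for 10 hours), the sample sizes mentioned in this thread (4 to 256 bytes),
and a 16 MB BSON limit taken as 16 * 1024 * 1024 bytes. The function names
are made up for illustration, and per-document overhead (field names, BSON
framing) is ignored, so the document count is only a lower bound.

```python
# Assumed BSON document size limit: 16 MB interpreted as 16 MiB.
BSON_LIMIT_BYTES = 16 * 1024 * 1024


def sample_count(period_ms=10, duration_h=10):
    """Number of samples acquired at one sample per period_ms."""
    return duration_h * 3600 * 1000 // period_ms


def raw_bytes(n_samples, sample_size_bytes):
    """Uncompressed payload size, ignoring any container overhead."""
    return n_samples * sample_size_bytes


def min_documents(total_bytes, limit=BSON_LIMIT_BYTES):
    """Lower bound on the number of documents needed to hold total_bytes."""
    return -(-total_bytes // limit)  # integer ceiling division


if __name__ == "__main__":
    n = sample_count()
    for size in (4, 256):
        total = raw_bytes(n, size)
        print(f"{n} samples x {size} B = {total} B, "
              f"at least {min_documents(total)} document(s)")
```

Under these assumptions, 10 hours of data is between roughly 14 MB and
920 MB of raw payload, which is small for HDF5 and would need on the order
of tens of maximum-size documents in the worst case.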

Do you need MongoDB sharding across instances on EC2?

How will your acquisition rate change in the future? (It for sure will go
up...)
How do you access the data? What are the interface constraints of your
clients?

In terms of raw read/write performance, I don't see a scenario where MongoDB
has a chance to beat HDF5. This doesn't mean that MongoDB couldn't be
sufficient for your purposes.

MongoDB lets you create indexes out-of-the-box. Plain HDF5 has no such
mechanism built-in.
(PyTables does and there are add-ons for HDF5 such as FastBit.)
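
For the PyTables route, a minimal sketch of an indexed lookup might look
like the following. The table layout (a timestamp plus an integer column
named b), the file path, and the synthetic data are all assumptions made up
for illustration; only the PyTables calls themselves (open_file,
create_table, create_index, read_where) are from the library.

```python
import os
import tempfile

import numpy as np
import tables


class Sample(tables.IsDescription):
    """Hypothetical row layout: acquisition time plus one integer column."""
    timestamp = tables.Float64Col(pos=0)
    b = tables.Int32Col(pos=1)


def build_and_query(path, n_rows=1000):
    """Write n_rows synthetic samples, index column b, return rows with b == 12."""
    with tables.open_file(path, mode="w") as h5:
        table = h5.create_table("/", "samples", Sample, expectedrows=n_rows)

        # Synthetic data: one sample every 10 ms, b cycling through 0..49.
        rows = np.zeros(n_rows, dtype=table.dtype)
        rows["timestamp"] = np.arange(n_rows) * 0.01
        rows["b"] = np.arange(n_rows) % 50
        table.append(rows)
        table.flush()

        # Build an index on column b so that conditions on b can use it.
        table.cols.b.create_index()

        # In-kernel query: "retrieve every row where column b equals 12".
        return table.read_where("b == 12")


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "samples.h5")
    hits = build_and_query(path)
    print(len(hits), "matching rows")
```

On a table this small the index makes no practical difference; it only
starts to pay off once the table is large enough that a full scan hurts.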

These are just a few pointers for your homework. Keep us posted on how
you're getting on! 

My parting comment would be this: If you're after building a long-term
archive of large time series data, the idea of using MongoDB strikes me as
rather silly.
It wasn't made for that, it's a document database, remember?
On the other hand, using MongoDB as the catalog for metadata and to publish
time series excerpts and aggregates is a perfectly sensible and efficient
solution.
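
As a rough sketch of that division of labor, the bulk samples would stay in
an HDF5 file while a small document per series goes into the catalog. The
field names, the series identifier, and the file path below are hypothetical;
the resulting dict is the kind of thing one could pass to a MongoDB driver
(for example pymongo's insert_one), which is not shown here.

```python
import json
from statistics import mean


def catalog_entry(series_id, h5_path, dataset, timestamps, values):
    """Build a small metadata document describing one stored time series.

    The document points at the HDF5 file and dataset holding the bulk data
    and carries only a few aggregates, so it stays far below any document
    size limit.
    """
    return {
        "_id": series_id,
        "file": h5_path,        # where the bulk data lives
        "dataset": dataset,     # path of the dataset inside the file
        "count": len(values),
        "t_start": timestamps[0],
        "t_end": timestamps[-1],
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


if __name__ == "__main__":
    entry = catalog_entry(
        "run-001", "/archive/run-001.h5", "/samples",
        timestamps=[0.00, 0.01, 0.02],
        values=[1.0, 2.0, 3.0],
    )
    print(json.dumps(entry, indent=2))
```

Clients that only need to find a series or look at its summary never touch
the HDF5 file; those that need the raw samples follow the pointer.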

Best, G.






From: Hdf-forum [mailto:[email protected]] On Behalf Of
guillaume
Sent: Saturday, February 23, 2013 2:20 PM
To: [email protected]
Subject: [Hdf-forum] mongodb compared to HDF5 ?

Hi everyone, I'm trying to find the best fit for time series data (a lot,
let's say 1 sample every 10 ms for 10 hours; samples are never updated, only
added and then read back) and I'd like your opinion on MongoDB compared to
HDF5. Which one is the best fit? Which one performs better? Any other
pros/cons for one or the other? Thanks a lot, Guillaume.


_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org


