Guillaume, how are you? This is an interesting question, but there are
several omissions and assumptions that make it rather ill-posed.

The omissions have to do with what you didn't tell us (and I come back to
that in a moment). The assumptions have to do with an unspecified basis on
which HDF5 and MongoDB are comparable. (I won't spend time discussing this
second point; I'll only state that, apart from trivial situations, there
is no basis for such a comparison. HDF5 and MongoDB are two very different
animals, which raises several interesting possibilities of using them
together. More on that below...)

In any event, I suggest you spend some quality time with both candidates.
Have a look at PyTables, install MongoDB, and kick the tires. For
prototyping,
both are fun to play with. For a production solution, you need to ask and
answer
many more questions.
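
For the tire-kicking, something like this would get you going. (An
untested sketch: it assumes PyTables and pymongo are installed and a
MongoDB server is running on localhost; all file, database, and field
names are made up.)

import numpy as np
import tables
from pymongo import MongoClient

# HDF5 via PyTables: an extendable array of float64 samples.
with tables.open_file("samples.h5", mode="w") as h5:
    arr = h5.create_earray(h5.root, "samples",
                           atom=tables.Float64Atom(), shape=(0,))
    arr.append(np.random.random(100000))  # append one batch of samples

# MongoDB via pymongo: one document per batch of samples.
coll = MongoClient("localhost", 27017).tsdb.samples
coll.insert_one({"t0_ms": 0, "dt_ms": 10,
                 "values": np.random.random(1000).tolist()})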

My first question for you would be, 'What's the life cycle of your data?'
You told us something about the acquisition; then what? (Cleaning,
transformation, products, distribution, (re-)use, archival, any of those?)
What about the underlying model and the metadata that go with it?

At the indicated rate of 1 sample every 10 ms, i.e., 100 samples per
second, you'll acquire about 3.6 million samples in 10 hours (100 x
36,000 seconds). What's the size of an individual sample? How similar are
individual samples? By 'similar' I mean structure and value, i.e., how
compressible are they? Are they strings, or numbers disguised as strings?
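
If you want a quick handle on compressibility, write the same data twice
with PyTables, once raw and once compressed, and compare file sizes. (A
sketch under the same assumptions as above; the zlib level is an arbitrary
choice, and random doubles, unlike real signals, will hardly compress.)

import os
import numpy as np
import tables

data = np.random.random(1000000)  # substitute a real batch of samples

for name, filters in [("raw.h5", None),
                      ("zlib.h5", tables.Filters(complevel=5,
                                                 complib="zlib"))]:
    with tables.open_file(name, mode="w") as h5:
        ca = h5.create_carray(h5.root, "samples",
                              atom=tables.Float64Atom(),
                              shape=data.shape, filters=filters)
        ca[:] = data

for name in ("raw.h5", "zlib.h5"):
    print(name, os.path.getsize(name), "bytes")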

How many JSON/BSON documents were you thinking about?
(MongoDB's current BSON document size limit is 16 MB.)
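
If a whole series would blow that limit, the usual move is to bucket a
fixed number of samples per document. (A sketch with invented names; the
bucket size is arbitrary, just keep each document well under 16 MB.)

from pymongo import MongoClient

coll = MongoClient("localhost", 27017).tsdb.buckets
SAMPLES_PER_DOC = 10000  # ~80 KB of raw doubles, far below the 16 MB cap

def store(values, t0_ms, dt_ms=10):
    """Split one acquisition run (a list of floats) into bucket docs."""
    coll.insert_many(
        {"t0_ms": t0_ms + i * dt_ms, "dt_ms": dt_ms,
         "values": values[i:i + SAMPLES_PER_DOC]}
        for i in range(0, len(values), SAMPLES_PER_DOC))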

Do you need MongoDB sharding across instances on EC2?

How will your acquisition rate change in the future? (It will surely go
up...) How do you access the data? What are the interface constraints of
your clients?

In terms of raw read/write performance, I don't see a scenario where
MongoDB has a chance of beating HDF5. That doesn't mean MongoDB couldn't
be sufficient for your purposes.
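
Don't take my word for it, though: a crude timing sketch like the one
below (same assumptions as before, and a micro-benchmark at best) will
tell you more about your setup than any forum post.

import time
import numpy as np
import tables
from pymongo import MongoClient

data = np.random.random(3600000)  # ~10 hours at 100 samples/s

t = time.time()
with tables.open_file("bench.h5", mode="w") as h5:
    h5.create_array(h5.root, "samples", data)
print("HDF5 write:", time.time() - t, "s")

coll = MongoClient("localhost", 27017).bench.samples
t = time.time()
coll.insert_many({"i": i, "values": chunk.tolist()}
                 for i, chunk in enumerate(np.array_split(data, 360)))
print("MongoDB write:", time.time() - t, "s")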

MongoDB lets you create indexes out of the box; plain HDF5 has no such
mechanism built in. (PyTables does, and there are add-ons for HDF5 such
as FastBit.)
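
To make that concrete (a sketch; the table layout and field names are
invented): in MongoDB a secondary index is one call on the collection,
while in PyTables you index a column of a Table.

import tables
from pymongo import MongoClient

# MongoDB: one call creates a secondary index on a field.
MongoClient("localhost", 27017).tsdb.buckets.create_index("t0_ms")

# PyTables: indexes live on Table columns (not on plain arrays).
class Sample(tables.IsDescription):
    t_ms = tables.Int64Col()
    value = tables.Float64Col()

with tables.open_file("indexed.h5", mode="w") as h5:
    tbl = h5.create_table(h5.root, "samples", Sample)
    tbl.cols.t_ms.create_index()  # OPSI index on that column
    hits = [r["value"] for r in tbl.where("t_ms > 1000")]  # indexed query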

These are just a few pointers for your homework. Keep us posted on how
you're getting on! 

My parting comment would be this: if you're after building a long-term
archive of large time series data, the idea of using MongoDB strikes me
as rather silly. It wasn't made for that; it's a document database,
remember? On the other hand, using MongoDB as the catalog for metadata
and as a way to publish time series excerpts and aggregates is a
perfectly sensible and efficient solution.
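
Something along these lines, say. (A sketch of the idea with invented
names: the bulk data stays in the HDF5 file, MongoDB holds the catalog
entry, an aggregate, and a small excerpt to publish.)

import tables
from pymongo import MongoClient

catalog = MongoClient("localhost", 27017).catalog.datasets

with tables.open_file("samples.h5", mode="r") as h5:
    data = h5.root.samples[:]  # bulk samples stay in HDF5

catalog.insert_one({
    "file": "samples.h5",            # pointer into the HDF5 archive
    "node": "/samples",
    "n_samples": int(data.size),
    "mean": float(data.mean()),      # aggregate, cheap to query later
    "excerpt": data[:100].tolist(),  # small preview to publish
})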

Best, G.

From: Hdf-forum [mailto:[email protected]] On Behalf Of guillaume
Sent: Saturday, February 23, 2013 2:20 PM
To: [email protected]
Subject: [Hdf-forum] mongodb compared to HDF5 ?

Hi everyone, I'm trying to find the best fit for time series data (a lot,
let's say 1 sample every 10 ms for 10 hours, which are never updated, only
added and then read back) and I'd like your opinion on mongodb compared to
HDF5. Which one is the best fit? Which one is more performant? Any other
pros/cons for one or the other? Thanks a lot, Guillaume.