[Pytables-users] pytables or pyroot?

David Fokkema Tue, 28 Jul 2009 12:32:35 -0700

Hi list,

I've been using pytables for offline analysis for a while now. The workflow
is simple: I extract a set of data from a database, store it in hdf5
using pytables, and start doing analysis work. The thing is: database
performance is breaking down now that we have about 100M events stored.
After two years of patching indexes, queries, mysql settings and
things like that, we're increasingly worried about using a relational
database for data storage in the first place. Furthermore, when
extracting real data, queries get horribly complex and the data must be
postprocessed before it can be useful. Once stored in pytables,
retrieving data is, of course, very easy.


We've asked around what large experiments (LHC experiments like ATLAS)
are using and they are _not_ using databases for storage. That is expected
since a single event could take up on the order of a hundred MB. The
point is that they are very happy with using ROOT for data storage. ROOT
is the analysis framework used by most high energy physicists and is
especially adapted to be used for data storage as well. However, not
everyone is happy with ROOT. Criticism mainly concerns the complexity of
ROOT and the cleanliness of the design.

For python users, there is pyROOT. Of course, we know and love pytables.
We're going to test several things, but I'd like to have your thoughts
on whether pytables is a sane choice for semi-large-scale data
storage. Our requirements are:

- Data is sent over http and received by python scripts running behind
apache. We need concurrency (no problem for mysql).
- Each detector station sends about 40k events per day.
- Within a year or two, we need to be able to handle about 100 detector
stations, making this 4M events per day.
- Each event is about 12k
- It should be relatively easy to access all data from one detector on a
particular day
- It should be relatively easy to search for coincidences between
detector stations, based on timestamps. That is, retrieving all
timestamps from all detector stations on a particular day should be
easy.
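
To make the coincidence requirement concrete, here is a minimal sketch in plain Python. It assumes that each station's timestamps for one day have already been read into a list of integers (nanoseconds here) and that a coincidence means events from at least two different stations within a fixed time window, measured from the first event of the group. The function name, the station numbers and the window value are hypothetical and only serve the illustration; this is not an existing pytables or ROOT API.

```python
import heapq


def find_coincidences(timestamps_by_station, window):
    """Group events from different stations that lie within `window`.

    timestamps_by_station: dict mapping a station id to a list of timestamps
    window: maximum time difference, measured from the first event of a group
    Returns a list of groups; each group is a list of (timestamp, station).
    """
    # One time-ordered stream of (timestamp, station) tuples per station.
    streams = [
        [(t, station) for t in sorted(times)]
        for station, times in timestamps_by_station.items()
    ]
    # Merge the sorted streams into a single time-ordered list.
    merged = list(heapq.merge(*streams))

    coincidences = []
    i = 0
    while i < len(merged):
        # Extend the group while events stay within the window of event i.
        j = i + 1
        while j < len(merged) and merged[j][0] - merged[i][0] <= window:
            j += 1
        group = merged[i:j]
        if len({station for _, station in group}) >= 2:
            coincidences.append(group)
            i = j  # skip the events already used in this group
        else:
            i += 1
    return coincidences


# Hypothetical example: three stations, timestamps in nanoseconds.
day = {
    501: [1_000_000_000, 5_000_000_000, 9_000_000_000],
    502: [1_000_001_000, 7_000_000_000],
    503: [9_000_000_500],
}
result = find_coincidences(day, window=2_000)
```

With this approach the storage layer only has to deliver one sorted timestamp column per station per day; the merge itself is cheap compared to reading full 12k events.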

It is possible to have a relational database containing metadata on top
of the low-level data storage. In fact, that's how ATLAS manages things.

When using pytables, what are your thoughts on the size of individual
files? One file per day? One file per detector per day? One file per
apache thread per day? The last option is probably the easiest to
implement (no need to worry about several threads accessing the same
file) but would probably make it hard to quickly access one detector's
data because it would be spread over separate files.
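
For a feeling of the file sizes involved, the figures from the requirements list can be multiplied out. This is only a back-of-the-envelope sketch: it assumes that "12k" means 12 kB (12,000 bytes) per event and ignores compression and any hdf5 overhead, so real files could differ considerably.

```python
# Figures taken from the requirements list.
EVENTS_PER_STATION_PER_DAY = 40_000
STATIONS = 100
# Assumption: "12k" is read as 12 kB = 12,000 bytes per event.
EVENT_SIZE_BYTES = 12_000


def daily_bytes(n_stations):
    """Raw (uncompressed) data volume per day for n_stations stations."""
    return n_stations * EVENTS_PER_STATION_PER_DAY * EVENT_SIZE_BYTES


per_station_file = daily_bytes(1)      # one file per detector per day
per_day_file = daily_bytes(STATIONS)   # one file per day, all detectors

print(per_station_file / 1e6, "MB per detector per day")
print(per_day_file / 1e9, "GB per day")
```

Under these assumptions, one file per detector per day comes to roughly 480 MB, and one file per day for all 100 stations to roughly 48 GB.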

Your input is very much appreciated!

Thanks,

David


