Hi list,

I've been using pytables for offline analysis for a while now. The workflow is simple: I extract a set of data from a database, store it in hdf5 using pytables, and start doing analysis work. The thing is: database performance is breaking down now that we have about 100M events stored. After two years of patching indexes, queries, mysql settings and the like, we're increasingly worried about using a relational database for data storage in the first place. Furthermore, the queries for extracting real data get horribly complex, and the data must be postprocessed before it is useful. Once stored in pytables, retrieving data is, of course, very easy.
We've asked around what large experiments (LHC experiments like ATLAS) are using, and they are _not_ using databases for storage. That is expected, since a single event could take up in the order of a hundred Mb. The point is that they are very happy with using ROOT for data storage. ROOT is the analysis framework used by most high energy physicists and is especially adapted to be used for data storage as well. However, not everyone is happy with ROOT. Criticism mainly concerns the complexity of ROOT and the cleanliness of the design. For python users, there is pyROOT.

Of course, we know and love pytables. We're going to test several things, but I'd like to have your thoughts on whether pytables is a sane choice for semi-large scale data storage. Our requirements are:

- Data is sent over http and received by python scripts running behind apache. We need concurrency (no problem for mysql)
- Each detector station sends about 40k events per day.
- Within a year or two, we need to be able to handle about 100 detector stations, making this 4M events per day.
- Each event is about 12k
- It should be relatively easy to access all data from one detector on a particular day
- It should be relatively easy to search for coincidences between detector stations, based on timestamps. That is, retrieving all timestamps from all detector stations on a particular day should be easy.

It is possible to have a relational database containing metadata on top of the low-level data storage. In fact, that's how ATLAS manages things.

When using pytables, what are your thoughts on the size of individual files? One file per day? One file per detector per day? One file per apache thread per day? The last option is probably the easiest to implement (no need to worry about several threads accessing the same file), but it would probably make it hard to quickly access one detector's data, because that data would be spread over separate files.

Your input is very much appreciated!
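For a rough sense of scale, here is a back-of-the-envelope sketch of what the numbers above would mean for each of the three file layouts. It assumes that "12k" means 12,000 bytes per event and ignores HDF5 overhead and compression. The number of apache workers is a made-up placeholder, and the function name is only for illustration.

```python
EVENTS_PER_STATION_PER_DAY = 40_000  # from the requirements above
STATIONS = 100                       # expected within a year or two
BYTES_PER_EVENT = 12_000             # ASSUMPTION: "12k" = 12,000 bytes
APACHE_WORKERS = 8                   # ASSUMPTION: hypothetical placeholder


def estimate_file_sizes(events_per_station=EVENTS_PER_STATION_PER_DAY,
                        stations=STATIONS,
                        bytes_per_event=BYTES_PER_EVENT,
                        workers=APACHE_WORKERS):
    """Return raw data volumes in bytes for the candidate file layouts."""
    events_per_day = events_per_station * stations
    bytes_per_day = events_per_day * bytes_per_event
    return {
        "events_per_day": events_per_day,
        # One file holding every station's events for a single day.
        "one_file_per_day": bytes_per_day,
        # One file per detector station per day.
        "one_file_per_station_per_day": events_per_station * bytes_per_event,
        # One file per apache worker per day, assuming an even spread.
        "one_file_per_worker_per_day": bytes_per_day // workers,
        # Everything written in a 365-day year, whatever the layout.
        "all_files_per_year": bytes_per_day * 365,
    }


sizes = estimate_file_sizes()
for name, value in sizes.items():
    if name == "events_per_day":
        print(f"{name}: {value:,} events")
    else:
        # Decimal gigabytes (1 GB = 10**9 bytes).
        print(f"{name}: {value / 10**9:,.2f} GB")
```

Under these assumptions, one file per day would hold about 48 GB, one file per detector per day about 480 MB, and a full year would add up to roughly 17.5 TB of raw event data.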
Thanks,
David