Hi David,

On Tuesday 28 July 2009 21:14:18, David Fokkema wrote:
> Hi list,
>
> I've been using pytables for offline analysis for a while now. Workflow
> is simple: I extract a set of data from a database and store it in hdf5
> using pytables and start doing analysis work. The thing is: database
> performance is breaking down now that we have about 100M events stored,
> and after two years of patching indexes, queries, mysql settings and
> things like that, we're increasingly worried about using a relational
> database for data storage in the first place. Furthermore, when
> extracting real data, queries get horribly complex and the data must be
> postprocessed before it can be useful. Once stored in pytables,
> retrieving data is, of course, very easy.
>
> We've asked around about what large experiments (LHC experiments like
> ATLAS) are using, and they are _not_ using db's for storage. That is
> expected, since a single event could take up on the order of a hundred
> MB. The point is that they are very happy with using ROOT for data
> storage. ROOT is the analysis framework used by most high energy
> physicists and is especially adapted to be used for data storage as
> well. However, not everyone is happy with ROOT. Criticism mainly
> concerns the complexity of ROOT and the cleanliness of the design.
>
> For python users, there is pyROOT. Of course, we know and love pytables.
> We're going to test several things, but I'd like to have your thoughts
> on the question of whether pytables is a sane choice for semi-large
> scale data storage. Our requirements are:
>
> - Data is sent over http and received by python scripts running behind
>   apache. We need concurrency (no problem for mysql)
If you need concurrency for writing, you can always set up a data
collector that gathers info from the several threads by using a
`Queue.Queue()` container. As it is thread safe, you don't have to worry
about concurrency problems.

> - Each detector station sends about 40k events per day.
> - Within a year or two, we need to be able to handle about 100 detector
>   stations, making this 4M events per day.
> - Each event is about 12k

Well, 4M events * 12 KB makes around 48 GB per day. Provided that
PyTables can write at full disk speed (if used correctly), say 500 MB/s
on a decent RAID, it can write this info in under 2 minutes, so I would
not say that this is a problem at all. You will only need to make sure
that your system has a decent amount of memory, so that the queue object
can act as a buffer with enough capacity to cope with data bunches.

> - It should be relatively easy to access all data from one detector on
>   a particular day
> - It should be relatively easy to search for coincidences between
>   detector stations, based on timestamps. That is, retrieving all
>   timestamps from all detector stations on a particular day should be
>   easy.

My preference here would be to use a monolithic table to save your daily
observations, and then make use of the indexing capabilities of PyTables
Pro for locating and retrieving your data quickly. If you can't afford
buying Pro, perhaps you can split your data into several tables (hourly
tables?), set up some search code that selects the appropriate table,
and then do an in-kernel query on the interesting table.

> It is possible to have a relational database containing metadata on top
> of the low-level data storage. In fact, that's how ATLAS manages things.

Exactly. This is why I like to call PyTables a relational database
*teammate*, not a competitor:

http://pytables.org/moin/FAQ#IsPyTablesareplacementforarelationaldatabase.3F
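By the way, to make the collector idea above more concrete, here is a
minimal, untested sketch: several producer threads (standing in for your
apache threads) put events on a bounded queue, and a single consumer
thread is the only one that ever writes. The `queue` module is the
Python 3 spelling of `Queue`, and the `storage` list is just a stand-in
for an open PyTables table, where you would call `table.append()` and
`table.flush()` instead:

```python
import threading
import queue  # spelled `Queue` in the Python 2 of this era

# Hypothetical event record: (station_id, timestamp, payload).
event_queue = queue.Queue(maxsize=100000)  # bounded: acts as a memory buffer

SENTINEL = None  # signals the collector to stop


def collector(storage):
    """Single writer: drains the queue so only one thread touches the file.

    `storage` stands in for a PyTables table; there you would batch rows
    and call table.append(batch) / table.flush() instead of list.append().
    """
    while True:
        event = event_queue.get()
        if event is SENTINEL:
            break
        storage.append(event)  # the only place data is ever written
        event_queue.task_done()


storage = []  # stand-in for an open PyTables table
writer = threading.Thread(target=collector, args=(storage,))
writer.start()


# Several "apache threads" producing concurrently:
def produce(station_id):
    for ts in range(3):
        event_queue.put((station_id, ts, b"12k of payload..."))


producers = [threading.Thread(target=produce, args=(s,)) for s in range(4)]
for p in producers:
    p.start()
for p in producers:
    p.join()

event_queue.put(SENTINEL)
writer.join()
print(len(storage))  # 4 stations * 3 events = 12
```

Because the queue serializes everything through one writer, the HDF5
file is never accessed from two threads at once.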
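As for the coincidence requirement: once the per-station timestamps for
a day have been read back out of the table (for example with
`Table.read(field='timestamp')`, or narrowed down first with an
in-kernel `Table.where()` condition), the coincidence search itself can
be a simple two-pointer merge over the sorted timestamp lists. A rough
sketch, with made-up toy data:

```python
def coincidences(ts_a, ts_b, window):
    """Return (t_a, t_b) pairs with |t_a - t_b| <= window.

    ts_a and ts_b must be sorted ascending, which daily timestamp
    columns normally are.
    """
    hits = []
    j = 0
    for t_a in ts_a:
        # advance j past station-b timestamps too far in the past
        while j < len(ts_b) and ts_b[j] < t_a - window:
            j += 1
        k = j
        while k < len(ts_b) and ts_b[k] <= t_a + window:
            hits.append((t_a, ts_b[k]))
            k += 1
    return hits


# toy data: two stations, timestamps in some common clock unit
station_1 = [100, 500, 900, 1500]
station_2 = [102, 710, 1495]

print(coincidences(station_1, station_2, window=10))
# [(100, 102), (1500, 1495)]
```

This scans both lists once, so it stays cheap even for a full day of 4M
events, and it works the same whether the timestamps came out of one
monolithic daily table or several hourly ones.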
> When using pytables, what are your thoughts on the size of individual
> files? One file per day? One file per detector per day? One file per
> apache thread per day? The last option is probably the easiest to
> implement (no need to worry about several threads accessing the same
> file), but would probably make it hard to quickly access one detector's
> data, because it would be contained in separate files.

As I said before, my preference goes to consolidating the daily data in
a single table (50 GB is not that much), but of course that would depend
on your requirements and budget. At any rate, there are many different
possibilities, so I'd recommend doing some experiments, as that is the
best way to assess the best solution for your long-term needs.

-- 
Francesc Alted

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users