On Wednesday 22 September 2010 20:57:06, Josh Ayers wrote:
> For your reference, the application is a large data acquisition
> system with around 10,000 total channels and many different data
> types. The data is currently stored in a National Instruments
> semi-proprietary file format, split over a few files with several
> thousand channels each. I'm looking into using HDF5 as an
> alternative, mostly for my own personal use. The NI file format is
> poorly documented and difficult to use without their expensive
> software.
>
> An important feature is accessing the columns by name, which is why
> it seems a table would work well. I don't think multi-dimensional
> columns would work, for that reason.
>
> You indicated in the Trac ticket that there was a workaround for the
> HDF5 limitation. Is there anything I need to do to utilize that
> workaround? I'll be manually filling in all the values for each row
> before appending it to the table, so I don't need to use any default
> values.
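As context for the name-based column access mentioned above, here is a minimal sketch of how that looks in PyTables. The file, table, and column names are invented for illustration; this is not code from the original thread.

```python
# A minimal sketch of name-based column access on a PyTables Table.
# File, table, and column names here are invented for illustration.
import os
import tempfile

import tables

path = os.path.join(tempfile.mkdtemp(), "acq.h5")

with tables.open_file(path, mode="w") as h5:
    # A dict description; `pos` fixes the column order.
    desc = {"pressure": tables.Float64Col(pos=0),
            "temperature": tables.Float64Col(pos=1)}
    table = h5.create_table("/", "channels", desc)

    # Fill every field of each row manually, then append -- no
    # default values are relied upon.
    row = table.row
    for p, t in [(1.0, 20.5), (2.0, 21.0)]:
        row["pressure"] = p
        row["temperature"] = t
        row.append()
    table.flush()

    # Columns are addressed by name, both for bulk reads and queries.
    pressures = table.read(field="pressure")
    hot = [r["pressure"] for r in table.where("temperature > 20.7")]
```

`table.read(field=...)` pulls a single column out as a NumPy array, and `table.where()` takes a condition string over the named columns.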
I don't think so. The fundamental problem here seems to be a limitation
on the HDF5 type size (64 KB). Perhaps you could report that to the
hdf-forum list so that the HDF crew may raise the priority of fixing
this limitation.

> Another option would be to split the data over several tables. Then
> I could either have a separate index table that lists which column
> is located in which table, or just have my code search each table
> until it finds the desired column. The downside to this approach is
> that I lose the ability to do tables.where() searches on multiple
> columns if they appear in different tables, but I don't think that's
> too much of a problem. If I were to do this, do you have a
> recommendation for the number of columns per table? By default,
> PyTables gives a warning if there are more than 512 columns. Does
> performance start to degrade above this number?

PyTables' Table objects are stored row-wise, so if the number of
columns per table grows too large, a lot of data has to be retrieved
from disk even when you are interested in the contents of only one
column. Hence it is definitely wise to keep the number of columns as
low as possible. 512 is a somewhat arbitrary figure, and the
degradation does not start there; rather, it is progressive (i.e. it
grows with the number of columns), unless you need *all* the column
data during queries.

Hope this helps,

--
Francesc Alted

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
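The split-tables approach discussed in the thread can be sketched as follows. This is an illustrative sketch, not code from either correspondent: the block sizes, channel names, and the plain-dict index are all assumptions, and in practice the index could itself be stored in the file as a small table.

```python
# A sketch of splitting many channels across several tables, with a
# plain dict as the column-name -> table index. All names and sizes
# here are invented for illustration.
import os
import tempfile

import tables

COLS_PER_TABLE = 256  # stay well below the 512-column warning

channel_names = ["ch%04d" % i for i in range(1000)]
path = os.path.join(tempfile.mkdtemp(), "split.h5")
col_index = {}  # column name -> path of the table holding it

with tables.open_file(path, mode="w") as h5:
    for blk, start in enumerate(range(0, len(channel_names),
                                      COLS_PER_TABLE)):
        cols = channel_names[start:start + COLS_PER_TABLE]
        desc = {name: tables.Float64Col(pos=i)
                for i, name in enumerate(cols)}
        table = h5.create_table("/", "block%d" % blk, desc)
        for name in cols:
            col_index[name] = table._v_pathname
        # One sample row per table so there is something to read back.
        row = table.row
        for name in cols:
            row[name] = float(int(name[2:]))
        row.append()
        table.flush()

def read_column(h5file, name):
    """Look up which table holds `name`, then read just that column."""
    return h5file.get_node(col_index[name]).read(field=name)

with tables.open_file(path, mode="r") as h5:
    vals = read_column(h5, "ch0300")  # lives in /block1
```

Each table's row is only `COLS_PER_TABLE * 8` bytes here, comfortably under the 64 KB HDF5 type-size limit, and a single-column read touches only the one table that holds it.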