Hi Brian,

On Tuesday 21 October 2008, B Clowers wrote:
> PyTables Users,
>
> I've read the following thread in an attempt to better understand how
> to organize a 2D EArray/CArray and retain the ability to efficiently
> select rows or columns:
>
> http://www.mail-archive.com/pytables-users@lists.sourceforge.net/msg00723.html
>
> In this thread it was suggested that access to the columns of an
> EArray that was built by appending rows can be done efficiently if
> the appropriate chunkshape is passed (at least by my reading). It
> was also suggested that a second copy of the data be stored in a
> different orientation, but this statement was a bit unclear. What I'm
> looking for is a clear example of how to efficiently access the
> columns of an array built by appending rows. My data come in as a
> series of rows, but I would like to be able to read the columns in a
> reasonable amount of time.
>
> Below I have a code snippet that creates a fairly large EArray by
> appending rows. Can anyone provide some insight on how to access
> these columns efficiently and/or how to make a second copy of the
> data in the file using the appropriate chunkshape? (It is the
> chunkshape aspect that I'm unclear on, i.e. how that size is chosen.)
> Thanks for all your help.
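For concreteness (your snippet didn't survive the quoting above), here is a minimal stand-in for the kind of setup you describe; the file name, node name, dtype and sizes below are made up:

import numpy as np
import tables

# A 2D EArray grown by appending rows, one at a time.  The first
# dimension (length 0) is the enlargeable one.
fileh = tables.openFile("data.h5", mode="w")
earray = fileh.createEArray(fileh.root, "data",
                            tables.Float64Atom(), shape=(0, 6000))
for i in range(3000):
    earray.append(np.random.rand(1, 6000))  # append one 6000-element row
fileh.close()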
Using appropriate chunkshapes for your dataset is a complex but rather interesting topic that I still have to write about (I've been planning to for a long time, but haven't done it yet). In the meanwhile, I can tell you that the chunkshape specifies the minimum amount of data that is read from a dataset on each I/O operation. Based on this, and given your data access pattern, you should be able to figure out (with the help of some experiments) which chunkshape works best for your needs.

In case you want two different access patterns (for example, access by rows and by columns), and you have enough disk space, you can always keep one EArray with a certain chunkshape and another EArray with the same data but a different chunkshape; then you only have to select the appropriate EArray to get the maximum I/O performance.

At any rate, if reading performance is a high priority for you, I strongly encourage you to read the complete "HDF5 Datasets" chapter of the HDF5 User's Guide [1]. You may also find sections 4.1 and 5 of the NetCDF-4 Performance Report [2] interesting. They explain chunked storage, how performance may vary with it, and how it may impact file size. Although the report is about NetCDF-4, its conclusions apply equally to HDF5 (its underlying format).

[1] http://www.hdfgroup.org/HDF5/doc/UG/UG_frame10Datasets.html
[2] http://www.hdfgroup.org/pubs/papers/2008-06_netcdf4_perf_report.pdf

Finally, let me copy verbatim what Mike Folk (from The HDF Group) wrote about the discussion you already read (it took place on the PyTables list and the hdf-forum simultaneously). I think it can help to clarify concepts quite a lot:

"""
On Thursday 06 December 2007, Mike Folk wrote:

Francesc et al:

Just to elaborate a little on Quincey's "slicing the wrong way" explanation. (I hope I'm not just confusing matters.)

If possible you want to design the shape of the chunk so that you get the most useful data with the fewest number of accesses. If accesses are mostly contiguous elements along a certain dimension, you shape the chunk to contain the most elements along that dimension. If accesses are random shapes and sizes, then it gets a little tricky -- we generally just recommend a square (cube, etc.), but that may not be as good as, say, a shape that has the same proportions as your dataset.

So, for instance, if your dataset is 3,000x6,000 (3,000 rows, 6,000 columns) and you always access a single column, then each chunk should contain as much of a column as possible, given your best chunk size. If we assume a good chunk size is 600 elements, then your chunks would all be 600x1, and accessing any column in its entirety would take 5 accesses. Having each chunk be a part of a row (1x600) would give you the worst performance in this case, since you'd need to access 3,000 chunks to read a column.

If accesses are unpredictable, perhaps a chunkshape of 30x60 would be best, as your worst case (reading a single column or row in its entirety) would take 100 accesses. (By worst case, I mean the case where you have to do the most accesses per useful data element.)

Other cases, such as when you don't care about performance when slicing one way but really do when slicing another, would call for a chunk shaped accordingly.

Mike
"""
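To make the "two copies" idea concrete, here is a minimal, untested sketch that builds a second, column-friendly copy of the row-appended array from the stand-in above (the node name "data_colwise" is made up, and the 600x1 chunkshape just reuses Mike's figure; tune it to your own data):

import tables

fileh = tables.openFile("data.h5", mode="a")
rowwise = fileh.root.data              # the original, row-appended EArray
nrows, ncols = rowwise.shape

# Same data, but chunked as 600x1, so every chunk holds a piece of a
# *column*; reading a whole column now touches only nrows/600 chunks.
colwise = fileh.createCArray(fileh.root, "data_colwise",
                             tables.Float64Atom(),
                             shape=(nrows, ncols), chunkshape=(600, 1))

# Copy in blocks of whole rows to keep memory usage bounded (600 rows
# of float64 x 6000 columns is about 28 MB per block).
step = 600
for start in range(0, nrows, step):
    colwise[start:start + step, :] = rowwise[start:start + step, :]

# Read columns from the column-wise copy and rows from the original.
one_column = colwise[:, 42]
one_row = rowwise[100, :]
fileh.close()

If I remember correctly, the copy() method of leaves also accepts a chunkshape argument, which would let you make the second copy in a single call instead of the explicit loop.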
One last note: during that discussion it became apparent that a multidimensional atom for EArray/CArray would be useful in some situations. I'm happy to say that this will be supported in the forthcoming PyTables 2.1 (see http://www.pytables.org/trac/ticket/133 for details).

Hope this helps,

--
Francesc Alted
Freelance Developer & Consultant