[This is a copy of ticket #111 in the PyTables Trac, which was opened by [EMAIL PROTECTED]. I'm answering by e-mail with a copy to the PyTables users list because I think the subject is important enough that other people should know about it and can send their feedback.]
Foreword: Although this subject is rather advanced, if you are a heavy user of PyTables, it is important that you read it entirely and contribute your feedback. Thanks!

> Suppose I wanted to do the following:
>
> import tables, numpy
> N = 600
> f = tables.openFile('foo.h5', 'w')
> f.createCArray(f.root, 'huge_array', tables.Float64Atom(),
>                (2,2,N,N,50,50))
> for i in xrange(50):
>     for j in xrange(50):
>         f.root.huge_array[:,:,:,:,j,i] = numpy.array([[1,0],[0,1]])[:,:,None,None]
>
> This hogs large amounts of memory, and rightly issues the following
> PerformanceWarning from tables/leaf.py:
>
>     if rowsize > maxrowsize:
>         warnings.warn("""\
> array or table ``%s`` is exceeding the maximum recommended rowsize
> (%d bytes); be ready to see PyTables asking for *lots* of memory and
> possibly slow I/O. You may want to reduce the rowsize by trimming the
> value of dimensions that are orthogonal to the main dimension of this
> array or table. Alternatively, in case you have specified a very
> small chunksize, you may want to increase it."""
>                       % (self._v_pathname, maxrowsize),
>                       PerformanceWarning)
>
> The advice offered in this message could be improved: this specific
> problem can be overcome by changing the "main dimension" of the array
> by adding a suitable chunkshape [e.g. (2,2,N,N,1,1)] to the CArray
> definition. For example,
>
>     ... You may want to reduce the rowsize by specifying a suitable
>     chunkshape in a CArray, or by trimming ...
>
> could maybe point the user to notice the chunkshape parameter in the
> manual. Also, I'm not sure how useful it is to ask the user to reduce
> the size of his data, as the aim of PyTables is to store large
> amounts of it.
>
> Using the chunkshape parameter to solve this problem was not really
> obvious to me, and I wasted some time before finding the solution.
> It might be useful if the manual contained more information on
> choosing chunk sizes, and explicitly pointed out that for very large
> multidimensional arrays it may be necessary to specify the chunkshape
> manually, in a way that depends on how the array is accessed.

First of all, the subject of chunkshapes is pretty hairy, and I'm the first to recognize that more documentation is needed in this area. The problem is that use cases differ very much and, in addition, explaining the internals of chunked datasets in a plain way, so that most people can fully understand the basic issues and take action, is not easy at all. That being said, we should try to offer this information as completely as possible (see the proposal at the end of this message).

Regarding your specific problem, the PerformanceWarning issued by PyTables is not necessarily misleading. Your basic problem is that you were adding data to a CArray in dimensions that are orthogonal to the main dimension, and PyTables is not designed to do that; this is what the message is saying: "Please trim dimensions orthogonal to the main dimension". In the case of a CArray, the main dimension is always the first one, and you were adding data in the 5th and 6th dimensions, keeping the main dimension fixed.
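To make the access-pattern difference concrete, here is a minimal, hypothetical sketch (the shape, file name and node name are made up, not taken from the ticket) contrasting a fill along the main dimension of a CArray with a fill orthogonal to it:

    import tables, numpy

    f = tables.openFile('main_dim_demo.h5', 'w')
    a = f.createCArray(f.root, 'a', tables.Float64Atom(), (100, 50, 50))

    # Fill along the main (first) dimension: each assignment writes one
    # "row", i.e. a slice with a single element in the first dimension.
    for i in xrange(100):
        a[i, :, :] = numpy.ones((50, 50))

    # Fill orthogonally to the main dimension: each assignment spans the
    # whole first dimension and, depending on the computed chunkshape,
    # can force HDF5 to update many chunks at once.
    for j in xrange(50):
        a[:, :, j] = numpy.ones((100, 50))

    f.close()

The first loop is the pattern PyTables is designed for; the second one is, in miniature, what your script was doing.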
But more to the point, why is it so bad to have large sizes in dimensions that are orthogonal to the main one? For several reasons:

1. The iterators in PyTables leaves return rows, where a row is defined as a slice of the dataset over the dimensions orthogonal to the main one, with a single element in the main dimension. Because of this, having very large dimensions orthogonal to the main one may end up requiring *long* reads and *large* memory consumption during leaf iterations.

2. The Leaf.copy() method is currently implemented by copying groups of *rows* from source to destination. Again, if the row size is too large, the rows may not fit in memory, leading to very inefficient operation. This problem is of course surmountable by copying chunk by chunk, but that is a bit more difficult and has not been implemented yet.

3. Chunkshape suitability. PyTables tries hard to choose a reasonable chunksize for each dimension based on the shape of the dataset (the chunksizes of all the dimensions together are called the chunkshape). By default, it first trims the chunksize corresponding to the main dimension as much as possible, so that the user can add data row by row (i.e. along the main dimension). If the resulting row size is still too large, it starts trimming the other dimensions (from left to right) until the chunkshape gives a reasonable total chunk size (usually no larger than 16 KB). However, if the user decides not to add data following the main dimension (as in your case), the underlying HDF5 library may end up writing into many chunks at a time, which is clearly inefficient and, what is worse, hogs a lot of memory resources (as happened to you).

While the 2nd issue can be solved relatively easily, the 1st cannot be solved in a general way, and more thought should be put into finding a solution for iterators. However, you have clearly been bitten by the 3rd one.

What happened in your case is the following. You wanted to create a CArray with a shape of (2,2,N,N,50,50), where N=600. As stated in the manual, the main dimension in non-extendable datasets (the case of CArray) is the first one (this is by convention). So the algorithm described above for computing the optimal chunkshape returned (1, 1, 1, 6, 50, 50), which corresponds to a total chunksize of 16 KB (for the Float64 type), a reasonable figure. However, when you tried to fill the CArray, you chose to feed the buckets varying the trailing dimensions more quickly. For example, in the outer loop of your code (index i), and with the 'optimal' computed chunkshape, you were commanding HDF5 to (partially) fill 2*2*600*100 = 240000 chunks each time. This is a disaster from the point of view of efficiency (you are only filling a small part of each chunk) and a huge sink of resources (HDF5 probably tries to keep the complete set of 240000 chunks in memory to complete the operation).

With this, when you switched to a chunkshape of (2,2,N,N,1,1), you ended up with a total chunksize of 2*2*N*N*1*1 elements, which for N=600 and doubles amounts to 11 MB (this can be a killer if you don't traverse the dataset in the right dimensional order; more on this later). However, such a chunkshape does allow each of the chunks on disk to be filled *completely* and *one by one*, and this is why you saw much better performance and memory usage. Unfortunately, with the chunkshape you have chosen, the row size for this dataset is 13 GB, so if you try to iterate over it or copy it, you are in big trouble, as PyTables will try to load 13 GB in one shot in both operations, causing your script to fail (unless you have more than 13 GB of RAM in your machine, which is not usual nowadays).
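For reference, here is a sketch of the workaround you found, with the chunkshape passed explicitly to createCArray() (the chunkshape parameter mentioned in the ticket), so that each assignment in the loops fills exactly one chunk; the file and node names are the ones from your example:

    import tables, numpy

    N = 600
    f = tables.openFile('foo.h5', 'w')
    huge = f.createCArray(f.root, 'huge_array', tables.Float64Atom(),
                          (2, 2, N, N, 50, 50),
                          chunkshape=(2, 2, N, N, 1, 1))
    block = numpy.array([[1, 0], [0, 1]])[:, :, None, None]
    for i in xrange(50):
        for j in xrange(50):
            # With this chunkshape, each assignment maps onto a single
            # (roughly 11 MB) chunk instead of partially touching
            # hundreds of thousands of chunks.
            huge[:, :, :, :, j, i] = block
    f.close()

Keep in mind the caveat above, though: with this chunkshape the row size is 13 GB, so row-wise iteration over, or copying of, the dataset will still be a problem.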
In brief, as PyTables stands now, and to avoid future problems with your datasets, it is always better to make the *main dimension* as large as possible, and to fill your datasets varying the leading indices first.

Now that I see the whole picture, I understand why you were filling the array varying the last indices first: you were following the C convention, where trailing indices vary faster. Mmmm, I see now that, when I implemented the main dimension concept and the automatic computation of the chunkshape, I should have followed the C-order convention instead of a Fortran-order one, which can clearly mislead people (as it did you). Unfortunately, when I took that decision I wasn't thinking about C/Fortran ordering at all, but only about the fact that the 'main' dimension should be the first one, which seemed logical in some sense but has turned out to be a rather bad choice. However, changing the current convention is not going to be easy (it might affect many people) and should probably wait until PyTables 3, so we will have to live with this for a while. One possible workaround is to use an EArray, which allows you to choose the position of your main dimension (in particular, it can be the last one), although the chunkshape computation still trims from left to right, so I'm afraid this doesn't solve that much.

Another thing that Ivan brought to my attention, and which worries me quite a lot, is the fact that chunkshapes are computed automatically for the destination each time a user copies a dataset. The spirit of this 'feature' is that, on each copy (and, in particular, on each invocation of the 'ptrepack' utility), the chunkshape gets 'optimized'. The drawback is that perhaps the user wants to keep the original chunkshape (as is probably your case). In this sense, we plan to add a 'chunkshape' parameter to the Leaf.copy() method so that the user can choose between an automatic computation, keeping the source value, or forcing a new, different chunkshape (we are not certain which one should be the default, though).

At any rate, and as you can see, there is a lot to discuss about this issue. It would be great if we could talk about it on the list and learn about users' needs and preferences. With this feedback, I promise to set up a wiki page on the pytables.org site so that these opinions are reflected there (and people can add more stuff if they want to). As time goes by, we will use all the info/conclusions gathered and try to add a section to chapter 5 (Optimization Tips) of the User's Guide, as well as possible actions for the future (C/Fortran order for PyTables 3, for example).

Hope that helps,

--
>0,0<   Francesc Altet     http://www.carabos.com/
V V     Cárabos Coop. V.   Enjoy Data
 "-"