Hi,

Thanks for your detailed answer!
Let me first clarify why I chose to describe the data as an array of
dimensions (2, 2, N, N, P, Q). In the problem I solved, the quantity
of interest is a 2 x 2 tensor, which is a function defined on an
N x N grid of spatial coordinates. This needs to be solved for
different control parameters (say, p and q), and for each combination
of them I get the solution at once for all grid points --- and in
Fortran order. The order of the array dimensions here came partly
from wanting to keep the data logically in Fortran order: the
logically fastest-varying (most local) indices first, the
slowest-varying (least local) indices last.

***

[clip]
> However, you have been bitten clearly by the 3rd one. What happened
> in your case is the next. You wanted to create a CArray with a shape
> of (2,2,N,N,50,50), where N=600. As stated in the manual, the main
> dimension in non-extendeable datasets (the case of CArray) is the
> first one (this is by convention). So, the described algorithm to
> calculate the optimal chunkshape returned (1, 1, 1, 6, 50, 50),
> which corresponds to a total chunksize of 16 KB (for Float64 type),
> which is a reasonable figure. However, when you tried to fill the
> CArray, you chose to start feeding buckets varying the trailing
> dimensions more quickly. For example, in the outer loop of your code
> (index i), and with the 'optimal' computed shape, you were
> commanding HDF5 to (partially) fill 2*2*600*100=240000 chunks each
> time. This results in a disaster from the point of view of
> efficiency (you are only filling a small part of each chunk) and a
> huge sink of resources (probably HDF5 tries to put the complete set
> of 240000 chunks in-memory for completing the operation).

I think the memory usage was more of a problem for me than the write
performance, as it prevented me from running the code at all. Getting
one (2, 2, N, N) block from the simulation may be slow, so possibly a
slow write speed could have been acceptable; I didn't test this,
though. The annoying part is that the Python code itself does not
explicitly ask PyTables to load large amounts of data into memory
(e.g. by iterating over the array), so in an ideal world the
underlying libraries wouldn't do it either. Is this large memory
usage intrinsic to HDF5 or PyTables, and could it be reduced without
a large amount of work?

[clip]
> In brief, as PyTables is now, and to avoid future problems with your
> datasets, it is always better to make the *main dimension* as large
> as possible, and fill your datasets varying the leading indices
> first.

[clip]
> Now that I see the whole picture, I know why you were trying to fill
> varying the last indices first: you were trying C convention, where
> trailing indices varies faster. Mmmm, I see now that, when I
> implemented the main dimension concept and the automatic computation
> of the chunkshape, I should have followed the C-order convention,
> instead of a Fortran-order one, which can clearly mislead people (as
> yourself). Unfortunately enough, when I took this decision, I wasn't
> thinking about C/Fortran ordering at all, but only in the fact that
> 'main' dimension should be first, which would seem logical in some
> sense, but this demonstrate to be probably very bad.

Choosing the order of the indices properly is probably reasonable
advice for hinting PyTables about the intended order of data access,
but I was so happy to find chunkshape when writing the bug report
that I didn't get the whole picture.
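To make the failure mode concrete, here is a minimal sketch of
roughly what my filling code was doing. (solve() below is only a
stand-in for the actual simulation step; the shapes and chunk counts
are the ones from your explanation.)

    import tables

    N, P, Q = 600, 50, 50
    h5 = tables.openFile('solution.h5', 'w')

    # No explicit chunkshape: PyTables computes (1, 1, 1, 6, 50, 50),
    # treating the *first* dimension as the main one.
    arr = h5.createCArray('/', 'solution', tables.Float64Atom(),
                          shape=(2, 2, N, N, P, Q))

    for i in range(P):         # control parameter p
        for j in range(Q):     # control parameter q
            sol = solve(i, j)  # one (2, 2, N, N) solution, at once
            # With the automatic chunkshape, each of these writes
            # touches 2*2*600*100 = 240000 chunks, filling only one
            # (p, q) slot out of the 50*50 = 2500 in each of them.
            arr[:, :, :, :, i, j] = sol

    h5.close()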
Thanks to your explanation, this is obvious now: for maximum
performance, PyTables by default chunks the data preferably for
logical C-order access (i.e. your F-order, most "local" index last),
and my data was in the opposite, logical Fortran order (i.e. your
C-order, most local index first). No wonder problems appeared...

For aesthetic reasons, one could argue that reordering the indices
properly shouldn't be mandatory, as a suitable HDF5 chunkshape buys
back most of the performance (for iterating, an axis= parameter
somewhere could be fancy). Unfortunately, I don't see any way for
PyTables to automatically divine which indices in an arbitrary
problem are the most "local" and which ones should be considered
less so.

Annoyance arising from this kind of misunderstanding could probably
be avoided by adjusting the message of the PerformanceWarning
suitably, possibly along the lines of "For CArrays, PyTables expects
large multidimensional datasets to be accessed in C-order, but you
can adjust this by specifying a suitable chunkshape." The bit about
"trimming the value of dimensions orthogonal to the main dimension"
is also somewhat confusing, and I (mis?)understood it to mean that I
should store less data... Also, C-order and F-order might be more
familiar concepts to many people than a "main dimension", even though
chunkshape is more complicated than that.

***

> Another thing that Ivan brought to my attention and worries me
> quite a lot is the fact that chunkshapes are computed automatically
> in destination each time that a user copies a dataset. The spirit
> of this 'feature' is that, on each copy (and, in particular, on
> each invocation of the 'ptrepack' utility), the chunkshape is
> 'optimized'. The drawback is that perhaps the user wants to keep
> the original chunkshape (as it is probably your case). In this
> sense, we plan to add a 'chunkshape' parameter to Leaf.copy()
> method so that the user can choose an automatic computation, keep
> the source value or force a new different chunkshape (we are not
> certain about which one would be the default, though).

I'd cast my small vote for keeping the original chunkshape by
default. PyTables cannot divine the intended access pattern of a
general array --- that information is in the chunkshape. If the
chunkshape was chosen manually, there's probably a reason for it;
and if the file was written by PyTables, the chunkshape should be
optimal already.

> At any rate, and as you see, there is a lot to discuss about this
> issue. It would be great if we can talk about this in the list, and
> learn about users needs/preferences. With this feedback, I promise
> to setup a wiki page in the pytables.org site so that these
> opinions would be reflected there (and people can add more stuff,
> if they want so). As the time goes, we will use all the
> info/conclusions gathered and will try to add a section the chapter
> 5 (Optimization Tips) of UG, and possible actions for the future
> (C/Fortran order for PyTables 3, for example).

For people trying to do this kind of NetCDFish data storage, but not
very familiar with dealing with large datasets, it might be valuable
if the manual contained some discussion of handling multidimensional
data that does not fit into memory. (For Tables and 1-D arrays,
typically nothing needs to be done.) One could perhaps start by
introducing the C and Fortran orders, and maybe explain how the
chunkshape functions as a kind of more general hint to HDF5 about
preferred data locality.
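For instance, the fix in my case was to choose a chunkshape aligned
with how the array is actually written, and something like the
following sketch could serve as a manual example. (The exact chunk
dimensions here are only an illustration I haven't benchmarked;
solve() is the same stand-in as above.)

    import tables

    N, P, Q = 600, 50, 50
    h5 = tables.openFile('solution.h5', 'w')

    # Same (2, 2, N, N, P, Q) layout, but with trailing (parameter)
    # chunk dimensions of 1, matching writes of one (p, q) slot at a
    # time.  Each chunk holds 60*60 doubles ~ 28 KB, and one
    # (2, 2, N, N) write now fills 2*2*10*10 = 400 chunks completely.
    arr = h5.createCArray('/', 'solution', tables.Float64Atom(),
                          shape=(2, 2, N, N, P, Q),
                          chunkshape=(1, 1, 60, 60, 1, 1))

    for i in range(P):
        for j in range(Q):
            arr[:, :, :, :, i, j] = solve(i, j)

    h5.close()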
I think a couple of examples along the lines of what works, what
doesn't, and why would set readers' brains on the right track if they
aren't there yet. It probably also needs to be spelled out more
explicitly what kind of chunking the PyTables CArray does by default
(even though, IIRC, HDF5 internally keeps things in C-order within
chunks), how to instruct it to do something more suitable for a given
problem, and what "main dimension" means for multidimensional arrays.

Allowing an order="C" or order="F" flag in the CArray constructor in
the future could also handle the two most common logical orderings
more conveniently than having to specify a chunkshape. Actually,
would adding this even break anything?

Anyway, the fact is that apart from this single chunk-shaped speed
bump, PyTables gave a very smooth ride. Big thanks for that!

Thanks & best regards,
Pauli Virtanen