[This is a copy of ticket #111 in the PyTables Trac, opened by
[EMAIL PROTECTED]. I'm answering by e-mail, with a copy to the
PyTables users list, because I think the subject is important enough
that other people should know about it and can send their feedback.]
Foreword: Although this subject is rather advanced, if you are a heavy
user of PyTables, it is important that you read it entirely and
contribute your feedback. Thanks!
> Suppose I wanted to do the following:
>
> import tables, numpy
> N = 600
> f = tables.openFile('foo.h5', 'w')
> f.createCArray(f.root, 'huge_array', tables.Float64Atom(),
>                (2,2,N,N,50,50))
> for i in xrange(50):
>     for j in xrange(50):
>         f.root.huge_array[:,:,:,:,j,i] = \
>             numpy.array([[1,0],[0,1]])[:,:,None,None]
>
> This hogs large amounts of memory, and rightly issues the following
> PerformanceWarning, from tables/leaf.py
>
> if rowsize > maxrowsize:
>     warnings.warn("""\
> array or table ``%s`` is exceeding the maximum recommended rowsize
> (%d bytes); be ready to see PyTables asking for *lots* of memory and
> possibly slow I/O. You may want to reduce the rowsize by trimming the
> value of dimensions that are orthogonal to the main dimension of this
> array or table. Alternatively, in case you have specified a very
> small chunksize, you may want to increase it."""
>                   % (self._v_pathname, maxrowsize),
>                   PerformanceWarning)
>
> The advice offered in this message could be improved: this specific
> problem can be overcome by changing the "main dimension" of the array
> by adding a suitable chunkshape [e.g. (2,2,N,N,1,1)] to the CArray
> definition. For example,
>
> ... You may want to reduce the rowsize by specifying a suitable
> chunkshape in a CArray, or by trimming ...
>
> could maybe point the user towards the chunkshape parameter in the
> manual. Also, I'm not sure how useful it is to ask users to reduce
> the size of their data, as the aim of PyTables is to store large
> amounts of it.
>
> Using the chunkshape parameter to solve this problem was not really
> obvious to me, and I wasted some time before finding out the
> solution. It might be useful if the manual contained more information
> on choosing the chunk sizes, and explicitly pointed out that for very
> large multidimensional arrays it may be necessary to specify it
> manually in a way depending on how the array is accessed.
First of all, the subject of chunkshapes is pretty hairy, and I'm the
first to recognize that more documentation is needed in this field.
The problem is that use cases vary widely and, in addition, explaining
the internals of chunked datasets plainly enough that most people can
fully understand the basic issues, and act on them, is not easy at
all. That being said, we should try to offer this information as
completely as possible (see the proposal at the end of this message).
Regarding your specific problem, the PerformanceWarning issued by
PyTables is not necessarily misleading. Your basic problem is that you
were adding data to a CArray in dimensions that are orthogonal to the
main dimension, and PyTables is not designed to do that; this is what
the message is saying: "Please trim dimensions orthogonal to the main
dimension". In the case of a CArray, the main dimension is always the
first one, and you were adding data in the 5th and 6th dimensions,
keeping the main dimension fixed.
But, more to the point, why is it bad to have large sizes in
dimensions that are orthogonal to the main one? For several reasons:
1. The iterators in PyTables leaves return rows, where a row is defined
as a slice of the dataset along the dimensions orthogonal to the main
one, with just one element in the main dimension. Because of this,
very large orthogonal dimensions may end up requiring *long* reads and
*large* memory consumption during leaf iterations.
2. The Leaf.copy() method is currently implemented by copying groups of
*rows* from source to destination. Again, if the row size is too
large, it may not fit in memory, leading to very inefficient
operation. This problem is of course surmountable by copying chunk by
chunk, but that is a bit more difficult and has not been implemented
yet.
3. Chunkshape suitability. PyTables tries hard to choose a reasonable
chunksize for each dimension based on the shape of the dataset (the
chunksizes of all dimensions together are called the chunkshape). By
default, it first trims the chunksize of the main dimension as much as
possible, so that the user can add data row by row (i.e. along the
main dimension). If the resulting row size is still too large, it
starts trimming the other dimensions (from left to right) until the
chunkshape gives a reasonable total chunk size (usually no larger than
about 16 K elements). However, if the user decides not to add data
along the main dimension (as in your case), you risk the underlying
HDF5 library writing into many chunks at a time, which is clearly
inefficient and, what is worse, hogs a lot of memory (as happened to
you).
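The trimming strategy above can be sketched in a few lines. This is a
rough sketch, NOT PyTables' actual code, and the size budget is
expressed in number of elements (an assumption on my part), but it
reproduces the chunkshape discussed below:

```python
from math import prod

def guess_chunkshape(shape, max_elements=16 * 1024):
    # Rough sketch of the trimming strategy described above;
    # NOT the real PyTables algorithm.  The budget is counted in
    # elements here (an assumption).
    chunk = list(shape)
    chunk[0] = 1  # main dimension: trimmed first, one "row" per chunk
    for i in range(1, len(chunk)):
        if prod(chunk) <= max_elements:
            break  # the chunk is already small enough
        rest = prod(chunk[i + 1:])  # elements contributed by later dims
        # largest size for this dimension that still fits the budget
        chunk[i] = max(1, min(chunk[i], max_elements // max(rest, 1)))
    return tuple(chunk)

print(guess_chunkshape((2, 2, 600, 600, 50, 50)))  # (1, 1, 1, 6, 50, 50)
```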
While the 2nd issue can be solved relatively easily, the 1st cannot be
solved in a general way, and more thought should be put into finding a
solution for the iterators.
However, you have clearly been bitten by the 3rd one. What happened in
your case is the following. You wanted to create a CArray with a
shape of (2,2,N,N,50,50), where N=600. As stated in the manual, the
main dimension of non-extendable datasets (the case of CArray) is, by
convention, the first one. So the algorithm described above computed
an 'optimal' chunkshape of (1, 1, 1, 6, 50, 50), which corresponds to
a total chunksize of 15,000 elements (about 117 KB for the Float64
type), a reasonable figure. However, when you filled the CArray, you
chose to feed the buckets varying the trailing dimensions fastest.
For example, each pass of the outer loop of your code (index i), with
the computed chunkshape, commanded HDF5 to (partially) fill
2*2*600*100 = 240,000 chunks. This is a disaster from the point of
view of efficiency (you are only filling a small part of each chunk)
and a huge sink of resources (HDF5 probably tries to keep the complete
set of 240,000 chunks in memory to complete the operation).
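The 240,000 figure can be checked with a few lines of arithmetic
(ignoring chunk-alignment effects for simplicity):

```python
from math import prod, ceil

def chunks_touched(extent, chunkshape):
    # Number of chunks a contiguous slice of the given extent
    # overlaps, assuming chunk-aligned access for simplicity.
    return prod(ceil(e / c) for e, c in zip(extent, chunkshape))

# huge_array[:, :, :, :, j, i] = ... writes a slice of this extent:
extent = (2, 2, 600, 600, 1, 1)

auto_chunkshape = (1, 1, 1, 6, 50, 50)        # automatically computed
explicit_chunkshape = (2, 2, 600, 600, 1, 1)  # chosen by hand

print(chunks_touched(extent, auto_chunkshape))      # 240000
print(chunks_touched(extent, explicit_chunkshape))  # 1
```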
Then, when you switched to a chunkshape of (2,2,N,N,1,1), you ended up
with a total chunksize of 2*2*N*N*1*1, which for N=600 and doubles
amounts to 11 MB (this can be a killer if you don't traverse the
dataset in the right dimensional order; more on this later). However,
such a chunkshape does allow each of the chunks on disk to be filled
*completely* and *one by one*, and this is why you saw much better
performance and memory usage.
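For reference, that workaround looks like this (a sketch with a much
smaller N so it runs quickly; note that newer PyTables releases spell
these calls open_file/create_carray, while the post uses the older
camelCase names):

```python
import numpy as np
import tables  # PyTables; older releases spell these openFile/createCArray

N = 6  # the post uses N = 600; kept tiny here so the sketch runs quickly
f = tables.open_file('foo.h5', 'w')
arr = f.create_carray(f.root, 'huge_array', tables.Float64Atom(),
                      (2, 2, N, N, 50, 50),
                      chunkshape=(2, 2, N, N, 1, 1))
for i in range(50):
    for j in range(50):
        # broadcast a 2x2 identity over the (2, 2, N, N) slice
        arr[:, :, :, :, j, i] = np.array([[1., 0.], [0., 1.]])[:, :, None, None]
cs = arr.chunkshape
f.close()
print(cs)  # (2, 2, 6, 6, 1, 1)
```

Each assignment now updates every chunk it touches completely, so HDF5
never needs to keep partially written chunks around.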
Unfortunately, with the chunkshape you have chosen, the row size for
this dataset is 13 GB, so if you try to iterate over it or copy it,
you are in for a big problem: PyTables will try to load 13 GB in one
shot for both operations, causing your script to fail (unless you have
more than 13 GB of RAM in your machine, which is not usual nowadays).
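Both figures are easy to verify:

```python
N, itemsize = 600, 8  # Float64 takes 8 bytes per element

# chunkshape (2, 2, N, N, 1, 1): one chunk holds 2*2*N*N doubles
chunk_bytes = 2 * 2 * N * N * itemsize
print(round(chunk_bytes / 2**20))    # ~11 MB per chunk

# one "row" (main dimension fixed): 2*N*N*50*50 doubles
row_bytes = 2 * N * N * 50 * 50 * itemsize
print(round(row_bytes / 2**30, 1))   # ~13.4 GB per row
```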
In brief, as PyTables stands now, and to avoid future problems with
your datasets, it is always better to make the *main dimension* as
large as possible and to fill your datasets varying the leading
indices first.
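A sketch of that recommended access pattern (tiny shape so it runs
quickly; newer PyTables spells the calls open_file/create_carray):

```python
import numpy as np
import tables

N = 6
f = tables.open_file('rows.h5', 'w')
arr = f.create_carray(f.root, 'a', tables.Float64Atom(),
                      (2, 2, N, N, 50, 50))
# Fill row by row along the main (first) dimension, writing the whole
# orthogonal slice at once, instead of poking the trailing dimensions.
for i in range(arr.shape[0]):
    arr[i, ...] = np.ones((2, N, N, 50, 50))
total = arr[...].sum()
f.close()
print(total)
```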
Now that I see the whole picture, I know why you were trying to fill
varying the last indices first: you were following the C convention,
where trailing indices vary fastest. Mmmm, I see now that, when I
implemented the main dimension concept and the automatic computation
of the chunkshape, I should have followed the C-order convention
instead of a Fortran-order one, which can clearly mislead people (as
it did you). Unfortunately, when I took this decision I wasn't
thinking about C/Fortran ordering at all, but only about the fact that
the 'main' dimension should come first, which seemed logical in some
sense but has turned out to be a rather bad choice.
However, changing the current convention is not going to be easy (it
might affect many people), and should probably wait until PyTables 3,
so we will have to live with this for a while. One possible
workaround is to use an EArray, which lets you choose the position of
your main dimension (in particular, it can be the last one), although
the chunkshape computation still trims from left to right, so I'm
afraid this doesn't solve much.
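For completeness, the EArray workaround looks roughly like this (tiny
shape again; a 0 in the shape marks the extendable dimension, which
becomes the main one):

```python
import numpy as np
import tables

N = 6
f = tables.open_file('earray.h5', 'w')
# The 0 marks the extendable dimension, which becomes the main one --
# here deliberately placed *last*.
ea = f.create_earray(f.root, 'a', tables.Float64Atom(),
                     (2, 2, N, N, 50, 0))
for i in range(50):
    # grow the array along its main (last) dimension, one slice at a time
    ea.append(np.zeros((2, 2, N, N, 50, 1)))
final_shape = ea.shape
f.close()
print(final_shape)  # (2, 2, 6, 6, 50, 50)
```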
Another thing that Ivan brought to my attention, and which worries me
quite a lot, is the fact that chunkshapes are recomputed automatically
for the destination each time a user copies a dataset. The spirit of
this 'feature' is that, on each copy (and, in particular, on each
invocation of the 'ptrepack' utility), the chunkshape is 'optimized'.
The drawback is that the user may want to keep the original chunkshape
(as is probably your case). So we plan to add a 'chunkshape' parameter
to the Leaf.copy() method, letting the user choose between automatic
computation, keeping the source value, or forcing a new chunkshape (we
are not certain yet which one should be the default, though).
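The proposed parameter might be used like this. This is hypothetical
at the time of the post, although later PyTables releases did grow a
'chunkshape' keyword on Leaf.copy() accepting 'keep', 'auto', or an
explicit tuple:

```python
import tables

f = tables.open_file('copy.h5', 'w')
src = f.create_carray(f.root, 'src', tables.Float64Atom(),
                      (4, 4), chunkshape=(2, 2))
# 'keep' preserves the source chunkshape instead of recomputing it
# for the destination (as plain copies do by default).
dst = src.copy(f.root, 'dst', chunkshape='keep')
kept = dst.chunkshape
f.close()
print(kept)  # (2, 2)
```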
At any rate, as you can see, there is a lot to discuss about this
issue. It would be great if we could talk about it on the list and
learn about users' needs and preferences. With this feedback, I
promise to set up a wiki page on the pytables.org site so that these
opinions are reflected there (and people can add more, if they want
to). As time goes by, we will use the info/conclusions gathered to
add a section to chapter 5 (Optimization Tips) of the User's Guide,
and to plan possible actions for the future (C/Fortran order for
PyTables 3, for example).
Hope that helps,
--
>0,0< Francesc Altet http://www.carabos.com/
V V Cárabos Coop. V. Enjoy Data
"-"