[This is a copy of ticket #111 in the PyTables Trac, which was opened by [EMAIL PROTECTED]. I'm answering by e-mail with a copy to the PyTables users list because I think the subject is important enough that other people should know about it and can send their feedback.]
Foreword: Although this subject is rather advanced, if you are a heavy user of PyTables, it is important that you read it entirely and contribute your feedback. Thanks!

> Suppose I wanted to do the following:
>
> import tables, numpy
> N = 600
> f = tables.openFile('foo.h5', 'w')
> f.createCArray(f.root, 'huge_array', tables.Float64Atom(),
>                (2,2,N,N,50,50))
> for i in xrange(50):
>     for j in xrange(50):
>         f.root.huge_array[:,:,:,:,j,i] = numpy.array([[1,0],[0,1]])[:,:,None,None]
>
> This hogs large amounts of memory, and rightly issues the following
> PerformanceWarning from tables/leaf.py:
>
>     if rowsize > maxrowsize:
>         warnings.warn("""\
> array or table ``%s`` is exceeding the maximum recommended rowsize
> (%d bytes); be ready to see PyTables asking for *lots* of memory and
> possibly slow I/O. You may want to reduce the rowsize by trimming the
> value of dimensions that are orthogonal to the main dimension of this
> array or table. Alternatively, in case you have specified a very
> small chunksize, you may want to increase it."""
>                       % (self._v_pathname, maxrowsize),
>                       PerformanceWarning)
>
> The advice offered in this message could be improved: this specific
> problem can be overcome by changing the "main dimension" of the array
> by adding a suitable chunkshape [e.g. (2,2,N,N,1,1)] to the CArray
> definition. For example,
>
>     ... You may want to reduce the rowsize by specifying a suitable
>     chunkshape in a CArray, or by trimming ...
>
> could maybe point the user to notice the chunkshape parameter in the
> manual. Also, I'm not sure how useful it is to ask the user to reduce
> the size of his data, as the aim of PyTables is to store large
> amounts of it.
>
> Using the chunkshape parameter to solve this problem was not really
> obvious to me, and I wasted some time before finding the solution.
> It might be useful if the manual contained more information on
> choosing chunk sizes, and explicitly pointed out that for very large
> multidimensional arrays it may be necessary to specify the chunkshape
> manually, in a way that depends on how the array is accessed.

First of all, the subject of chunkshapes is pretty hairy, and I'm the first to recognize that more documentation is needed in this area. The problem is that use cases differ very much and, in addition, explaining the internals of chunked datasets in a plain way, so that most people can fully understand the basic issues and take action, is not easy at all. That being said, we should try to offer this information as completely as possible (see the proposal at the end of this message).

Regarding your specific problem, the PerformanceWarning issued by PyTables is not necessarily misleading. Your basic problem is that you were adding data to a CArray in dimensions that are orthogonal to the main dimension, and PyTables is not designed to do that; this is what the message is saying: "Please trim dimensions orthogonal to the main dimension". In the case of a CArray, the main dimension is always the first one, and you were adding data in the 5th and 6th dimensions, keeping the main dimension fixed.
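To make the access-pattern difference concrete, here is a minimal, hypothetical sketch (the shape, file name and node name are made up, not taken from the ticket) contrasting a fill along the main dimension of a CArray with a fill orthogonal to it:

    import tables, numpy

    f = tables.openFile('main_dim_demo.h5', 'w')
    a = f.createCArray(f.root, 'a', tables.Float64Atom(), (100, 50, 50))

    # Fill along the main (first) dimension: each assignment writes one
    # "row", i.e. a slice with a single element in the first dimension.
    for i in xrange(100):
        a[i, :, :] = numpy.ones((50, 50))

    # Fill orthogonally to the main dimension: each assignment spans the
    # whole first dimension and, depending on the computed chunkshape,
    # can force HDF5 to update many chunks at once.
    for j in xrange(50):
        a[:, :, j] = numpy.ones((100, 50))

    f.close()

The first loop is the pattern PyTables is designed for; the second one is, in miniature, what your script was doing.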
But more to the point, why is it so bad to have large sizes in dimensions that are orthogonal to the main one? For several reasons:

1. The iterators in PyTables leaves return rows, where a row is defined as a slice of the dataset over the dimensions orthogonal to the main one, with a single element in the main dimension. Because of this, having very large dimensions orthogonal to the main one may end up requiring *long* reads and *large* memory consumption during leaf iterations.

2. The Leaf.copy() method is currently implemented by copying groups of *rows* from source to destination. Again, if the row size is too large, the rows may not fit in memory, leading to very inefficient operation. This problem is of course surmountable by copying chunk by chunk, but that is a bit more difficult and has not been implemented yet.

3. Chunkshape suitability. PyTables tries hard to choose a reasonable chunksize for each dimension based on the shape of the dataset (the chunksizes of all the dimensions together are called the chunkshape). By default, it first trims the chunksize corresponding to the main dimension as much as possible, so that the user can add data row by row (i.e. along the main dimension). If the resulting row size is still too large, it starts trimming the other dimensions (from left to right) until the chunkshape gives a reasonable total chunk size (usually no larger than 16 KB). However, if the user decides not to add data following the main dimension (as in your case), the underlying HDF5 library may end up writing into many chunks at a time, which is clearly inefficient and, what is worse, hogs a lot of memory resources (as happened to you).

While the 2nd issue can be solved relatively easily, the 1st cannot be solved in a general way, and more thought should be put into finding a solution for iterators. However, you have clearly been bitten by the 3rd one.

What happened in your case is the following. You wanted to create a CArray with a shape of (2,2,N,N,50,50), where N=600. As stated in the manual, the main dimension in non-extendable datasets (the case of CArray) is the first one (this is by convention). So the algorithm described above for computing the optimal chunkshape returned (1, 1, 1, 6, 50, 50), which corresponds to a total chunksize of 16 KB (for the Float64 type), a reasonable figure. However, when you tried to fill the CArray, you chose to feed the buckets varying the trailing dimensions more quickly. For example, in the outer loop of your code (index i), and with the 'optimal' computed chunkshape, you were commanding HDF5 to (partially) fill 2*2*600*100 = 240000 chunks each time. This is a disaster from the point of view of efficiency (you are only filling a small part of each chunk) and a huge sink of resources (HDF5 probably tries to keep the complete set of 240000 chunks in memory to complete the operation).

With this, when you switched to a chunkshape of (2,2,N,N,1,1), you ended up with a total chunksize of 2*2*N*N*1*1 elements, which for N=600 and doubles amounts to 11 MB (this can be a killer if you don't traverse the dataset in the right dimensional order; more on this later). However, such a chunkshape does allow each of the chunks on disk to be filled *completely* and *one by one*, and this is why you saw much better performance and memory usage. Unfortunately, with the chunkshape you have chosen, the row size for this dataset is 13 GB, so if you try to iterate over it or copy it, you are in big trouble, as PyTables will try to load 13 GB in one shot in both operations, causing your script to fail (unless you have more than 13 GB of RAM in your machine, which is not usual nowadays).
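For reference, here is a sketch of the workaround you found, with the chunkshape passed explicitly to createCArray() (the chunkshape parameter mentioned in the ticket), so that each assignment in the loops fills exactly one chunk; the file and node names are the ones from your example:

    import tables, numpy

    N = 600
    f = tables.openFile('foo.h5', 'w')
    huge = f.createCArray(f.root, 'huge_array', tables.Float64Atom(),
                          (2, 2, N, N, 50, 50),
                          chunkshape=(2, 2, N, N, 1, 1))
    block = numpy.array([[1, 0], [0, 1]])[:, :, None, None]
    for i in xrange(50):
        for j in xrange(50):
            # With this chunkshape, each assignment maps onto a single
            # (roughly 11 MB) chunk instead of partially touching
            # hundreds of thousands of chunks.
            huge[:, :, :, :, j, i] = block
    f.close()

Keep in mind the caveat above, though: with this chunkshape the row size is 13 GB, so row-wise iteration over, or copying of, the dataset will still be a problem.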
In brief, as PyTables stands now, and to avoid future problems with your datasets, it is always better to make the *main dimension* as large as possible, and to fill your datasets varying the leading indices first.

Now that I see the whole picture, I understand why you were filling the array varying the last indices first: you were following the C convention, where trailing indices vary faster. Mmmm, I see now that, when I implemented the main dimension concept and the automatic computation of the chunkshape, I should have followed the C-order convention instead of a Fortran-order one, which can clearly mislead people (as it did you). Unfortunately, when I took that decision I wasn't thinking about C/Fortran ordering at all, but only about the fact that the 'main' dimension should be the first one, which seemed logical in some sense but has turned out to be a rather bad choice. However, changing the current convention is not going to be easy (it might affect many people) and should probably wait until PyTables 3, so we will have to live with this for a while. One possible workaround is to use an EArray, which allows you to choose the position of your main dimension (in particular, it can be the last one), although the chunkshape computation still trims from left to right, so I'm afraid this doesn't solve that much.

Another thing that Ivan brought to my attention, and which worries me quite a lot, is the fact that chunkshapes are computed automatically for the destination each time a user copies a dataset. The spirit of this 'feature' is that, on each copy (and, in particular, on each invocation of the 'ptrepack' utility), the chunkshape gets 'optimized'. The drawback is that perhaps the user wants to keep the original chunkshape (as is probably your case). In this sense, we plan to add a 'chunkshape' parameter to the Leaf.copy() method so that the user can choose between an automatic computation, keeping the source value, or forcing a new, different chunkshape (we are not certain which one should be the default, though).

At any rate, and as you can see, there is a lot to discuss about this issue. It would be great if we could talk about it on the list and learn about users' needs and preferences. With this feedback, I promise to set up a wiki page on the pytables.org site so that these opinions are reflected there (and people can add more stuff if they want to). As time goes by, we will use all the info/conclusions gathered and try to add a section to chapter 5 (Optimization Tips) of the User's Guide, as well as possible actions for the future (C/Fortran order for PyTables 3, for example).

Hope that helps,

--
>0,0<   Francesc Altet     http://www.carabos.com/
V V     Cárabos Coop. V.   Enjoy Data
 "-"