Hi,

Thanks for your detailed answer!

Let me first clarify the reasons why I chose to describe the data as an 
array of dimensions

        (2, 2, N, N, P, Q)

In the problem I solved, the quantity of interest is a 2 x 2 tensor 
field defined on an N x N grid of spatial coordinates. This needs to be 
solved for different control parameters (say, p and q), and for each 
combination of them I obtain the solution for all grid points at once 
--- and in Fortran order.

The order of the dimensions of the array came partly from wanting to 
keep the data logically in Fortran order: the logically fastest-varying 
(most local) indices come first, the slowest-varying (least local) ones 
last.
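To make the layout concrete, here is a toy-sized numpy sketch (the 
small N, P, Q below are my choice, standing in for the real N=600, 
P=Q=50):

```python
import numpy as np

# Toy sizes standing in for the real N = 600, P = Q = 50
N, P, Q = 4, 3, 3

a = np.zeros((2, 2, N, N, P, Q), order='F')

# In Fortran order the first index is the fastest-varying one in
# memory (smallest stride) and the last index the slowest (largest):
assert a.strides[0] == min(a.strides)
assert a.strides[-1] == max(a.strides)
```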

    ***

[clip]
> However, you have been bitten clearly by the 3rd one.  What happened in 
> your case is the following.  You wanted to create a CArray with a shape 
> of (2,2,N,N,50,50), where N=600.  As stated in the manual, the main 
> dimension in non-extendable datasets (the case of CArray) is the first 
> one (this is by convention).  So, the described algorithm to calculate 
> the optimal chunkshape returned (1, 1, 1, 6, 50, 50), which corresponds 
> to a total chunksize of 16 KB (for Float64 type), which is a reasonable 
> figure.  However, when you tried to fill the CArray, you chose to start 
> feeding buckets varying the trailing dimensions more quickly.  For 
> example, in the outer loop of your code (index i), and with 
> the 'optimal' computed shape, you were commanding HDF5 to (partially) 
> fill 2*2*600*100=240000 chunks each time.  This results in a disaster 
> from the point of view of efficiency (you are only filling a small part 
> of each chunk) and a huge sink of resources (probably HDF5 tries to put 
> the complete set of 240000 chunks in-memory for completing the 
> operation).

I think the memory usage was more of a problem for me than the write 
performance, as it prevented the code from running at all. Obtaining one 
(2,2,N,N) block from the simulation may be slow anyway, so a reduced 
write speed could have been acceptable. I didn't test this, though.
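Just to spell out the arithmetic from your explanation in plain Python 
(the shapes are the ones from this thread):

```python
from math import prod

shape      = (2, 2, 600, 600, 50, 50)   # the CArray
chunkshape = (1, 1,   1,   6, 50, 50)   # the automatically computed one

# Writing one full (2, 2, N, N) solution slab, i.e. a[:, :, :, :, p, q],
# touches this many chunks, each of them only partially filled:
slab = (2, 2, 600, 600, 1, 1)
touched = prod(-(-s // c) for s, c in zip(slab, chunkshape))
print(touched)  # 2 * 2 * 600 * 100 = 240000
```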

The annoying part is that the Python code itself never explicitly asks 
PyTables to load large amounts of data into memory (e.g. by iterating 
over the array), so in an ideal world the underlying libraries wouldn't 
do so either. Is this large memory usage intrinsic to HDF5 or PyTables, 
and could it be reduced without a large amount of work?

[clip]
> In brief, as PyTables is now, and to avoid future problems with your 
> datasets, it is always better to make the *main dimension* as large as 
> possible, and fill your datasets varying the leading indices first.

[clip]
> Now that I see the whole picture, I know why you were trying to fill
> varying the last indices first: you were trying C convention, where
> trailing indices vary faster.  Mmmm, I see now that, when I 
> implemented
> the main dimension concept and the automatic computation of the
> chunkshape, I should have followed the C-order convention, instead of a
> Fortran-order one, which can clearly mislead people (as yourself).
> Unfortunately enough, when I took this decision, I wasn't thinking about
> C/Fortran ordering at all, but only in the fact that the 'main' dimension
> should be first, which would seem logical in some sense, but this has
> turned out to be a rather bad choice.

Choosing the order of the indices properly is probably reasonable 
advice on how to hint PyTables about the intended data access pattern, 
but I was so happy to find chunkshape when writing the bug report that 
I didn't see the whole picture. Thanks to your explanation, it is 
obvious now: for maximum performance PyTables by default chunks the 
data assuming logical C-order (i.e. your F-order, most "local" index 
last), while my data was in the opposite, logical Fortran-order (i.e. 
your C-order, most local index first). No wonder problems appeared...

For aesthetic reasons, one could argue that reordering the indices 
shouldn't be mandatory, as a suitable HDF5 chunkshape buys back most of 
the performance (for iterating, an axis= parameter somewhere could be 
handy). Unfortunately, I don't see any way for PyTables to automatically 
divine which indices of an arbitrary problem are the most "local" and 
which should be considered less so.
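For instance, a chunkshape matched to the write pattern of my problem 
(the `manual` tuple below is my own pick, not something PyTables 
computes) brings the number of chunks touched by each write down 
dramatically:

```python
from math import prod

def chunks_touched(write_extent, chunkshape):
    # Number of chunks a hyperslab write of the given extent intersects
    return prod(-(-w // c) for w, c in zip(write_extent, chunkshape))

slab = (2, 2, 600, 600, 1, 1)        # one a[:, :, :, :, p, q] write

auto   = (1, 1,  1,  6, 50, 50)      # the automatically computed chunkshape
manual = (2, 2, 60, 60,  1,  1)      # matched to the access pattern

print(chunks_touched(slab, auto))    # 240000 chunks, all partially filled
print(chunks_touched(slab, manual))  # 100 chunks, each written completely
```

The manual chunkshape here holds about 112 KB of Float64 data per 
chunk, so the chunks stay a reasonable size while every write fills 
whole chunks.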

Annoyance arising from this kind of misunderstanding could probably be 
avoided by adjusting the message of the PerformanceWarning suitably. 
Possibly something along the lines of: "For CArrays, PyTables expects 
large multidimensional datasets to be accessed in C-order, but you can 
adjust this by specifying a suitable chunkshape." The bit about 
"trimming the value of dimensions orthogonal to the main dimension" is 
also somewhat confusing, and I (mis?)understood it to mean that I 
should store less data... Also, C-order and F-order may be more 
familiar concepts to many people than a "main dimension", even though 
chunkshape is more complicated than that.

    ***

> Another thing that Ivan brought to my attention and worries me quite a 
> lot is the fact that chunkshapes are computed automatically in 
> destination each time that a user copies a dataset.  The spirit of 
> this 'feature' is that, on each copy (and, in particular, on each 
> invocation of the 'ptrepack' utility), the chunkshape is 'optimized'.  
> The drawback is that perhaps the user wants to keep the original 
> chunkshape (as is probably your case).  In this sense, we plan to 
> add a 'chunkshape' parameter to Leaf.copy() method so that the user can 
> choose an automatic computation, keep the source value or force a new 
> different chunkshape (we are not certain about which one would be the 
> default, though).

I'd cast my small vote for keeping the original chunkshape by default. 
PyTables cannot divine the intended access pattern for a general array 
--- that information is in the chunkshape. If the chunkshape was chosen 
manually, there's probably a reason for it. If the file was written by 
PyTables, the chunkshape should already be optimal.

> At any rate, and as you see, there is a lot to discuss about this 
> issue.  
> It would be great if we can talk about this in the list, and learn 
> about users needs/preferences.  With this feedback, I promise to setup 
> a wiki page in the pytables.org site so that these opinions would be 
> reflected there (and people can add more stuff, if they want so).  As 
> the time goes, we will use all the info/conclusions gathered and will 
> try to add a section the chapter 5 (Optimization Tips) of UG, and 
> possible actions for the future (C/Fortran order for PyTables 3, for 
> example).

For people trying to do this kind of NetCDFish data storage who are not 
very familiar with dealing with large datasets, it might be valuable if 
the manual contained some discussion of handling multidimensional data 
that does not fit into memory. (For Tables and 1D arrays, typically 
nothing needs to be done.) One could perhaps start by introducing the 
C- and Fortran-orders, and then explain how the chunkshape acts as a 
more general hint to HDF5 about the preferred data locality. I think a 
couple of examples along the lines of what works, what doesn't, and why 
would set readers' brains on the right track if they aren't there yet.

It is probably also necessary to spell out more explicitly what kind of 
chunking the PyTables CArray does by default (even though, IIRC, HDF5 
internally keeps things in C-order within chunks), how to instruct it 
to do something more suitable for a given problem, and what "main 
dimension" means for multidimensional arrays.

Allowing an order="C" or order="F" flag in the CArray constructor could 
in the future also handle the two most common logical orderings more 
conveniently than specifying a chunkshape by hand. Actually, would 
adding this even break anything?
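As a hypothetical sketch of what such a flag could do internally 
(nothing below is existing PyTables API; the function, its name and the 
target chunk size are all made up for illustration):

```python
from math import prod

def order_chunkshape(shape, order='C', target=2048):
    """Pick a chunkshape that keeps the fast-varying axes whole and
    trims the slow-varying ones first ('C': last axis is the fastest,
    'F': first axis is the fastest).  Purely illustrative."""
    chunk = list(shape)
    slow_first = range(len(shape)) if order == 'C' else reversed(range(len(shape)))
    for ax in slow_first:
        while chunk[ax] > 1 and prod(chunk) > target:
            chunk[ax] = -(-chunk[ax] // 2)   # halve the axis, rounding up
        if prod(chunk) <= target:
            break
    return tuple(chunk)

print(order_chunkshape((2, 2, 600, 600, 50, 50), 'C'))  # (1, 1, 1, 1, 25, 50)
print(order_chunkshape((2, 2, 600, 600, 50, 50), 'F'))  # (2, 2, 300, 1, 1, 1)
```

The 'C' case resembles what the automatic algorithm already does 
(trimming the leading dimensions first); the 'F' case is its mirror 
image, which would have suited my data.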


Anyway, the fact is that apart from this single chunk-shaped speed 
bump, PyTables has given a very smooth ride. Big thanks for that!

Thanks & best regards,
Pauli Virtanen


_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
