Hi Pauli,

On Friday 23 November 2007, Pauli Virtanen wrote:
[...]
> The order of the dimensions of the array here came partly from
> wanting to keep the data logically in Fortran-order: the logically
> fastest-varying (most local) indices are first, the slowest-varying
> indices (least local) last.

Very interesting to know about this.
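For anyone following the thread, here is a tiny numpy illustration (with a toy shape; the names are just for the example) of which index is the contiguous one under each convention:

    import numpy as np

    # C-order (numpy's default): the *last* index is the fastest-varying
    # one, i.e. a[i, j, k, :] is contiguous in memory.
    a = np.zeros((2, 2, 4, 4))
    print a.strides   # (256, 128, 32, 8): strides shrink left to right

    # Fortran-order: the *first* index is the fastest-varying one,
    # i.e. f[:, j, k, l] is contiguous in memory.
    f = np.zeros((2, 2, 4, 4), order='F')
    print f.strides   # (8, 16, 32, 128): strides grow left to right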
> I think the memory usage was more of a problem for me than the
> performance, as it prevented me from running the code. Getting one
> (2,2,N,N) chunk from simulation may be slow, so possibly slow write
> speed could have been acceptable. I didn't test this, though.
>
> The annoying part is that the Python code itself does not explicitly
> ask Pytables to load large chunks of data to memory (eg. by iterating
> etc.), so in an ideal world, the underlying libraries wouldn't do it
> either. Is this large memory usage intrinsic to HDF5 or Pytables, and
> could it be reduced without a large amount of work?

I think this behaviour is clearly intrinsic to HDF5, because PyTables is not intrusive at all in that regard: it simply passes the write request on to HDF5, and it is HDF5's responsibility to update the appropriate chunks on disk. I'm not entirely certain why it consumes so much memory in that case, but it would certainly be good to ask on the HDF5 mailing list.

> Choosing the order of indices properly is probably a reasonable
> advice how to hint Pytables about the order of data access, but I was
> so happy to find chunkshape when writing the bug report that I didn't
> get the whole picture. Thanks to your explanation, this is obvious
> now: for maximum performance Pytables chunks by default the data
> preferably in logical C-order (ie. your F-order, most "local" index
> last), and my data was in the opposite, logical Fortran-order (ie.
> your C-order, most local index first). No wonder problems appeared...

Thanks a lot for clarifying this. I got confused into thinking that it is better to fill an array along the dimension where the indices vary fastest. This is clearly wrong, and fortunately, PyTables chose the correct C-order, making the main dimension the leading one. So we don't have to think about changing this in future releases (phew! :)

> For aesthetic reasons, one could argue that that reordering indices
> properly shouldn't be mandatory, as a suitable HDF5 chunkshape buys
> back most of the performance (for iterating, an axis= parameter
> somewhere could be fancy). Unfortunately, I don't see any way for
> Pytables to automatically divine which indices in an arbitrary
> problem are the most "local" and which ones should be considered less
> so.
>
> Annoyance arising from this kind of misunderstanding could probably
> be avoided by adjusting the message in the PerformanceWarning in a
> suitable way. Possibly something along the lines of "For CArrays,
> Pytables expects large multidimensional datasets to be accessed in
> C-order, but you can adjust this by specifying a suitable
> chunkshape." The bit about "trimming value of dimensions orthogonal
> to the main dimension" is also somewhat confusing, and I
> (mis?)understood it to mean that I should store less data... Also,
> C-order and F-order might be a more familiar concept for many people
> than a "main dimension", even though chunkshape is more complicated
> than that.

While I agree that the "trimming value of dimensions orthogonal to the main dimension" sentence could be much improved, I'm not certain that directing users to a 'home-made' chunkshape would be a wise thing. I'm saying this mainly because ending up with a rowsize of 13 GB (as was your case) and having to iterate through such gigantic rows can be just fatal.
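To make the alternative concrete, here is a minimal sketch of the dimension reordering I would recommend instead of a hand-made chunkshape (N and the file name are placeholders; the real datasets in this thread are of course much larger):

    import tables

    N = 256   # placeholder size

    fileh = tables.openFile('layout.h5', mode='w')

    # Fortran-style logical layout: most local indices first.  The
    # automatic chunkshape assumes C-order access, so this is the
    # problematic combination.
    fort = fileh.createCArray(fileh.root, 'fort', tables.Float64Atom(),
                              (2, 2, N, N))

    # The same data with dimensions reordered to C-order: most local
    # indices last.  The automatic chunkshape now works in our favour.
    cord = fileh.createCArray(fileh.root, 'cord', tables.Float64Atom(),
                              (N, N, 2, 2))

    print fort.chunkshape, cord.chunkshape
    fileh.close()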
As you suggested, adding an axis= parameter to Array.iterrows() could alleviate this, but the user has to be very aware of the implications of choosing his own chunkshape in order to choose the correct axis when iterating.

Also, changing the chunkshape can have a very important impact if the user inadvertently ends up choosing a big chunksize. For example, you chose a chunkshape of (2,2,N,N,1,1), which leads to a chunksize of 11 MB. But as the chunk is the basic unit for doing I/O and where filters (see compression) act, that means that if you were interested in accessing one single element of the CArray, you would first need to read 11 MB from disk, and second, uncompress them, before the element could be delivered to you. That is too much work for just one single element (or a few).

In a word, letting the user choose an arbitrary chunkshape could lead to very bad things in the long run, and this right should be exercised only by *very* knowledgeable users, IMO. For me, it would be much better to direct users to reorganize their data to follow a sensible C-order by reordering dimensions, because it is easier in most cases and less prone to end with an inappropriate chunkshape. In that sense, I still find more practical (and definitely simpler) the idea of a main dimension (placed in the leading position by default), and an automatic calculation (favouring C-order usage) of sensible values for chunkshapes.

> > Another thing that Ivan brought to my attention and worries me
> > quite a lot is the fact that chunkshapes are computed automatically
> > in destination each time that a user copies a dataset. The spirit
> > of this 'feature' is that, on each copy (and, in particular, on
> > each invocation of the 'ptrepack' utility), the chunkshape is
> > 'optimized'. The drawback is that perhaps the user wants to keep
> > the original chunkshape (as it is probably your case). In this
> > sense, we plan to add a 'chunkshape' parameter to Leaf.copy()
> > method so that the user can choose an automatic computation, keep
> > the source value or force a new different chunkshape (we are not
> > certain about which one would be the default, though).
>
> I'd cast my small vote on keeping the original chunkshape by default.
> Pytables cannot divine the intended access pattern for a general
> array --- the information is in chunkshape. If chunkshape was chosen
> manually, there's probably a reason for it. If the file was written
> by Pytables, the chunkshape should be optimal already.

Not quite. Perhaps you are not using this, but one of the features I like best (and one that many people use) is the possibility of enlarging a dataset. PyTables does compute a sensible chunkshape at creation time by using the estimation in the expectedrows= parameter, but I'm not convinced that many people use this estimation. Also, it is not uncommon for tables/earrays/vlarrays to grow much more than originally foreseen, and what was a sensible estimation yesterday might clearly be too small tomorrow. Automatically recomputing chunkshapes during copies was an attempt to always keep them in a sane state, relieving the user of this burden.

But I agree that this is perhaps a bit too intrusive. Perhaps what we will end up doing is keeping the original value by default, and adding a new option to the 'ptrepack' utility to 'optimize' the chunkshapes when repacking an existing file. This way, users can automatically 'optimize' their files from time to time if they want to.
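Just to illustrate how the creation-time estimation drives the automatic chunkshape of an enlargeable dataset (the names and figures below are made up for the example):

    import tables

    fileh = tables.openFile('grow.h5', mode='w')

    # The automatic chunkshape is computed from the expectedrows=
    # estimation, so a wildly wrong estimate leaves the dataset with
    # chunks that stop being sensible once it grows far beyond it.
    small = fileh.createEArray(fileh.root, 'small', tables.Float64Atom(),
                               (0, 1000), expectedrows=100)
    big = fileh.createEArray(fileh.root, 'big', tables.Float64Atom(),
                             (0, 1000), expectedrows=10000000)

    print small.chunkshape   # tuned for a small dataset
    print big.chunkshape     # tuned for ~10 million rows

    fileh.close()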
> > It would be great if we can talk about this in the list, and learn
> > about users needs/preferences. With this feedback, I promise to
> > setup a wiki page in the pytables.org site so that these opinions
> > would be reflected there (and people can add more stuff, if they
> > want so). As the time goes, we will use all the info/conclusions
> > gathered and will try to add a section the chapter 5 (Optimization
> > Tips) of UG, and possible actions for the future (C/Fortran order
> > for PyTables 3, for example).
>
> For people trying to do this kind of NetCDFish data storage, but not
> very familiar with dealing with large datasets, it might be valuable
> if the manual contained some discussion about handling
> multidimensional data that does not fit into memory. (For Tables and
> 1D-arrays, typically nothing needs to be done.) One could perhaps
> start by introducing the C- and Fortran-orders, and maybe explain how
> chunkshape functions as a some kind of a more general indicator to
> HDF5 about preferred data locality. I think a couple of examples on
> the lines of what works, what doesn't and why will set the readers'
> brain on the right track if they aren't there yet.
>
> And probably it is anyway needed to spell out more explicitly what
> kind of chunking Pytables CArray does by default (even though, IIRC,
> HDF5 internally keeps things in C-order within chunks), how to
> instruct it to do something more suitable for a given problem, and
> what does "main dimension" mean for multidimensional arrays.

Yeah. Wise advice.

> Allowing an order="C" or order="F" flag to CArray constructor in the
> future could also handle the two most common logical orderings more
> conveniently than having to specify a chunkshape. Actually, would
> adding this even break anything?

Mmm, that's a very interesting suggestion. I don't think it would be especially complicated to implement such an order="F" flag in the CArray/EArray constructors. That way, people more used to Fortran ordering could feel more comfortable using PyTables. I'll think about this.

> Anyway, a fact is that apart from this single chunk-shaped speed
> bump, Pytables gave a very smooth ride otherwise. Big thanks for
> that!

Excellent! Big thanks too for the excellent feedback!

Cheers,

--
>0,0<    Francesc Altet     http://www.carabos.com/
V V      Cárabos Coop. V.   Enjoy Data
 "-"