Hi Brian,

On Tuesday 21 October 2008, B Clowers wrote:
> PyTable Users,
>
> I've read the following thread in an attempt to better understand how
> to organize a 2D EArray/CArray and retain the ability to efficiently
> select rows or columns.
>
> http://www.mail-archive.com/pytables-users@lists.sourceforge.net/msg00723.html
>
> In this thread it was suggested that access to the columns of an
> EArray that was built by appending rows could be done efficiently if
> the appropriate chunkshape is passed (at least by my reading).  It
> was also suggested that a second copy of the data be stored in a
> different orientation, but this statement was a bit unclear.  What I'm
> looking for is a clear example of how to efficiently access the
> columns of an array built by appending rows.  My data come in as a
> series of rows but I would like to be able to read the columns in a
> reasonable amount of time. 
>
> Below I have a code snippet that creates a fairly large EArray by
> appending rows.  Can anyone provide some insight on how to access
> these columns efficiently and/or how to make a second copy of the
> data in the file using the appropriate chunkshape?  (It is the
> chunkshape aspect, namely how that size is chosen, that I'm unclear
> on.)  Thanks for all your help.

Using appropriate chunkshapes for your dataset is a complex but rather 
interesting topic that I still have to write about (I have planned this 
for a long time, but haven't done it yet).

In the meantime, I can tell you that the chunkshape specifies the 
minimum amount of data that is read from a dataset on each I/O 
operation.  Based on this and on your data access pattern, you should 
be able to figure out (with the help of some experiments) which 
chunkshape works best for your needs.  If you need two different access 
patterns (for example, access by rows and by columns) and you have 
enough disk space, you can always keep one EArray with a certain 
chunkshape and another EArray with the same data but a different 
chunkshape; then you only have to select the appropriate EArray to get 
the maximum I/O performance.
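
To make this concrete, here is a minimal sketch of the two-copies 
approach (this is mine, not from the original message; the file name, 
node names, dataset sizes and chunkshapes are purely illustrative 
assumptions):

import numpy as np
import tables

f = tables.openFile("chunks.h5", mode="w")
atom = tables.Float64Atom()

# Row-friendly copy: each chunk holds one full row, so reading a row
# touches a single chunk.  Data is appended row by row, as it arrives.
byrow = f.createEArray(f.root, "byrow", atom, shape=(0, 6000),
                       chunkshape=(1, 6000))
for i in range(3000):
    byrow.append(np.random.rand(1, 6000))

# Column-friendly copy: each chunk holds a 600-element column segment,
# so reading a full column touches only 3000/600 = 5 chunks.
bycol = f.createCArray(f.root, "bycol", atom, shape=(3000, 6000),
                       chunkshape=(600, 1))
for start in range(0, 3000, 600):
    # fill from the row-wise array in blocks to keep memory bounded
    bycol[start:start + 600, :] = byrow[start:start + 600, :]

row = byrow[42, :]   # fast: reads one chunk
col = bycol[:, 42]   # fast: reads five chunks
f.close()

Recent PyTables versions should also accept a chunkshape argument in 
Leaf.copy(), which makes the second copy in a single call.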

At any rate, if reading performance is a high priority for you, I 
strongly encourage you to read the complete "HDF5 Datasets" chapter of 
the HDF5 User's Guide [1].  You may also find sections 4.1 and 5 of the 
NetCDF-4 Performance Report [2] interesting.  They explain chunked 
storage, how performance can vary with the chunk shape, and how it can 
affect file size.  Although the report is about NetCDF-4, its 
conclusions apply equally to HDF5 (its underlying format).

[1] http://www.hdfgroup.org/HDF5/doc/UG/UG_frame10Datasets.html
[2] http://www.hdfgroup.org/pubs/papers/2008-06_netcdf4_perf_report.pdf

Let me also quote what Mike Folk (from The HDF Group) wrote during the 
discussion you already read (that discussion took place on the PyTables 
list and the hdf-forum simultaneously).  I think it can help clarify 
the concepts quite a lot:

"""
On Thursday 06 December 2007, Mike Folk wrote:

Francesc et al:
Just to elaborate a little bit on Quincey's "slicing the wrong way" 
explanation.  (I hope I'm not just confusing matters.)

If possible, you want to design the shape of the chunk so that you get 
the most useful data with the fewest number of accesses.  If accesses 
are mostly contiguous elements along a certain dimension, you shape the 
chunk to contain the most elements along that dimension.  If accesses 
are random shapes and sizes, then it gets a little tricky -- we just 
generally recommend a square (cube, etc.), but that may not be as good 
as, say, a shape that has the same proportions as your dataset.

So, for instance, if your dataset is 3,000x6,000 (3,000 rows, 6,000 
columns) and you always access a single column, then each chunk should 
contain as much of a column as possible, given your best chunk size.  
If we assume a good chunk size is 600 elements, then your chunks would 
all be 600x1, and accessing any column in its entirety would take 5 
accesses.  Having each chunk be part of a row (1x600) would give you 
the worst performance in this case, since you'd need to access 3,000 
chunks to read a column.

If accesses are unpredictable, perhaps a chunk size of 30x60 would be 
best, as your worst-case performance (for reading a single column or 
row) would take 100 accesses.  (By worst case, I'm thinking of the case 
where you have to do the most accesses per useful data element.)

Other cases, such as when you don't care about performance when slicing 
one way but really do when slicing another, would call for a chunk 
shaped accordingly.

Mike
"""

Finally, let me note that during the discussion it became apparent 
that a multidimensional atom for EArray/CArray would be useful in some 
situations.  I'm happy to say that this will be supported in the 
forthcoming PyTables 2.1 (see http://www.pytables.org/trac/ticket/133 
for details).
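
In case you are curious, here is a rough sketch of how that might look 
(my guess, based on the shape parameter that Atom already accepts; the 
final 2.1 API may differ):

import numpy as np
import tables

f = tables.openFile("mdatom.h5", mode="w")
# Each element of the EArray is itself a 2x2 float array.
atom = tables.Float64Atom(shape=(2, 2))
ea = f.createEArray(f.root, "stack", atom, shape=(0,))
ea.append(np.zeros((3, 2, 2)))   # append three 2x2 atoms at once
f.close()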

Hope this helps,

-- 
Francesc Alted
Freelance Developer & Consultant
