Hi Mike and others,

Sorry for the delay in answering, but I was traveling this past week.

Thanks for your explanation.  I understand what you both are saying, and 
this reveals how important choosing the right chunkshape is when you 
want to get decent performance out of HDF5 I/O.

PyTables initially tried to hide such 'low level' details from the user, 
but after realising how important they are, we introduced 
the 'chunkshape' parameter in dataset constructors in the 2.0 series.  
While PyTables still tries hard to spare users from thinking about 
chunkshape issues by automatically computing 'optimal' chunksizes, the 
fact is that this only works well when the user wants to access their 
data in the so-called 'C-order' (i.e. data arranged in rows, not 
columns).
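
For instance, here is a minimal sketch of how an explicit chunkshape can 
be passed to the 2.x createCArray() constructor (the file name, array 
name and shapes below are just made-up examples):

    import tables

    f = tables.openFile('test.h5', 'w')
    atom = tables.Float64Atom()
    # Ask for column-oriented chunks instead of the automatic
    # (row-oriented) guess; (600, 1) keeps long pieces of each column
    # together on disk.
    ca = f.createCArray(f.root, 'data', atom, (3000, 6000),
                        chunkshape=(600, 1))
    print(ca.chunkshape)    # -> (600, 1)
    col = ca[:, 0]          # a full column only touches 5 chunks
    f.close()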

However, users may have many valid reasons to choose an arrangement 
other than C-order.  So, in order to cope with this, I'm afraid that 
the only solution will be to add a specific section to the PyTables 
User's Guide that carefully explains this.  Your explanations will 
definitely help us build a better guide on how to choose the 
chunkshape that best fits the users' needs.
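
Just to make the arithmetic explicit (following Mike's 3,000x6,000 
example below), here is a tiny helper; the function is only an 
illustration of the counting involved, not anything in PyTables:

    import math

    def chunks_touched(chunkshape, sel_rows, sel_cols):
        # How many chunks a contiguous block of sel_rows x sel_cols
        # elements intersects (assuming it starts on a chunk boundary).
        return int(math.ceil(sel_rows / float(chunkshape[0])) *
                   math.ceil(sel_cols / float(chunkshape[1])))

    # One full 3,000-element column under different chunk layouts:
    print(chunks_touched((600, 1), 3000, 1))    # 5 chunks
    print(chunks_touched((1, 600), 3000, 1))    # 3000 chunks
    # Worst case (a full row or column) with proportional 30x60 chunks:
    print(chunks_touched((30, 60), 3000, 1))    # 100 chunks
    print(chunks_touched((30, 60), 1, 6000))    # 100 chunks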

Thanks!

On Thursday 06 December 2007, Mike Folk wrote:
> Francesc et al:
> Just to elaborate a little bit on Quincey's
> "slicing the wrong way" explanation.  (I hope I'm not just confusing
> matters.)
>
> If possible you want to design the shape of the
> chunk so that you get the most useful data with
> the fewest accesses.  If accesses are
> mostly contiguous elements along a certain
> dimension, you shape the chunk to contain the
> most elements along that dimension.  If accesses
> are random shapes and sizes, then it gets a
> little tricky -- we just generally recommend a
> square (cube, etc.), but that may not be as good
> as, say, a shape that has the same proportions as your dataset.
>
> So, for instance, if your dataset is 3,000x6,000
> (3,000 rows, 6,000 columns) and you always access
> a single column, then each chunk should contain
> as much of a column as possible, given your best
> chunk size.  If we assume a good chunk size is
> 600 elements, then your chunks would all be
> 600x1, and accessing any column in its entirety
> would take 5 accesses.  Having each chunk be a
> part of a row (1x600) would give you the worst
> performance in this case, since you'd need to
> access 3,000 chunks to access a column.
>
> If accesses are unpredictable, perhaps a chunk
> size of 30x60 would be best, as your worst case
> performance (for reading a single column or row)
> would take 100 accesses.  (By worst case, I'm
> thinking of the case where you have to do the
> most accesses per useful data element.)
>
> Other cases, such as when you don't care about
> performance when slicing one way but really do
> when slicing another, would call for the chunk
> to be shaped accordingly.
>
> Mike
>
> At 11:01 AM 12/4/2007, Quincey Koziol wrote:
> >Hi Francesc,
> >
> >On Dec 3, 2007, at 11:21 AM, Francesc Altet wrote:
> >>On Monday 03 December 2007, Francesc Altet wrote:
> >>>Oops, I ended up with a similar program and sent it to the
> >>>[EMAIL PROTECTED] list last Saturday.  I'm attaching my own
> >>>version (which is pretty similar to yours).  Sorry for not sending
> >>>you a copy of my previous message, as it could have saved you some
> >>>work :-/
> >>
> >>Well, as Ivan pointed out, a couple of glitches slipped into my
> >> program.  I'm attaching the corrected version, but the result is the
> >> same, i.e. when N=600 I'm getting a segfault under both HDF5
> >> 1.6.5 and 1.8.0 beta5.
> >
> >         I was able to duplicate the segfault
> > with your program, but it was a
> >stack overflow; if you move the "data" array out of main() and
> >make it a global variable, things run to completion without error.
> >It's still _really_ slow and chews up _lots_ of memory (because you are
> >slicing the dataset the "wrong" way), but everything seems to be
> >working correctly.
> >
> >         It's somewhat hard to fix the "slicing the wrong way"
> > problem, because the library builds a list of all the chunks
> > that will be affected by each I/O operation (so that we can do all
> > the I/O on each chunk at once), and right now that has some memory
> > issues when dealing with I/O operations that affect so many chunks
> > at once.  Building a list of all the affected chunks is good for
> > the parallel I/O case, but could be avoided in the serial I/O case,
> > I think.  However, that would probably make the code difficult to
> > maintain...  :-/
> >
> >         You could try making the chunk cache larger, which
> > would probably help if you make it large enough to hold all the
> > chunks for the dataset.
> >
> >         Quincey
> >
> >
>
> --
> Mike Folk   The HDF Group    http://hdfgroup.org     217.244.0647
> 1901 So. First St., Suite C-2, Champaign IL 61820
>
>



-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"
