Re: [Pytables-users] Writing to a dataset with 'wrong' chunksize

David Worrall Wed, 12 Dec 2007 09:05:13 -0800

One can never imagine all the possible uses to which others put one's
SW!
Will you let us know when you make the addition to the User's Guide,  
Francesc?


David

On 12/12/2007, at 9:13 PM, Francesc Altet wrote:

> Hi Mike and others,
>
> Sorry for the delay in answering, but I was traveling last week.
>
> Thanks for your explanation.  I understand what you both are saying,
> and this reveals how important it is to choose a correct chunkshape
> when you want to get decent performance in HDF5 I/O.
>
> PyTables initially tried to hide such 'low level' details from the
> user, but after realising how important this is, we introduced
> the 'chunkshape' parameter in dataset constructors in the 2.0 series.
> While PyTables still tries hard to spare users from thinking about
> chunkshape issues and automatically computes 'optimal' chunksizes, the
> fact is that this only works well when the user wants to access their
> data in the so-called 'C-order' (i.e. data is arranged in rows, not
> columns).
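
For illustration, here is a minimal sketch of passing 'chunkshape' explicitly when creating a chunked array. It assumes PyTables 3.x and NumPy are installed; open_file and create_carray are the 3.x spellings (the 2.x series discussed in this thread used openFile and createCArray). The file name, node name, and array dimensions are made up for the example.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical file and node names, chosen only for this example.
path = os.path.join(tempfile.mkdtemp(), "chunk_demo.h5")

with tables.open_file(path, mode="w") as h5file:
    # Tall, one-column chunks favour column-wise access.
    carray = h5file.create_carray(
        h5file.root,
        "data",
        atom=tables.Float64Atom(),
        shape=(3000, 6000),
        chunkshape=(600, 1),
    )
    carray[:, 0] = np.arange(3000, dtype="float64")  # write one column
    stored_chunkshape = carray.chunkshape             # what was actually used
    first_column = carray[:, 0]                       # read it back (NumPy array)
```

If 'chunkshape' is left out, PyTables picks one itself, which is the automatic behaviour described above.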
>
> However, users may have many valid reasons to choose arrangements
> other than the C-order one.  So, in order to cope with this, I'm
> afraid that the only solution will be to add a specific section to the
> PyTables User's Guide that carefully explains this.  Your
> explanations will definitely help to build a better guide on how to
> choose the chunkshape that best fits the needs of the users.
>
> Thanks!
>
> On Thursday 06 December 2007, Mike Folk wrote:
>> Francesc et al:
>> Just to elaborate a little bit on Quincey's
>> "slicing the wrong way" explanation.  (I hope I'm not just confusing
>> matters.)
>>
>> If possible you want to design the shape of the
>> chunk so that you get the most useful data with
>> the fewest number of accesses.  If accesses are
>> mostly contiguous elements along a certain
>> dimension, you shape the chunk to contain the
>> most elements along that dimension.  If accesses
>> are random shapes and sizes, then it gets a
>> little tricky -- we just generally recommend a
>> square (cube, etc.), but that may not be as good
>> as, say, a shape that has the same proportions as your dataset.
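
One way to read "a shape that has the same proportions as your dataset" is sketched below. The helper name and its rounding rule are hypothetical, not something HDF5 or PyTables provides: it simply scales every dimension by the same factor until the chunk holds roughly the requested number of elements.

```python
def proportional_chunkshape(dataset_shape, target_elements):
    """Return a chunk shape with roughly the dataset's proportions.

    Hypothetical helper: every dimension is scaled by the same factor so
    that the chunk holds about `target_elements` elements.
    """
    total = 1
    for n in dataset_shape:
        total *= n
    scale = (target_elements / total) ** (1.0 / len(dataset_shape))
    # Clamp each dimension to the range [1, dataset extent].
    return tuple(min(n, max(1, round(n * scale))) for n in dataset_shape)
```

For example, proportional_chunkshape((3000, 6000), 1800) gives (30, 60).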
>>
>> So, for instance if your dataset is 3,000x6,000
>> (3,000 rows, 6,000 columns) and you always access
>> a single column, then each chunk should contain
>> as much of a column as possible, given your best
>> chunk size.  If we assume a good chunk size is
>> 600 elements, then your chunks would all be
>> 600x1, and accessing any column in its entirety
>> would take 5 accesses (3,000 / 600).  Having each
>> chunk be a part of a row (1x600) would give you the
>> worst performance in this case, since you'd need to
>> access 3,000 chunks to access a column.
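
The access counts for a single-column read can be checked with a few lines of arithmetic. This is only a sketch: chunks_touched is a made-up name, and it counts chunks that intersect a rectangular selection, which is assumed here to equal the number of chunk accesses.

```python
from math import prod

def chunks_touched(selection, chunkshape):
    """Count the chunks that intersect a rectangular selection.

    `selection` holds one half-open (start, stop) range per dimension.
    """
    return prod(
        (stop - 1) // c - start // c + 1
        for (start, stop), c in zip(selection, chunkshape)
    )

# One full column (column 0) of a 3,000-row by 6,000-column dataset.
column = [(0, 3000), (0, 1)]

tall = chunks_touched(column, (600, 1))   # chunks shaped like column pieces
wide = chunks_touched(column, (1, 600))   # chunks shaped like row pieces
print(tall, wide)  # prints: 5 3000
```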
>>
>> If accesses are unpredictable, perhaps a chunk
>> size of 30x60 would be best, as your worst case
>> performance (for reading a single column or row)
>> would take 100 accesses.  (By worst case, I'm
>> thinking of the case where you have to do the
>> most accesses per useful data element.)
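
A rough comparison of the three chunk shapes mentioned so far, taking the worse of "read one full row" and "read one full column" for each. The function name is hypothetical and the count is simply chunks crossed, under the assumption that each chunk costs one access.

```python
from math import ceil

def worst_case_accesses(dataset_shape, chunkshape):
    """Chunks read for one full row or one full column, whichever is more."""
    rows, cols = dataset_shape
    chunk_rows, chunk_cols = chunkshape
    per_column = ceil(rows / chunk_rows)   # chunks crossed going down a column
    per_row = ceil(cols / chunk_cols)      # chunks crossed going along a row
    return max(per_column, per_row)

for shape in [(600, 1), (1, 600), (30, 60)]:
    print(shape, worst_case_accesses((3000, 6000), shape))
# prints:
# (600, 1) 6000
# (1, 600) 3000
# (30, 60) 100
```

Note that a 30x60 chunk holds 1,800 elements rather than 600, so the comparison is about shape as much as size.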
>>
>> Other cases, such as when you don't care about
>> performance when you slice it one way but really
>> do when you slice it another way, would call
>> for a chunk to be shaped accordingly.
>>
>> Mike
>>
>> At 11:01 AM 12/4/2007, Quincey Koziol wrote:
>>> Hi Francesc,
>>>
>>> On Dec 3, 2007, at 11:21 AM, Francesc Altet wrote:
>>>> On Monday 03 December 2007, Francesc Altet wrote:
>>>>> Oops, I ended up with a similar program and sent it to the
>>>>> [EMAIL PROTECTED] list last Saturday.  I'm attaching my own
>>>>> version (which is pretty similar to yours).  Sorry for not sending
>>>>> you a copy of my previous message, because it could have saved you
>>>>> some work :-/
>>>>
>>>> Well, as Ivan pointed out, a couple of glitches slipped into my
>>>> program. I'm attaching the corrected version, but the result is the
>>>> same, i.e. when N=600, I'm getting a segfault under both HDF5
>>>> 1.6.5 and 1.8.0 beta5.
>>>
>>>         I was able to duplicate the segfault with your program, but
>>> it was a stack overflow, and if you move the "data" array out of
>>> main() and make it a global variable, things run to completion
>>> without error. It's still _really_ slow and chews _lots_ of memory
>>> (because you are slicing the dataset the "wrong" way), but
>>> everything seems to be working correctly.
>>>
>>>         It's somewhat hard to fix the "slicing the wrong way"
>>> problem, because the library is building a list of all the chunks
>>> that will be affected by each I/O operation (so that we can do all
>>> the I/O on each chunk at once) and that has some memory issues when
>>> dealing with I/O operations that affect so many chunks at once
>>> right now.  Building a list of all the affected chunks is good for
>>> the parallel I/O case, but could be avoided in the serial I/O case,
>>> I think.  However, that would probably make the code difficult to
>>> maintain...  :-/
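
To see why such a list can get large, here is a toy enumeration of the chunks a selection overlaps. It is an illustration only and not how the HDF5 library implements it; the 600 x 600 dataset and the (1, 600) chunk shape are assumptions made up for the example.

```python
from itertools import product

def affected_chunks(selection, chunkshape):
    """List the grid coordinates of every chunk a selection overlaps.

    `selection` holds one half-open (start, stop) range per dimension.
    """
    per_dim = [
        range(start // c, (stop - 1) // c + 1)
        for (start, stop), c in zip(selection, chunkshape)
    ]
    return list(product(*per_dim))

# Hypothetical 600 x 600 dataset stored in (1, 600) row-shaped chunks.
one_row = affected_chunks([(0, 1), (0, 600)], (1, 600))
one_column = affected_chunks([(0, 600), (0, 1)], (1, 600))
print(len(one_row), len(one_column))  # prints: 1 600
```

Slicing against the chunk layout makes the list grow with the length of the slice, while slicing along it keeps the list short.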
>>>
>>>         You could try adjusting the chunk cache size larger, which
>>> would probably help, if you make it large enough to hold all the
>>> chunks for the dataset.
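
A back-of-the-envelope estimate of how large "large enough to hold all the chunks" would be. The helper name is made up, and the dataset shape, chunk shape, and 8-byte element size are assumptions for illustration; the actual cache size is configured in bytes through the HDF5 library (for instance with H5Pset_cache in the C API).

```python
from math import ceil, prod

def cache_bytes_for_dataset(dataset_shape, chunkshape, itemsize):
    """Bytes needed to keep every chunk of a dataset in the chunk cache.

    Every chunk is counted at its full size, including partial edge chunks.
    """
    n_chunks = prod(ceil(n / c) for n, c in zip(dataset_shape, chunkshape))
    return n_chunks * prod(chunkshape) * itemsize

needed = cache_bytes_for_dataset((3000, 6000), (1, 600), itemsize=8)
print(needed)  # prints: 144000000
```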
>>>
>>>         Quincey
>>>
>>>
>>> ----------------------------------------------------------------------
>>> This mailing list is for HDF software users discussion.
>>> To subscribe to this list, send a message to
>>> [EMAIL PROTECTED] To unsubscribe, send a message to
>>> [EMAIL PROTECTED]
>>
>> --
>> Mike Folk   The HDF Group    http://hdfgroup.org     217.244.0647
>> 1901 So. First St., Suite C-2, Champaign IL 61820
>>
>>
>
>
>
> -- 
> >0,0<   Francesc Altet     http://www.carabos.com/
> V   V   Cárabos Coop. V.   Enjoy Data
>  "-"
>
> -------------------------------------------------------------------------
> SF.Net email is sponsored by:
> Check out the new SourceForge.net Marketplace.
> It's the best place to buy or sell services for
> just about anything Open Source.
> http://sourceforge.net/services/buy/index.php
> _______________________________________________
> Pytables-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/pytables-users
>

