Hi Dieter,

On Sunday, 05 Feb 2006 at 20:19 -0500, dHering wrote:
> Well, I realized that I am supposed to use H5S_ALL instead of
> H5Dget_space(), so I'm OK now reading the datasets. I would still like
> to clear up the chunking confusion I have, though, and to know if you
> are aware of any user discussion groups.

Glad that you are finally reading your PyTables files directly from C.
Regarding the other questions, see later.

> On 2/4/06, dHering <[EMAIL PROTECTED]> wrote:
>         Besides this, I have a couple general questions regarding
>         hdf5: 
>         
>         It seems to me that selecting a chunk size when writing an
>         hdf5 dataset is arbitrary. Is there a rule for selecting a
>         size for chunking? Since the struct is an encapsulated
>         element, should my chunk size be "1" ? Or maybe the number of
>         fields in my row (struct)? Or multiple rows? I don't
>         understand the relevance. 

I don't quite understand why you need to select the chunk size when
writing PyTables files, as it is automatically selected by PyTables
based on a hint about the total size of the dataset that can be provided
by the user (see the `expectedrows` parameter in the factory methods;
there is a small usage sketch after the excerpt below). However, if you
need to write the files from C, you may want to have a look at the code
that chooses the chunksizes for Table objects, namely
`_calcBufferSize()` in 'tables/utils.py'. From the comments in that
code:

    # Rationale: HDF5 takes the data in bunches of chunksize length
    # to write them on disk. A B-tree in memory is used to map structures
    # on disk. The more chunks that are allocated for a dataset, the
    # larger the B-tree. Large B-trees take memory and cause file
    # storage overhead as well as more disk I/O and higher contention
    # for the metadata cache.
    # You have to balance between memory and I/O overhead
    # (small B-trees) and time to access the data (big B-trees).
    #
    # The tuning of the chunksize & buffersize parameters affects the
    # performance and the memory consumed. This is based on
    # experiments on an Intel architecture and, as always, your mileage
    # may vary.
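
For the record, here is roughly how that hint is passed from Python; the
column layout, file name and row count below are made up just for
illustration:

    # Minimal sketch: let PyTables choose the chunksize itself from the
    # `expectedrows` hint.  The table layout here is only illustrative.
    import tables

    class Record(tables.IsDescription):
        name  = tables.StringCol(16)   # 16-character string field
        value = tables.FloatCol()      # double-precision field

    h5file = tables.openFile("sample.h5", mode="w")
    table = h5file.createTable(h5file.root, "records", Record,
                               "Sample table",
                               expectedrows=2000000)  # hint: ~2 million rows
    # ... fill the table through table.row / table.append() as usual ...
    h5file.close()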

So, in general, the values chosen by PyTables are based on experiments
I've made carefully and are mostly appropriate for general use. However,
you may have your own requirements, so feel free to experiment to find
which values are best for you. More info about how to choose a good
chunk size can be found at http://hdf.ncsa.uiuc.edu/HDF5/doc/Chunking.html.
If you end up finding better values than those above, please share your
experience.
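
One very rough way to think about it (just an illustrative rule of thumb,
not the actual algorithm in `_calcBufferSize()`; the helper name and the
32 KB target below are placeholders of mine) is to aim for chunks of a few
tens of KB and derive the number of rows per chunk from your row size.
Keep in mind that a Table is a 1-D dataset of compound elements, so the
chunksize is counted in complete rows, not in individual fields:

    # Illustrative rule of thumb only -- NOT the real _calcBufferSize()
    # logic.  Fewer chunks keep the B-tree small; smaller chunks keep a
    # single read cheap, so we aim somewhere in between.
    def guess_chunksize(rowsize, target_chunk_bytes=32*1024):
        """Suggest a chunk length, in rows, for rows of `rowsize` bytes."""
        return max(1, target_chunk_bytes // rowsize)

    print guess_chunksize(64)   # 64-byte rows -> 512 rows per chunk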

>         
>         Lastly, there only seems to be the HDF5 help desk
>         ([EMAIL PROTECTED]). I wonder why there is no mailing list or
>         forum for researching and discussing HDF5 particularities. Are
>         you aware of any?

Well, hdfhelp is a good place (and the recommended one) to get
information about HDF5 issues.

Cheers,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"



