Hi Vladimir,
You almost certainly don't want to set your chunk size to 1, for (at
least) two reasons:
- Each chunk needs a "chunk record" in the chunk index data structure
(currently a B-tree). If you have too many chunks, the size (and I/O) of the
chunk index will dominate that of the actual chunks.
- Also, performing I/O on 1 byte will be about as expensive as on 4KB, on most
machines (and sometimes as expensive as on 64KB, on larger systems), so you aren't
saving anything by performing smaller I/O operations.
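
For instance, a chunk of a few thousand records is usually a much better
starting point. Here's a minimal sketch with the C++ API you mentioned (the
record layout, file and dataset names below are only placeholders for your
compound type):

    #include "H5Cpp.h"
    using namespace H5;

    struct Record { int id; double value; };     // placeholder compound layout

    int main()
    {
        H5File file("table.h5", H5F_ACC_TRUNC);

        // Describe the compound type, member by member.
        CompType rec_type(sizeof(Record));
        rec_type.insertMember("id",    HOFFSET(Record, id),    PredType::NATIVE_INT);
        rec_type.insertMember("value", HOFFSET(Record, value), PredType::NATIVE_DOUBLE);

        // Unlimited 1-D dataset, currently empty.
        hsize_t dims[1]    = {0};
        hsize_t maxdims[1] = {H5S_UNLIMITED};
        DataSpace space(1, dims, maxdims);

        // One chunk = 4096 records instead of 1: far fewer chunk index
        // entries, and each chunk read/write is one reasonably sized I/O.
        hsize_t chunk_dims[1] = {4096};
        DSetCreatPropList cparms;
        cparms.setChunk(1, chunk_dims);

        file.createDataSet("table", rec_type, space, cparms);
        return 0;
    }
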
Quincey
On May 3, 2012, at 6:45 AM, Vladimir Daskalov wrote:
> Hi all,
>
> I'm new to HDF5 and so far I have managed to get the things I wanted from the
> library running.
> I'm using the C++ API and I already have working parts of my target program,
> which uses a compound datatype of
> integers, reals and variable-length types, all of it chosen dynamically. I manage to
> write my data relatively fast with
> a buffer maintained by my application and handed over ready for writing to the
> library.
>
> A little more about the task:
> The data is stored in a one-dimensional table (vector), which consists of
> records of a compound datatype.
> The number of records is unknown, so the maximum size of the dataset is unlimited
> x 1, with chunking enabled.
> The data is received by the application record by record.
>
> Currently my application allocates space and stores the received data, and when
> the application's buffer is full,
> the library's write function is called with a pointer to the buffer
> containing the data (my chunk size is equal to the size of my buffer).
> Then the process starts all over again, until the data runs out.
>
> But here is the thing: I still can't understand how the chunk cache works.
> From everything I have read and seen so far, I thought that when I call the
> write-to-dataset function, the library will
> copy my data into its cache, and when a whole chunk has been written (written from my
> app into HDF5's cache, not to disk by the library)
> or the cache is already full, the chunks in the cache will be written to the
> hard drive to make space for the new incoming chunks.
> And that this writing to the hard drive does not happen at the precise moment
> the H5Dwrite function is called.
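>
> For reference, I enlarge the default chunk cache for the whole file roughly
> like this (simplified sketch, not my exact code; the numbers are only the
> values I tried - 521 chunk slots and a 64 MB cache):
>
>     #include "H5Cpp.h"
>     using namespace H5;
>
>     int main()
>     {
>         // Enlarge the default chunk cache used by every dataset in the file:
>         // first argument (metadata cache elements) is ignored, then 521 chunk
>         // slots, 64 MB of cache, and the default 0.75 preemption policy.
>         FileAccPropList fapl;
>         fapl.setCache(0, 521, 64 * 1024 * 1024, 0.75);
>
>         H5File file("table.h5", H5F_ACC_RDWR, FileCreatPropList::DEFAULT, fapl);
>         return 0;
>     }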
>
> And if I'm correct, if I set the chunk cache size to some big value and set
> the chunk size to 1, all those chunks would be written
> together when the chunk cache is full, or perhaps when I close the dataset?
> (Why do I want to use a chunk size of 1? See below, where I explain
> how I want to read the data.)
>
> Each chunk is read with one I/O operation (one for each chunk), but is each
> chunk also written with one separate I/O operation?
> That would explain why my data is written so slowly when I set the chunk size
> to 1, even with an explicitly larger chunk cache size.
>
> I need to write the data row by row; in other words, I need to write a
> one-dimensional dataset, which consists of
> compound data, line by line, compound item by compound item.
> If I set the chunk size to, for example, 10 records and use H5Dwrite to write
> them into HDF5's cache one by one, is that going to be 1 or 10 I/O operations?
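>
> My per-record write looks roughly like the sketch below (simplified; Record
> and rec_type stand in for my real compound type):
>
>     #include "H5Cpp.h"
>     using namespace H5;
>
>     struct Record { int id; double value; };     // placeholder layout
>
>     // Append a single record at index 'row' to an unlimited 1-D dataset.
>     void append_record(DataSet& dset, const CompType& rec_type,
>                        const Record& rec, hsize_t row)
>     {
>         hsize_t new_size[1] = {row + 1};
>         dset.extend(new_size);                    // grow the unlimited dimension
>
>         DataSpace filespace = dset.getSpace();
>         hsize_t start[1] = {row};
>         hsize_t count[1] = {1};
>         filespace.selectHyperslab(H5S_SELECT_SET, count, start);
>
>         DataSpace memspace(1, count);             // one record in memory
>         dset.write(&rec, rec_type, memspace, filespace);
>     }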
>
> Why I want to use a chunk size of 1:
> Later I have to read the data again record by record, or in other words line by
> line, in random order. That's why
> it would be better for me to set the chunk size to 1 record, so I wouldn't
> have to read more than I need.
> I can't load all the data (all chunks) into RAM, because I would have to allocate
> more than 16GB and I can't afford that.
> I don't want to make the chunk size bigger, because I would have to read a large
> amount of currently unused data from the disk
> and then read it again when I really need it.
> For example (if the chunk size is 5000 records):
> read record: 1 // chunk 1
> read record: other records
> ...
> read record: 2000000 // chunk 400
> read record: 2 // this data is in the same chunk as record 1, which is probably no
> longer in the cache, because by now it has been replaced by other chunks,
> // so I need to read the same chunk again ...
>
> That's why the best choice for me would be to read only the data I need. I'm
> ready to make the chunk size bigger than 1 in order to improve the
> writing performance, but only if there is no other way of doing that. That's why
> I need more information about what happens after I call H5Dwrite,
> how exactly the chunk cache is maintained internally, and under which conditions
> I/O operations are initiated.
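>
> My per-record read looks roughly like this (simplified sketch; Record and
> rec_type again stand in for my real compound type):
>
>     #include "H5Cpp.h"
>     using namespace H5;
>
>     struct Record { int id; double value; };     // placeholder layout
>
>     // Read the single record at index 'row'.
>     void read_record(DataSet& dset, const CompType& rec_type,
>                      Record& rec, hsize_t row)
>     {
>         DataSpace filespace = dset.getSpace();
>         hsize_t start[1] = {row};
>         hsize_t count[1] = {1};
>         filespace.selectHyperslab(H5S_SELECT_SET, count, start);
>
>         DataSpace memspace(1, count);             // one record in memory
>         dset.read(&rec, rec_type, memspace, filespace);
>     }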
>
> I will appreciate all the help i can get.
> --
> V.Daskalov
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org