Hi Vladimir,
You almost certainly don't want to set your chunk size to 1, for (at
least) two reasons:
- Each chunk needs a "chunk record" in the chunk index data structure
(currently a B-tree). If you have too many chunks, the size (and I/O) of the
chunk index will dominate that of the actual chunks.
- Also, performing I/O on 1 byte will be about as expensive as on 4KB, on most
machines (and sometimes as expensive as on 64KB, on larger systems), so you aren't
saving anything by performing smaller I/O operations.
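
For instance, a chunk of a few thousand records is usually a much better
starting point. Here's a minimal sketch with the C++ API you mentioned (the
record layout, file and dataset names below are only placeholders for your
compound type):

    #include "H5Cpp.h"
    using namespace H5;

    struct Record { int id; double value; };     // placeholder compound layout

    int main()
    {
        H5File file("table.h5", H5F_ACC_TRUNC);

        // Describe the compound type, member by member.
        CompType rec_type(sizeof(Record));
        rec_type.insertMember("id",    HOFFSET(Record, id),    PredType::NATIVE_INT);
        rec_type.insertMember("value", HOFFSET(Record, value), PredType::NATIVE_DOUBLE);

        // Unlimited 1-D dataset, currently empty.
        hsize_t dims[1]    = {0};
        hsize_t maxdims[1] = {H5S_UNLIMITED};
        DataSpace space(1, dims, maxdims);

        // One chunk = 4096 records instead of 1: far fewer chunk index
        // entries, and each chunk read/write is one reasonably sized I/O.
        hsize_t chunk_dims[1] = {4096};
        DSetCreatPropList cparms;
        cparms.setChunk(1, chunk_dims);

        file.createDataSet("table", rec_type, space, cparms);
        return 0;
    }
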
Quincey
On May 3, 2012, at 6:45 AM, Vladimir Daskalov wrote:
> Hi all,
>
> I'm new to HDF5 and so far I have managed to get the things I wanted from the
> library running.
> I'm using the C++ API and I already have working parts of my target program,
> which uses a compound datatype of
> integers, reals and variable-length types, all of it chosen dynamically. I manage to
> write my data relatively fast with
> a buffer maintained by my application and handed over ready for writing to the
> library.
>
> A little more about the task:
> The data is stored in a one-dimensional table (vector), which consists of
> records of a compound datatype.
> The number of records is unknown, so the maximum size of the dataset is unlimited
> x 1, with chunking enabled.
> The data is received by the application record by record.
>
> Currently my application allocates space and stores the received data, and when
> the application's buffer is full,
> the library's write function is called with a pointer to the buffer
> containing the data (my chunk size is equal to the size of my buffer).
> Then the process starts all over again, until the data runs out.
>
> But here is the thing: I still can't understand how the chunk cache works.
> From everything I have read and seen so far, I thought that when I call the
> write-to-dataset function, the library will
> copy my data into its cache, and when a whole chunk has been written (written from my
> app into HDF5's cache, not to disk by the library)
> or the cache is already full, the chunks in the cache will be written to the
> hard drive to make space for the new incoming chunks.
> And that this writing to the hard drive does not happen at the precise moment
> the H5Dwrite function is called.
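>
> For reference, I enlarge the default chunk cache for the whole file roughly
> like this (simplified sketch, not my exact code; the numbers are only the
> values I tried - 521 chunk slots and a 64 MB cache):
>
>     #include "H5Cpp.h"
>     using namespace H5;
>
>     int main()
>     {
>         // Enlarge the default chunk cache used by every dataset in the file:
>         // first argument (metadata cache elements) is ignored, then 521 chunk
>         // slots, 64 MB of cache, and the default 0.75 preemption policy.
>         FileAccPropList fapl;
>         fapl.setCache(0, 521, 64 * 1024 * 1024, 0.75);
>
>         H5File file("table.h5", H5F_ACC_RDWR, FileCreatPropList::DEFAULT, fapl);
>         return 0;
>     }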
>
> And if I'm correct, if I set the chunk cache size to some big value and set
> the chunk size to 1, all those chunks would be written
> together when the chunk cache is full, or perhaps when I close the dataset?
> (Why do I want to use a chunk size of 1? See below, where I explain
> how I want to read the data.)
>
> Each chunk is read with one I/O operation (one for each chunk), but is each
> chunk also written with one separate I/O operation?
> That would explain why my data is written so slowly when I set the chunk size
> to 1, even with an explicitly larger chunk cache size.
>
> I need to write the data row by row; in other words, I need to write a
> one-dimensional dataset, which consists of
> compound data, line by line, compound item by compound item.
> If I set the chunk size to, for example, 10 records and use H5Dwrite to write
> them into HDF5's cache one by one, is that going to be 1 or 10 I/O operations?
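>
> My per-record write looks roughly like the sketch below (simplified; Record
> and rec_type stand in for my real compound type):
>
>     #include "H5Cpp.h"
>     using namespace H5;
>
>     struct Record { int id; double value; };     // placeholder layout
>
>     // Append a single record at index 'row' to an unlimited 1-D dataset.
>     void append_record(DataSet& dset, const CompType& rec_type,
>                        const Record& rec, hsize_t row)
>     {
>         hsize_t new_size[1] = {row + 1};
>         dset.extend(new_size);                    // grow the unlimited dimension
>
>         DataSpace filespace = dset.getSpace();
>         hsize_t start[1] = {row};
>         hsize_t count[1] = {1};
>         filespace.selectHyperslab(H5S_SELECT_SET, count, start);
>
>         DataSpace memspace(1, count);             // one record in memory
>         dset.write(&rec, rec_type, memspace, filespace);
>     }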
>
> Why I want to use a chunk size of 1:
> Later I have to read the data again record by record, or in other words line by
> line, in random order. That's why
> it would be better for me to set the chunk size to 1 record, so I wouldn't
> have to read more than I need.
> I can't load all the data (all chunks) into RAM, because I would have to allocate
> more than 16GB and I can't afford that.
> I don't want to make the chunk size bigger, because I would have to read a large
> amount of currently unused data from the disk
> and then read it again when I really need it.
> For example (if the chunk size is 5000 records):
> read record: 1 // chunk 1
> read record: other records
> ...
> read record: 2000000 // chunk 400
> read record: 2 // this data is in the same chunk as record 1, which is probably no
> longer in the cache, because by now it has been replaced by other chunks,
> // so I need to read the same chunk again ...
>
> That's why the best choice for me would be to read only the data I need. I'm
> ready to make the chunk size bigger than 1 in order to improve the
> writing performance, but only if there is no other way of doing that. That's why
> I need more information about what happens after I call H5Dwrite,
> how exactly the chunk cache is maintained internally, and under which conditions
> I/O operations are initiated.
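>
> My per-record read looks roughly like this (simplified sketch; Record and
> rec_type again stand in for my real compound type):
>
>     #include "H5Cpp.h"
>     using namespace H5;
>
>     struct Record { int id; double value; };     // placeholder layout
>
>     // Read the single record at index 'row'.
>     void read_record(DataSet& dset, const CompType& rec_type,
>                      Record& rec, hsize_t row)
>     {
>         DataSpace filespace = dset.getSpace();
>         hsize_t start[1] = {row};
>         hsize_t count[1] = {1};
>         filespace.selectHyperslab(H5S_SELECT_SET, count, start);
>
>         DataSpace memspace(1, count);             // one record in memory
>         dset.read(&rec, rec_type, memspace, filespace);
>     }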
>
> I will appreciate all the help i can get.
> --
> V.Daskalov
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org