Hi Jens,
On Oct 14, 2010, at 9:43 AM, Jens Thoms Toerring wrote:
> Hi Quincey,
>
> On Tue, Oct 12, 2010 at 02:28:41PM -0500, Quincey Koziol wrote:
>> On Oct 12, 2010, at 2:24 PM, Jens Thoms Toerring wrote:
>>>>> Finally, there's another thing perhaps someone can help me
>>>>> with: I tried to create some 120,000 1D data sets, each about
>>>>> 200 bytes in size and each in its own group. This resulted
>>>>> in a huge overhead in the file: instead of the expected file
>>>>> size of around 24 MB (of course plus a bit for overhead) the
>>>>> files were about 10 times larger than expected. Using a number
>>>>> (30) of 2D data sets (with 4000 rows) took care of this, but I
>>>>> am curious why this makes such a big difference.
>>>>
>>>> Did you create them as chunked datasets? And, what were the dimensions
>>>> of the chunks you used?
>>>
>>> No, those were simple 1-dimensional data sets, written out in a
>>> single call immediately after creation and then closed. Perhaps
>>> having them all in their own group makes a difference? What I
>>> noticed was that h5stat on the resulting file told me under
>>> Storage information/Groups that for B-tree/List about 140 MB
>>> were used...
>>
>> This is very weird; can you send a sample program that shows this result?
>
> Here's a stripped-down version of my original program: it now
> just creates 100,000 datasets with 5 doubles, each within its
> own group. The amount of "real" data, including the strings for
> the group and dataset names, should be about 5 MB, but the file I
> get with HDF5, version 1.8.5, is nearly 144 MB. I expect a
> certain amount of overhead, of course, but that ratio was a
> bit astonishing ;-)
>
> If I leave out the creation of the datasets (i.e. just create
> 100,000 groups) the size of the file drops to about 80 MB,
> so creating a single group seems to "cost" about 800 bytes.
About what I'd expect.
> Creating just 100,000 datasets (without groups) seems to be
> less expensive; here the overhead seems to be on the order
> of 350 bytes per dataset. Does that seem reasonable to you?
That sounds approximately correct also.
Adding those two numbers together gives me ~115 MB. Plus 100,000 * 5 *
8 bytes (for the raw data) brings things up to ~120 MB. So there's
approximately 24 MB "missing" from the equation somewhere. (Dark metadata! :-)
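Spelling that accounting out:

    100,000 groups   x ~800 bytes each  ~=  80 MB
    100,000 datasets x ~350 bytes each  ~=  35 MB
    100,000 x 5 doubles x 8 bytes       ~=   4 MB
                                          --------
                                          ~119 MB   (vs. the ~144 MB file)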
Pointing h5stat with the "-f -F -g -G -d -D -T -A -s -S" options at the
file produced shows only 16,488 bytes of unaccounted-for space, so not very
much space has been wasted on internal free-space fragmentation. There are
27,200,000 bytes of space used for dataset object headers, in the same
ballpark as the ~350 bytes per dataset you mention, so that's OK. There are
95,291,840 bytes of B-tree information and 13,441,824 bytes of heap
information for the groups (~1087 bytes per group), which is well above the
~800 bytes per group that you mention and accounts for the missing space in
the file.
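(For reference, that's an invocation along the lines of:

    h5stat -f -F -g -G -d -D -T -A -s -S test.h5

with test.h5 being the file your program produces.)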
Changing your HDF5Writer constructor to this:

    HDF5Writer( H5std_string const & fileName )
    {
        // Ask the library to use the latest version of the file format:
        hid_t fapl = H5Pcreate( H5P_FILE_ACCESS );
        H5Pset_libver_bounds( fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST );
        FileAccPropList FileAccessPList( fapl );

        m_file = new H5File( fileName, H5F_ACC_TRUNC,
                             FileCreatPropList::DEFAULT, FileAccessPList );
        m_group = new Group( m_file->openGroup( "/" ) );
    }
(which enables the "latest/latest" option via H5Pset_libver_bounds) gives
a file that is only 50 MB, with 41,543 bytes of unaccounted-for space and
only ~177 bytes of metadata per group (although a bit more for the dataset
objects, at ~284 bytes each, curiously). That's probably a good option for
you here, and you could tweak it down further, if you wanted, with the
H5Pset_link_phase_change and H5Pset_est_link_info calls. The one drawback
of using this option is that the files created will only be readable by
the 1.8.x releases of the library.
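In case it helps, here's a rough, untested sketch of those two calls. I'm
using the C API here (I don't believe the 1.8 C++ wrappers expose group
creation property lists), and "file_id" stands for your file's hid_t, e.g.
from m_file->getId():

    /* Each group in your program holds exactly one link with a one-
       character name ("d"), so size the group metadata for that instead
       of the more generous defaults: */
    hid_t gcpl = H5Pcreate( H5P_GROUP_CREATE );
    H5Pset_link_phase_change( gcpl, 1, 1 ); /* stay compact up to 1 link */
    H5Pset_est_link_info( gcpl, 1, 1 );     /* expect ~1 link, ~1-char name */

    hid_t gid = H5Gcreate2( file_id, "g0", H5P_DEFAULT, gcpl, H5P_DEFAULT );
    /* ... create the "d" dataset in the group, as before ... */
    H5Gclose( gid );
    H5Pclose( gcpl );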
Quincey
> Best regards, Jens
>
> ------------- h5_test.cpp ----------------------------------------
>
> #include <iostream>
> #include <sstream>
> #include <stack>
> #include <vector>
> #include <string>
> #include "H5Cpp.h"
>
> using namespace std;
> using namespace H5;
>
> class HDF5Writer {
>
> public:
>
> HDF5Writer( H5std_string const & fileName )
> {
> m_file = new H5File( fileName, H5F_ACC_TRUNC );
> m_group = new Group( m_file->openGroup( "/" ) );
> }
>
> ~HDF5Writer( )
> {
> while ( ! m_group_stack.empty( ) )
> closeGroup( );
> m_group->close( );
> delete m_group;
> m_file->close( );
> delete m_file;
> }
>
> void createGroup( H5std_string const & name )
> {
> m_group_stack.push( m_group );
> m_group = new Group( m_group->createGroup( name ) );
> }
>
> void closeGroup( )
> {
> m_group->close( );
> delete m_group;
> m_group = m_group_stack.top( );
> m_group_stack.pop( );
> }
>
> void writeVector( H5std_string const & name,
> vector< double > const & data )
> {
> hsize_t dim[ ] = { data.size( ) };
> DataSpace dataspace( 1, dim );
> DataSet dataset( m_group->createDataSet( name, PredType::IEEE_F64LE,
> dataspace ) );
> dataset.write( &data.front( ), PredType::NATIVE_DOUBLE );
> dataset.close( );
> dataspace.close( );
> }
>
> private:
>
> H5File * m_file;
> Group * m_group;
> stack< Group * > m_group_stack;
> };
>
> int main( )
> {
> HDF5Writer w( "test.h5" );
> vector< double > arr( 5, 0 );
>
> for ( size_t i = 0; i < 100000; i++ )
> {
> ostringstream cname;
> cname << "g" << i;
> w.createGroup( cname.str( ) );
> w.writeVector( "d", arr );
> w.closeGroup( );
> }
> }
>
> --
> \ Jens Thoms Toerring ________ [email protected]
> \_______________________________ http://toerring.de