On Monday 04 February 2008, James Philbin wrote:
> Hi,
>
> > Are you experiencing some catastrophic resource
> > consumption with 4096?
>
> OK, to try and roughly characterize the performance for large numbers
> of datasets, I ran the following stress test in c:
>
> --- hdf5_stress_test.c ---
> #include <stdio.h>
> #include <stdlib.h>
>
> #include "H5LT.h"
>
> int
> main(void)
> {
>   hid_t file_id;
>   hsize_t dims[2];
>   int data[256];
>   char dset_name[32];
>   herr_t status;
>   int i;
>   int total = 1000000;
>
>   dims[0] = 16;
>   dims[1] = 16;
>
>   file_id = H5Fcreate("hdf5_stress_test.h5", H5F_ACC_TRUNC,
>                       H5P_DEFAULT, H5P_DEFAULT);
>
>   for (i=0; i<total; ++i) {
>     sprintf(dset_name, "/dset_%07d", i);
>     status = H5LTmake_dataset(file_id, dset_name, 2, dims,
>                               H5T_NATIVE_INT, data);
>     if (!(i%1000))
>       printf("\r[%07d/%07d]", i, total);
>     fflush(stdout);
>   }
>   status = H5Fclose(file_id);
>
>   return 0;
> }
> --- ---
> This seemed to run OK, taking ~4m30s and not using more than 130MB of
> RAM. It created a file of 1.4GB, which seems to be quite a bit of
> overhead (the raw data should be (256*4*10^6)/(2**20) = 976MB), but
> there may be extra metadata stored which stays constant per dataset.

Yes.  That would be the metadata (1 million datasets have to take up 
some space).
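
A quick back-of-the-envelope check (a rough sketch; it assumes the 
reported 1.4GB means about 1.4*2**30 bytes) puts that overhead at a few 
hundred bytes per dataset:

#### estimating per-dataset overhead ####
# One million datasets of 256 4-byte ints each.
raw_bytes = 10**6 * 256 * 4
# Observed file size from the stress test above (approximate).
file_bytes = 1.4 * 2**30
overhead_per_dset = (file_bytes - raw_bytes) / 10**6
print overhead_per_dset   # roughly 480 bytes per dataset with these numbers
#########################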

> However, h5dump just sits there when I run it and does nothing (I have
> waited ~10 mins). These things seem to suggest that HDF5 might not be
> the best choice for my needs.

Well, it depends.  You should know that HDF5 allows direct access to 
data in the middle of datasets.  So, your best bet here is to stack 
your small datasets into bigger ones and then access only the part of 
the dataset that you want.  Here is a quick example:

#### stacking datasets ####
import sys
import numpy
import tables

f = tables.openFile("stacking-data.h5", "w")

element_shape = (16, 16)
NG = 1000        # number of stacks (EArray nodes)
NELEMS = 1000    # number of 16x16 elements per stack

data = numpy.empty(element_shape, dtype="int")
for i in xrange(NG):
    # The leading 0 in the shape marks the extendable dimension.
    earray = f.createEArray(f.root, "dset_%04d" % i,
                            tables.IntAtom(),
                            (0, element_shape[0], element_shape[1]),
                            expectedrows=NELEMS)
    for j in xrange(NELEMS):
        # Append one 16x16 element to the stack.
        earray.append([data])
    print "\r[%04d/%04d]" % (i, NG),
    sys.stdout.flush()

f.close()
#########################
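
By the way, the expectedrows hint to createEArray is there so that 
PyTables can pick a sensible chunk size for each stack; it is worth 
passing whenever you roughly know how many elements you will append.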

With this, the file takes just 996MB (i.e. only 20 MB more than 
the 'pure' data) because we are creating just 1000 datasets instead 
of 1 million.

For retrieving the stacked elements, just use the EArray.__getitem__() 
method (i.e. normal indexing).  See an example:

#### Retrieving stacked datasets ####
import sys
import numpy
import tables

f = tables.openFile("stacking-data.h5", "r")

i = 0
# Walk over every EArray node hanging from the root group.
for earray in f.walkNodes(where="/", classname="EArray"):
    for j in xrange(earray.nrows):
        # Read back one 16x16 element of this stack.
        element_j = earray[j]
    print "\r[%04d/%04d]" % (i, 1000),
    sys.stdout.flush()
    i += 1

f.close()
#####################################

This takes approximately 4 minutes to execute on my machine, and 
h5dump, h5ls, ptdump and other utilities run just fine on 
the "stacking-data.h5" file.
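
And since the whole point of stacking is to read only the part you 
need, here is a minimal sketch of a partial read (the node name 
dset_0000 is the one created by the stacking script above; the indices 
are just an example):

#### Reading a slice of one stacked dataset ####
import tables

f = tables.openFile("stacking-data.h5", "r")

# Read only the central 8x8 block of element 10 in the first stack;
# just that part of the dataset is fetched, not the whole EArray.
block = f.root.dset_0000[10, 4:12, 4:12]
print block.shape   # prints (8, 8)

f.close()
#####################################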

I hope you get the point,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"
