On Monday 04 February 2008, James Philbin wrote:
> Hi,
>
> > Are you experiencing some catastrophic resource
> > consumption with 4096?
>
> OK, to try and roughly characterize the performance for large numbers
> of datasets, I ran the following stress test in C:
>
> --- hdf5_stress_test.c ---
> #include <stdio.h>
> #include <stdlib.h>
>
> #include "H5LT.h"
>
> int
> main(void)
> {
>     hid_t file_id;
>     hsize_t dims[2];
>     int data[256];
>     char dset_name[32];
>     herr_t status;
>     int i;
>     int total = 1000000;
>
>     dims[0] = 16;
>     dims[1] = 16;
>
>     file_id = H5Fcreate("hdf5_stress_test.h5", H5F_ACC_TRUNC,
>                         H5P_DEFAULT, H5P_DEFAULT);
>
>     for (i = 0; i < total; ++i) {
>         sprintf(dset_name, "/dset_%07d", i);
>         status = H5LTmake_dataset(file_id, dset_name, 2, dims,
>                                   H5T_NATIVE_INT, data);
>         if (!(i % 1000))
>             printf("\r[%07d/%07d]", i, total);
>         fflush(stdout);
>     }
>     status = H5Fclose(file_id);
>
>     return 0;
> }
> --- ---
>
> This seemed to run OK, taking ~4m30s and not using more than 130 MB of
> RAM. It created a file of size 1.4 GB, which seems to be quite a bit of
> overhead (it should be (256*4*10^6)/(2**20) = 976 MB), but there may be
> extra metadata stored which stays constant per dataset.
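As a rough back-of-the-envelope check of that constant per-dataset cost
(a sketch with round numbers; the exact overhead depends on the HDF5
version and its internal B-tree and heap settings):

#### per-dataset overhead, rough estimate ####
# Pure data: 10**6 datasets of 16*16 4-byte ints.
pure_bytes = 10**6 * 16 * 16 * 4       # 1024000000 bytes, ~976 MB
file_bytes = 1.4 * 2**30               # the observed ~1.4 GB file
overhead = (file_bytes - pure_bytes) / 10**6
print overhead                         # ~479 bytes of metadata per dataset
##############################################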
Yes. This should be the metadata (1 million datasets have to take some
space).

> However, h5dump just sits there when I run it and does nothing (have
> waited ~10mins). These things seem to suggest that hdf5 might not be
> the best choice for my needs.

Well, it depends. You should know that HDF5 allows direct access to data
in the middle of datasets. So, your best bet here is to stack your small
datasets into bigger ones. Then you only need to access the part of the
dataset that you want. Here is a quick example:

#### stacking datasets ####
import sys
import numpy
import tables

f = tables.openFile("stacking-data.h5", "w")
element_shape = (16, 16)
NG = 1000        # number of stacked datasets
NELEMS = 1000    # 16x16 elements per stacked dataset
data = numpy.empty(element_shape, dtype="int")
for i in xrange(NG):
    earray = f.createEArray(f.root, "dset_%04d" % i, tables.IntAtom(),
                            (0, element_shape[0], element_shape[1]),
                            expectedrows=NELEMS)
    for j in xrange(NELEMS):
        earray.append([data])
    print "\r[%04d/%04d]" % (i, NG),
    sys.stdout.flush()
f.close()
#########################

With this, the file takes just 996 MB (i.e. only 20 MB more than the
'pure' data) because we are using just 1000 datasets instead of 1
million.

For retrieving the datasets, just use the EArray.__getitem__() method.
See an example:

#### Retrieving stacked datasets ####
import sys
import numpy
import tables

f = tables.openFile("stacking-data.h5", "r")
i = 0
for earray in f.walkNodes(where="/", classname="EArray"):
    for j in xrange(earray.nrows):
        element_j = earray[j]    # read one 16x16 element
    print "\r[%04d/%04d]" % (i, 1000),
    sys.stdout.flush()
    i += 1
f.close()
#####################################

This takes approximately 4 minutes to execute on my machine, and h5dump,
h5ls, ptdump and other utilities run just fine on the "stacking-data.h5"
file.

I hope you get the point,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
 V V    Cárabos Coop. V.   Enjoy Data
 "-"
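To make the "direct access to data in the middle of datasets" point
concrete, a minimal follow-up sketch; it assumes the stacking-data.h5
file built by the script above, and the dataset name and index here are
arbitrary:

#### reading a single element directly (sketch) ####
import tables

f = tables.openFile("stacking-data.h5", "r")
# Fetch only element 137 of the stacked dataset /dset_0042 (one 16x16
# block); HDF5 reads just that slice instead of the whole dataset.
element = f.root.dset_0042[137]
print element.shape    # (16, 16)
f.close()
#####################################################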