On Saturday 12 December 2009 18:45:33 Ernesto wrote:
> Dear Francesc,
>
> thank you for your reply. I'll try to better explain my problem using
> real examples of data and code.
[clip]
I've been doing some benchmarks based on your requirements, and my conclusion is that the implementation of variable-length types in HDF5 is not very efficient, especially with the HDF5 1.8.x series (see [1]). So you should avoid using VLArrays for saving small arrays: they fit better in table fields.

With this in mind, a possible solution is to distinguish between small and large strings (for this case). Small strings can be saved in a Table field, while larger ones are written to a VLArray. You will then have to add another field to the table specifying where the data lives (for example, -1 could mean "in this table" and any non-negative value "the index in the VLArray"). You may want to experiment in order to find the optimal threshold separating 'small' strings from 'large' ones, but anything between 128 and 1024 should work fine.

I'm attaching the script that I've been using for my own benchmarking. Notice that if your optimal break-point (threshold) is too large (say, > 10000 bytes), then this partition is not going to work well, but chances are that your scenario fits here easily. If not, one can think of a finer partition, but let's start with this one.

[1] http://mail.hdfgroup.org/pipermail/hdf-forum_hdfgroup.org/2009-December/002298.html

Cheers,

--
Francesc Alted
import tables as t
import numpy as np

LEN_INPUT = int(1e6)
#BREAK_POINT = 1024
BREAK_POINT = 256

map_ = {0: 'A', 1: 'C', 2: 'G', 3: 'T'}


def create_input(len):
    return "".join([map_[i] for i in np.random.random_integers(0, 3, size=len)])


def get_short_string(len):
    a = np.random.standard_exponential(len)
    b = np.array(a*100, dtype='i4')
    for l in b:
        yield "".join([map_[i] for i in np.random.random_integers(0, 3, size=l)])


def create_file(fname, verbose):

    class NucSeq(t.IsDescription):
        id = t.Int32Col(pos=1)        # integer
        where = t.Int32Col(pos=2)
        gnuc = t.StringCol(1, pos=3)  # 1-character String
        sstring = t.StringCol(BREAK_POINT-1, pos=4)

    fileh = t.openFile(fname, mode="w")
    root = fileh.root
    group = fileh.createGroup(root, "newgroup")
    tableNuc = fileh.createTable(group, 'tableNuc', NucSeq, "tableNuc",
                                 t.Filters(1, complib='lzo'),
                                 expectedrows=LEN_INPUT)
    nucseq = tableNuc.row
    vlarray = fileh.createVLArray(root, 'vlarray', t.VLStringAtom(),
                                  "vlarray test")
    gen_sstring = get_short_string(LEN_INPUT)
    for x, j in enumerate(create_input(LEN_INPUT)):
        sstring = gen_sstring.next()
        nucseq['id'] = x
        nucseq['gnuc'] = j
        if len(sstring) < BREAK_POINT:
            nucseq['where'] = -1              # saved locally in this table
            nucseq['sstring'] = sstring
        else:
            if verbose:
                print "saving to vlarray!", len(sstring)
            nucseq['where'] = vlarray.nrows   # row in external VLArray
            vlarray.append(sstring)
        nucseq.append()
    fileh.close()


if __name__ == "__main__":
    import sys, os
    import getopt

    usage = """usage: %s [-f] [-v] filename\n""" % sys.argv[0]
    try:
        opts, pargs = getopt.getopt(sys.argv[1:], 'fv')
    except:
        sys.stderr.write(usage)
        sys.exit(1)

    doprofile = False
    verbose = False
    for option in opts:
        if option[0] == '-f':
            doprofile = True
        elif option[0] == '-v':
            verbose = True

    fname = pargs[0]

    if doprofile:
        import pstats
        import cProfile as prof
        prof.run('create_file(fname, verbose)', 'gataca.prof')
        stats = pstats.Stats('gataca.prof')
        stats.strip_dirs()
        stats.sort_stats('time', 'calls')
        if verbose:
            stats.print_stats()
        else:
            stats.print_stats(20)
    else:
        create_file(fname, verbose)
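For readers who want the gist of the partitioning scheme without running PyTables, here is a minimal sketch (not from the original post; `route` and `fetch` are illustrative names) of the same small/large routing and the `where` read-back indirection, with plain Python lists standing in for the HDF5 Table and VLArray:

```python
BREAK_POINT = 256  # same threshold as in the script above

def route(strings):
    """Split strings into 'table rows' and a 'VLArray' by length.

    Short strings are stored inline with where == -1; long ones get
    where == their index in the external vlarray list.
    """
    table, vlarray = [], []
    for s in strings:
        if len(s) < BREAK_POINT:
            table.append({'where': -1, 'sstring': s})
        else:
            table.append({'where': len(vlarray), 'sstring': ''})
            vlarray.append(s)
    return table, vlarray

def fetch(table, vlarray, i):
    """Read row i back, following the 'where' indirection."""
    row = table[i]
    if row['where'] == -1:
        return row['sstring']       # stored inline in the table
    return vlarray[row['where']]    # stored externally in the VLArray
```

Reading a row back from the real HDF5 file follows the same shape: check the `where` column first, then either take `sstring` from the table row or index into the VLArray node.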
_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users