On Saturday 12 December 2009 18:45:33 Ernesto wrote:
> Dear Francesc,
> 
> thank you for your reply. I'll try to better explain my problem using
> real examples of data and code.
[clip]

I've been running some benchmarks based on your requirements, and my conclusion
is that the implementation of variable-length types in HDF5 is not very
efficient, especially in the HDF5 1.8.x series (see [1]).  So you should avoid
using VLArrays for saving small arrays: they fit better in table fields.

With this in mind, a possible solution is to distinguish between small and
large strings (for this case).  Small strings can be saved in a Table field,
while larger ones go into a VLArray.  You will then have to add another field
to the table specifying where the data is (for example, -1 could mean "in this
table" and any non-negative value "the index in the VLArray").  You may want
to experiment to find the optimal threshold that separates 'small' strings
from 'large' ones, but anything between 128 and 1024 should work fine.
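
To make the bookkeeping concrete, here is a minimal sketch of the scheme in
plain Python.  The two lists are only stand-ins for the Table and the VLArray,
and the names store() and fetch() are hypothetical helpers invented for this
illustration, not part of PyTables.

```python
# Plain-Python stand-ins: `table` plays the Table, `vlarray` the VLArray.
BREAK_POINT = 256  # assumed threshold, in characters


def store(table, vlarray, s):
    """Append s, keeping short strings inside the table row itself."""
    if len(s) < BREAK_POINT:
        # -1 means "the data lives in this table".
        table.append({'where': -1, 'sstring': s})
    else:
        # Any non-negative value is the row index in the VLArray.
        table.append({'where': len(vlarray), 'sstring': ''})
        vlarray.append(s)


def fetch(table, vlarray, i):
    """Return the string belonging to row i, wherever it was stored."""
    row = table[i]
    if row['where'] == -1:
        return row['sstring']
    return vlarray[row['where']]


table, vlarray = [], []
for s in ("ACGT", "G" * 1000, "TTGA"):
    store(table, vlarray, s)
```

Reading is then a single lookup on the 'where' field: fetch(table, vlarray, 1)
follows the index into the VLArray, while rows 0 and 2 are served directly
from the table.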

I'm attaching the script that I've been using for my own benchmarking.  Notice
that if your optimal break-point (threshold) is too large (say, > 10000
bytes), then this partition is not going to work well, but chances are that
your scenario will fit here easily.  If not, one can think of a finer
partition, but let's start with this one.
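
As a rough aid for choosing the break-point, the sketch below estimates which
fraction of the strings would end up in the VLArray.  It assumes string
lengths follow the distribution used by the benchmark script (an exponential
with mean 100, truncated to an integer); real data will differ, so treat the
numbers as illustrative only.  The function name is hypothetical.

```python
import math

MEAN_LEN = 100.0  # scale used by the benchmark generator (a * 100)


def fraction_in_vlarray(break_point, mean_len=MEAN_LEN):
    """P(length >= break_point) for exponentially distributed lengths."""
    return math.exp(-break_point / mean_len)


for bp in (128, 256, 1024):
    print("%5d -> %.4f%%" % (bp, 100 * fraction_in_vlarray(bp)))
```

Under that assumption, a break-point of 256 sends a bit under 8% of the
strings to the VLArray, and 1024 sends almost none, at the price of a much
wider fixed-size table field.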

[1] http://mail.hdfgroup.org/pipermail/hdf-forum_hdfgroup.org/2009-December/002298.html

Cheers,

-- 
Francesc Alted
import tables as t
import numpy as np

LEN_INPUT = int(1e6)

#BREAK_POINT = 1024
BREAK_POINT = 256

map_ = {0:'A', 1:'C', 2:'G', 3:'T'}

def create_input(len):
    return "".join([map_[i] for i in np.random.random_integers(0,3,size=len)])

def get_short_string(len):
    a = np.random.standard_exponential(len)
    b = np.array(a*100, dtype='i4')
    for l in b:
        yield "".join([map_[i] for i in np.random.random_integers(0,3,size=l)])

def create_file(fname, verbose):
    class NucSeq(t.IsDescription):
        id = t.Int32Col(pos=1)        # integer
        where = t.Int32Col(pos=2)
        gnuc = t.StringCol(1, pos=3)   # 1-character String
        sstring = t.StringCol(BREAK_POINT-1, pos=4)

    fileh = t.openFile(fname, mode = "w")
    root = fileh.root
    group = fileh.createGroup(root, "newgroup")
    tableNuc = fileh.createTable(group, 'tableNuc', NucSeq, "tableNuc",
                                 t.Filters(1, complib='lzo'),
                                 expectedrows=LEN_INPUT)
    nucseq = tableNuc.row
    vlarray = fileh.createVLArray(root, 'vlarray', t.VLStringAtom(),
                                  "vlarray test")

    gen_sstring = get_short_string(LEN_INPUT)
    for x,j in enumerate(create_input(LEN_INPUT)):
        sstring = next(gen_sstring)
        nucseq['id'] = x
        nucseq['gnuc'] = j
        if len(sstring) < BREAK_POINT:
            nucseq['where'] = -1   # saved locally in this table
            nucseq['sstring'] = sstring
        else:
            if verbose:
                print "saving to vlarray!", len(sstring)
            nucseq['where'] = vlarray.nrows   # row in external VLArray
            vlarray.append(sstring)
        nucseq.append()

    fileh.close()

if __name__=="__main__":
    import sys, os
    import getopt

    usage = """usage: %s [-f] [-v] filename\n""" % sys.argv[0]
    try:
        opts, pargs = getopt.getopt(sys.argv[1:], 'fv')
    except getopt.GetoptError:
        sys.stderr.write(usage)
        sys.exit(1)

    doprofile = False
    verbose = False
    for option in opts:
        if option[0] == '-f':
            doprofile = True
        elif option[0] == '-v':
            verbose = True
    if len(pargs) != 1:
        # The output filename is mandatory.
        sys.stderr.write(usage)
        sys.exit(1)
    fname = pargs[0]

    if doprofile:
        import pstats
        import cProfile as prof
        prof.run('create_file(fname, verbose)', 'gataca.prof')
        stats = pstats.Stats('gataca.prof')
        stats.strip_dirs()
        stats.sort_stats('time', 'calls')
        if verbose:
            stats.print_stats()
        else:
            stats.print_stats(20)
    else:
        create_file(fname, verbose)