On Monday 14 December 2009 14:25:48, Francesc Alted wrote:
> On Saturday 12 December 2009 18:45:33, Ernesto wrote:
> > Dear Francesc,
> >
> > thank you for your reply. I'll try to better explain my problem using
> > real examples of data and code.
> 
> [clip]
> 
> I've been doing some benchmarks based on your requirements, and my
> conclusion is that the implementation of variable-length types in HDF5 is
> not very efficient, especially with the HDF5 1.8.x series (see [1]).  So you
> should avoid using VLArrays for saving small arrays: they fit better in
> table fields.
> 
> Given this, a possible solution is to distinguish between small and large
> strings (for this case).  Small strings can be saved in a Table field,
> while larger ones are written to a VLArray.  Then you will have to
> add another field to the table specifying where the data is (for example,
> -1 could mean "in this table" and any non-negative value "the index in
> the VLArray").  You may want to experiment to find the optimal
> threshold that separates 'small' strings from 'large' ones, but anything
> between 128 and 1024 should work fine.
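To make the `-1` convention concrete, here is a minimal pure-Python sketch of the dispatch logic (no PyTables needed; the names `store` and `fetch` are hypothetical, with a plain list standing in for the Table and another for the VLArray):

```python
BREAK_POINT = 512  # assumed threshold; tune as discussed above

def store(rows, vlarray, rec_id, s):
    """Save string s either inline in the row or in the VLArray."""
    if len(s) < BREAK_POINT:
        rows.append((rec_id, -1, s))             # small: stored in the table row
    else:
        rows.append((rec_id, len(vlarray), ""))  # large: 'where' is the VLArray index
        vlarray.append(s)

def fetch(rows, vlarray, rec_id):
    """Recover the string for a row by following the 'where' field."""
    _, where, s = rows[rec_id]
    return s if where == -1 else vlarray[where]
```

The attached script implements the same `store` path with real PyTables objects; the `fetch` path shows how a reader would resolve the `where` field.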
> 
> I'm adding the script that I've been using for my own benchmarking.  Notice
> that if your optimal break-point (threshold) is too large (say, > 10000
> bytes), then this partition is not going to work well, but chances are that
> your scenario fits here easily.  If not, one can think of a finer
> partition, but let's start with this one.
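For a sense of how the threshold interacts with the length distribution used in the benchmark (standard exponential scaled by 100, so mean length ≈ 100 characters), here is a quick estimate of the fraction of rows that would spill into the VLArray (a sketch; the exact figure is sampling-dependent):

```python
import numpy as np

BREAK_POINT = 512
# Lengths drawn as in the benchmark script: standard exponential, scaled by 100.
lengths = np.array(np.random.standard_exponential(1000000) * 100, dtype='i4')
frac_large = (lengths >= BREAK_POINT).mean()
# Analytically, P(len >= 512) = exp(-512/100), roughly 0.6%, so only a
# tiny fraction of rows should end up in the VLArray with this threshold.
```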

Mmh, I've done some benchmarking and most of the time in gataca2.py is 
consumed in numpy.random.  The new attached version (gataca3.py) gets rid of 
this bottleneck by returning a simpler, non-scrambled string (i.e. "A"*len).  
With this, the exponential distribution and a `BREAK_POINT` of 512, I get 
around 150 Krows/second, which I find quite good performance.
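To illustrate where the time goes, here is a hypothetical micro-benchmark comparing the two string generators (absolute timings are machine-dependent, but the scrambled version should be noticeably slower):

```python
import timeit
import numpy as np

map_ = {0: 'A', 1: 'C', 2: 'G', 3: 'T'}

def scrambled(n):
    # Per-character random nucleotides, as gataca2.py did.
    return "".join([map_[i] for i in np.random.randint(0, 4, size=n)])

def constant(n):
    # Simplified constant string, as gataca3.py does.
    return "A" * n

t_rand = timeit.timeit(lambda: scrambled(100), number=1000)
t_const = timeit.timeit(lambda: constant(100), number=1000)
```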

-- 
Francesc Alted
import tables as t
import numpy as np

LEN_INPUT = int(1e6)

#BREAK_POINT = 1024
BREAK_POINT = 512

map_ = {0:'A', 1:'C', 2:'G', 3:'T'}

def create_input(n):
    # Random sequence of n nucleotides ('n' avoids shadowing the builtin len).
    return "".join([map_[i] for i in np.random.random_integers(0,3,size=n)])

def get_short_string(n):
    # Yield n strings whose lengths follow an exponential distribution (mean ~100).
    a = np.random.standard_exponential(n)
    b = np.array(a*100, dtype='i4')
    for l in b:
        #yield "".join([map_[i] for i in np.random.random_integers(0,3,size=l)])
        yield "A"*l

def create_file(fname, verbose):
    class NucSeq(t.IsDescription):
        id = t.Int32Col(pos=1)        # integer
        where = t.Int32Col(pos=2)
        gnuc = t.StringCol(1, pos=3)   # 1-character String
        sstring = t.StringCol(BREAK_POINT-1, pos=4)

    fileh = t.openFile(fname, mode = "w")
    root = fileh.root
    group = fileh.createGroup(root, "newgroup")
    tableNuc = fileh.createTable(group, 'tableNuc', NucSeq, "tableNuc",
                                 t.Filters(1),
                                 expectedrows=LEN_INPUT)
    nucseq = tableNuc.row
    vlarray = fileh.createVLArray(root, 'vlarray', t.VLStringAtom(),
                                  "vlarray test")

    gen_sstring = get_short_string(LEN_INPUT)
    for x,j in enumerate(create_input(LEN_INPUT)):
        sstring = gen_sstring.next()
        nucseq['id'] = x
        nucseq['gnuc'] = j
        if len(sstring) < BREAK_POINT:
            nucseq['where'] = -1   # saved locally in this table
            nucseq['sstring'] = sstring
        else:
            if verbose:
                print "saving to vlarray!", len(sstring)
            nucseq['where'] = vlarray.nrows   # row in external VLArray
            vlarray.append(sstring)
        nucseq.append()

    fileh.close()

if __name__=="__main__":
    import sys, os
    import getopt

    usage = """usage: %s [-f] [-v] filename\n""" % sys.argv[0]
    try:
        opts, pargs = getopt.getopt(sys.argv[1:], 'fv')
    except getopt.GetoptError:
        sys.stderr.write(usage)
        sys.exit(1)

    doprofile = False
    verbose = False
    for option in opts:
        if option[0] == '-f':
            doprofile = True
        elif option[0] == '-v':
            verbose = True
    fname = pargs[0]

    if doprofile:
        import pstats
        import cProfile as prof
        prof.run('create_file(fname, verbose)', 'gataca.prof')
        stats = pstats.Stats('gataca.prof')
        stats.strip_dirs()
        stats.sort_stats('time', 'calls')
        if verbose:
            stats.print_stats()
        else:
            stats.print_stats(20)
    else:
        create_file(fname, verbose)
_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
