On Tuesday 19 May 2009 05:03:48, you wrote:
> On May 18, 2009, at 3:06 AM, Francesc Alted wrote:
> > On Monday 18 May 2009 10:31:47, Francesc Alted wrote:
> >> On Sunday 17 May 2009 15:31:00, Robert Ferrell wrote:
> >>> I have an elementary question.
> >>>
> >>> I have a dictionary with about 10,000 keys.  The keys are (shortish)
> >>> strings.  Each value is a time series of structured arrays (record
> >>> arrays) with 5 fields.  Each value totals about 100,000 bytes, so the
> >>> total data size isn't huge, about 1GB.
> >>>
> >>> What would be a good way to store this in PyTables?  I've been
> >>> creating a group for each key, but that is a bad idea (since it's very
> >>> slow).
> >>>
> >>> I have very little knowledge/experience with either databases or
> >>> PyTables, so I'm pretty sure I'm just missing a basic concept.
> >>
> >> Mmh, there are several ways to implement what you want.  However,
> >> provided that your values are structured arrays, the easiest (and
> >> probably one of the fastest) way is to implement the dictionary as a
> >> monolithic table.
> >
> > Er, this is the fastest if you have PyTables Pro and you index the key
> > field, of course ;)
> >
> > Another solution, in case you don't want to buy Pro, is to set up a
> > VLArray of ObjectAtom atoms and save every recarray in a single row.
> > Then, build a table with two fields: 'key', where you save your key, and
> > 'vrow', where you save the row location of your value in the VLArray.
> > With this, you can fetch the value quickly by using an idiom like:
> >
> > print 'key == "2" -->', vlarray[keys.readWhere('key == "2"')['vrow'][0]]
> > print 'key == "1001" -->', vlarray[keys.readWhere('key == "1001"')['vrow'][0]]
> >
> > I'm attaching a new script based on this approach.
>
> Thanks for your quick response.  I'll try this out.  I neglected to
> mention that the time series vary somewhat in length.  I'm thinking
> that makes the VLArray desirable.  In any case, I get the idea of
> putting the keys in the table.  That's a step forward in my
> understanding.

Yet another solution is to use one table to hold all the time series, stored
back to back, and a second table that keeps, for each key, the starting row of
its time series in the first table and the length of that series.  Something
like:

class Record(tb.IsDescription):
    key = tb.StringCol(itemsize=10, pos=0)
    srow = tb.Int64Col(pos=1)   # start row in recarray table
    rlen = tb.Int64Col(pos=2)   # length of recarray in recarray table


With this, the queries would be (k is the key table and v is the values table):

(_, srow, rlen) = k.readWhere('key == "2"')[0]
print 'key == "2" -->', v[srow:srow+rlen]
(_, srow, rlen) = k.readWhere('key == "1001"')[0]
print 'key == "1001" -->', v[srow:srow+rlen]

Attached is a simple example of this.
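
Independently of PyTables, the bookkeeping behind the key table can be
illustrated in memory.  The following is only a sketch under stated
assumptions: it uses plain NumPy, the helper names pack() and fetch() are
hypothetical, and the record dtype is just the one from the example script.
It shows how variable-length series end up back to back in one array while a
small index remembers (srow, rlen) for each key.

```python
import numpy as np

# Assumed record layout; any structured dtype would work the same way.
rec_dtype = np.dtype('int32,float64,bool')


def pack(series_by_key):
    """Concatenate variable-length record arrays into one array.

    Returns (index, values), where index maps key -> (srow, rlen).
    """
    index = {}
    chunks = []
    srow = 0
    for key, arr in series_by_key.items():
        arr = np.asarray(arr, dtype=rec_dtype)
        index[key] = (srow, len(arr))   # start row and length of this series
        chunks.append(arr)
        srow += len(arr)
    if chunks:
        values = np.concatenate(chunks)
    else:
        values = np.empty(0, dtype=rec_dtype)
    return index, values


def fetch(index, values, key):
    """Return the series stored under key as a slice of the big array."""
    srow, rlen = index[key]
    return values[srow:srow + rlen]


# Three series of different lengths, including an empty one.
series = {
    "2": np.array([(0, 0.0, True), (1, 2.0, True)], dtype=rec_dtype),
    "1001": np.array([(0, 0.0, False)], dtype=rec_dtype),
    "7": np.empty(0, dtype=rec_dtype),
}
index, values = pack(series)
print('key == "2" -->', fetch(index, values, "2"))
print('key == "1001" -->', fetch(index, values, "1001"))
```

In the PyTables version, the index dictionary becomes the key table and the
big array becomes the values table, so only the requested slice has to be
read from disk.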

As I said before, there are many possibilities :)

-- 
Francesc Alted

import numpy as np
import tables as tb

N = 10000    # number of keys
M = 5        # upper bound (exclusive) on the number of registers per key
array_dtype = 'int32,float64,bool' # the dtype of your recarray

np.random.seed(1001)

# Declare the key table
class Record(tb.IsDescription):
    key = tb.StringCol(itemsize=10, pos=0)
    srow = tb.Int64Col(pos=1)   # start row in recarray table
    rlen = tb.Int64Col(pos=2)   # length of recarray in recarray table

f = tb.openFile("/tmp/test.h5", "w")
k = f.createTable(f.root, 'keys', Record, expectedrows=N)
v = f.createTable(f.root, 'values', np.empty(0, array_dtype),
                  expectedrows = M*N/2)

# Feed the keys and values tables with some info
krow = k.row
vnrows = 0
for i in xrange(N):
    krow['key'] = str(i)
    rlen = np.random.randint(M)
    krow['srow'] = vnrows
    krow['rlen'] = rlen
    krow.append()
    vnrows += rlen
    value = []
    for j in xrange(rlen):
        value.append((j, i*j, i < M))
    v.append(np.array(value, dtype=array_dtype))
k.flush()
v.flush()

# Now, do some selections:
print "Result of fetches:"
(_, srow, rlen) = k.readWhere('key == "2"')[0]
print 'key == "2" -->', v[srow:srow+rlen]
(_, srow, rlen) = k.readWhere('key == "1001"')[0]
print 'key == "1001" -->', v[srow:srow+rlen]

f.close()
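
A note for readers on Python 3 with PyTables 3.x: the camelCase names used
above (openFile, createTable, readWhere) were renamed to open_file,
create_table and read_where in PyTables 3.0.  What follows is a sketch of the
same script under that assumption, not a definitive port.  The temporary
directory, the smaller N, the fetch() helper, the bytes-encoded keys and the
skipping of empty series are additions made for illustration.

```python
import os
import tempfile

import numpy as np
import tables as tb

N = 100      # number of keys (kept small here so the sketch runs quickly)
M = 5        # series lengths are drawn from 0 .. M-1
array_dtype = np.dtype('int32,float64,bool')  # the dtype of your recarray


# Declare the key table
class Record(tb.IsDescription):
    key = tb.StringCol(itemsize=10, pos=0)
    srow = tb.Int64Col(pos=1)   # start row in the values table
    rlen = tb.Int64Col(pos=2)   # number of rows in the values table


rng = np.random.RandomState(1001)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "test.h5")
    with tb.open_file(path, "w") as f:
        k = f.create_table(f.root, 'keys', Record, expectedrows=N)
        v = f.create_table(f.root, 'values', array_dtype,
                           expectedrows=M * N // 2)

        # Feed the keys and values tables with some info
        krow = k.row
        vnrows = 0
        for i in range(N):
            rlen = int(rng.randint(M))
            krow['key'] = str(i).encode('ascii')  # string columns hold bytes
            krow['srow'] = vnrows
            krow['rlen'] = rlen
            krow.append()
            vnrows += rlen
            if rlen:  # nothing to append for an empty series
                value = [(j, i * j, i < M) for j in range(rlen)]
                v.append(np.array(value, dtype=array_dtype))
        k.flush()
        v.flush()

        def fetch(key):
            """Look up (srow, rlen) for key and slice the values table."""
            hit = k.read_where('key == wanted',
                               {'wanted': key.encode('ascii')})[0]
            srow, rlen = int(hit['srow']), int(hit['rlen'])
            return v[srow:srow + rlen]

        # Now, do some selections:
        total_rows = v.nrows
        series_2 = fetch("2")
        series_42 = fetch("42")
        print('key == "2" -->', series_2)
        print('key == "42" -->', series_42)
```

Passing the key through the condvars dictionary of read_where avoids having
to build the condition string by hand for every lookup.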