Hi Mark,

On Saturday 11 April 2009, Mark Fenner wrote:
> Hi folks,
>
> Quick question:
>
> I have data (lots of it) that looks like this:
>
> mydict['key_string'] = [(int1, 'str1'), (int2, 'str2'), ... (intn,
> 'strn')]
>
> The ints are 7 digits max (unsigned 24 bits max); the strings are a 3
> character code (it could be replaced with a 4-bit number -- possibly
> an Enum?).

No doubt that an Enum with an 'int8' (8-bit int) as base type would 
consume far less space in your case (although enabling compression 
would help reduce your working set too).
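
As a rough back-of-the-envelope check of the per-entry saving, here is a 
small sketch using only the standard library.  The three codes and the 
CODE_TO_INT mapping are made up for illustration, and the packed 'struct' 
layout is only an approximation of how a table row is stored:

```python
import struct

# Hypothetical 3-character codes; substitute the real list.
CODES = ["AAA", "BBB", "CCC"]
CODE_TO_INT = {code: i for i, code in enumerate(CODES)}

# One entry stored as a 32-bit int plus the raw 3-byte string code.
as_string = struct.pack("<i3s", 1234567, b"AAA")

# The same entry with the code replaced by a signed 8-bit integer.
as_enum = struct.pack("<ib", 1234567, CODE_TO_INT["AAA"])

print(len(as_string), len(as_enum))  # 7 5
```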

> It would also be possible to structure the data like this, if it
> would help matters:
>
> mydict['key_string'] = [(int1, int2, int3, ...., intn), ('str1',
> 'str2', 'str3', ..., 'strn')]
>
> (or both inner and outer being lists, or both tuples, etc., whichever
> is a better way to think about the PyTables data structuring)
>
> I should note that while there are _many_ keys, there are relatively
> tame entries per key (say a maximum of
> 10?  maybe 20 in a very rare instance).  The overall database is
> about 600MB which I currently wrote out to disk
> as a text python dictionary (by hand, it crashed cPickle) ... the
> data I scraped out amounted to about 300MB.
> Even reading that in with execfile was a bad idea.  I had to resort
> to reading subsets and appending them to
> the in-memory dictionary. Needless to say, these options aren't going
> to work.  I don't mind 20 minutes to build
> the datastructure, but another 20 to load it isn't going to work very
> well.  And, I typically only need some entries, not
> all of them.
>
> Assuming that I want to be able to quickly look up a 'key_string' and
> return the list of tuples (or equivalent structure), how should I
> structure a pytable to hold
> this data?  In particular, I'm puzzling out what my "row" class
> should look like.  Of course, I'd like to avoid extraneous rows if
> possible. But, maybe I'm not thinking about "rows" in the right way.
>  Since each entry (a row?) has a list of things associated with it
> and b/c those things are uniform types, I was thinking of using an
> array within a row, but I don't think that is possible.

Well, you have several possibilities here, but IMO, two would be the 
most useful:

1. Put all your data in a single, large table.  Its declaration would 
look like:

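from tables import IsDescription, StringCol, Int32Col, EnumCol

enumvals = ['YYYY','YYYZ'...]   # the full list of codes goes here
class MyDescr(IsDescription):
    keystr = StringCol(itemsize=XX)  # XX: length of your longest key
    icol = Int32Col()
    ecol = EnumCol(enumvals, enumvals[0], base='int8')  # 2nd arg: default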

Then, you only have to create the table and populate it, repeating keys 
in several entries as necessary.  In your case, it would help the query 
speed to append the entries with the same key sequentially, i.e. one 
after the other.  To retrieve the list of entries with a certain key, 
just do this:

lkey = [(r['icol'], r['ecol']) for r in tbl.where('keystr == "YYYZ"')]

This will do a lookup in the table via an optimized in-kernel query.  
Normally this is fast enough for small to medium sized tables, but if 
you need more speed, PyTables Pro will let you index the 'keystr' 
column:

tbl.cols.keystr.createIndex()

for an almost direct access to the interesting data.
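
To make schema 1 concrete, here is a minimal end-to-end sketch.  It assumes 
a current PyTables release running on Python 3, where the methods are 
spelled open_file(), create_table() and create_index(), and where string 
columns are compared against bytes.  The codes, the sample dictionary, the 
itemsize of 16 and the build_and_query() helper are all made up for 
illustration:

```python
import os
import tempfile

import tables

# Hypothetical codes mapped to small integers that fit in an int8.
code_enum = tables.Enum({"AAA": 0, "BBB": 1, "CCC": 2})


class Entry(tables.IsDescription):
    keystr = tables.StringCol(itemsize=16)  # assumed maximum key length
    icol = tables.Int32Col()
    ecol = tables.EnumCol(code_enum, "AAA", base="int8")


# Made-up sample data in the original (int, 'str') layout.
mydict = {
    "alpha": [(1234567, "AAA"), (42, "CCC")],
    "beta": [(7, "BBB")],
}


def build_and_query(path, key):
    """Write mydict as one flat table, then return the entries for key."""
    with tables.open_file(path, mode="w") as h5:
        tbl = h5.create_table("/", "entries", Entry)
        row = tbl.row
        # Entries sharing a key are appended one after the other.
        for k, pairs in mydict.items():
            for number, code in pairs:
                row["keystr"] = k
                row["icol"] = number
                row["ecol"] = code_enum[code]  # store the integer value
                row.append()
        tbl.flush()
        tbl.cols.keystr.create_index()
        hits = tbl.where("keystr == wanted",
                         condvars={"wanted": key.encode()})
        # Translate the stored integer back to its code name.
        return [(int(r["icol"]), code_enum(int(r["ecol"]))) for r in hits]


with tempfile.TemporaryDirectory() as tmpdir:
    lkey = build_and_query(os.path.join(tmpdir, "demo.h5"), "alpha")
```

Note that the enum column hands back the stored integer, so the sketch 
converts it to the code name explicitly.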

2. A somewhat more efficient schema for your data (albeit a bit more 
complicated) is to place your string keys in a single-column table, and 
put your list of entries in a couple of vlarrays, one for ints 
(say, 'ivla') and the other for enum values (say, 'evla').  Now, the 
trick is to put the values in the vlarrays at the same row numbers as 
the key string entries.  For example, when populating the 3 leaves, you 
can do something like:

row = tbl.row
for key, value in mydict.items():
  row['keystr'] = key
  row.append()           # key goes into row i of the table
  ivla.append(value[0])  # (int1, int2, ...., intn) goes into row i of ivla
  evla.append(value[1])  # values for (str1, ...., strn) go into row i of evla
tbl.flush()              # write any buffered rows before querying

After this, the query would be something like:

lkey = [(ivla[r.nrow], evla[r.nrow])
        for r in tbl.where('keystr == "YYYZ"')]

Again, you can use PyTables Pro so as to index the 'keystr' column.
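
For schema 2, a minimal sketch of the three aligned leaves could look like 
the following.  It again assumes a current PyTables release on Python 3 
(open_file(), create_vlarray()).  The codes, the sample data and the 
build_and_lookup() helper are hypothetical, and for simplicity the codes 
are stored as plain int8 values with the name mapping kept on the Python 
side:

```python
import os
import tempfile

import tables

# Hypothetical codes and their small-integer replacements.
CODE_TO_INT = {"AAA": 0, "BBB": 1, "CCC": 2}
INT_TO_CODE = {value: code for code, value in CODE_TO_INT.items()}

# Made-up sample data in the [(ints...), (strs...)] layout.
mydict = {
    "alpha": [(1234567, 42), ("AAA", "CCC")],
    "beta": [(7,), ("BBB",)],
}


def build_and_lookup(path, key):
    """Write keys and both vlarrays row-aligned, then look up one key."""
    with tables.open_file(path, mode="w") as h5:
        tbl = h5.create_table("/", "keys",
                              {"keystr": tables.StringCol(itemsize=16)})
        ivla = h5.create_vlarray("/", "ivla", atom=tables.Int32Atom())
        evla = h5.create_vlarray("/", "evla", atom=tables.Int8Atom())
        row = tbl.row
        for k, (ints, codes) in mydict.items():
            # Each append lands at the same row number in all three leaves.
            row["keystr"] = k
            row.append()
            ivla.append(ints)
            evla.append([CODE_TO_INT[c] for c in codes])
        tbl.flush()
        hits = tbl.where("keystr == wanted",
                         condvars={"wanted": key.encode()})
        result = []
        for r in hits:
            i = int(r.nrow)  # row number shared by tbl, ivla and evla
            ints = ivla[i].tolist()
            codes = [INT_TO_CODE[c] for c in evla[i].tolist()]
            result.append((ints, codes))
        return result


with tempfile.TemporaryDirectory() as tmpdir:
    lkey = build_and_lookup(os.path.join(tmpdir, "demo.h5"), "alpha")
```

With this layout each key is stored only once, at the price of keeping the 
three leaves in step whenever you append.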

I'd recommend trying both schemas and choosing whichever seems more 
appropriate to your needs.

HTH,

-- 
Francesc Alted

"One would expect people to feel threatened by the 'giant
brains or machines that think'.  In fact, the frightening
computer becomes less frightening if it is used only to
simulate a familiar noncomputer."

-- Edsger W. Dijkstra
   "On the cruelty of really teaching computer science"

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
