I've been finding PyTables useful for organizing big genomics data (e.g.
storing and querying ~200 GB of all-vs-all UniParc Smith-Waterman hits from
UniProt).

One thing that has surprised me a little:

I was interested in the efficiency of querying a small table that stores an
index (integer) and an integer value in each row, versus storing the values
in a plain array.

I am finding the second option about 10x faster when selecting the index of
a particular integer value. I would have guessed that the in-kernel
selection would be faster than reading out the entire array and then using
numpy.where().

Is this expected, or can I do something to make the table selection faster?

In this case I am fine with the second option, so this is just for future
reference.

e.g.

Option 1:

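from tables import IsDescription, UInt8Col, UInt32Col

class testTable(IsDescription):
    index = UInt8Col(pos=0)   # position of the value
    id = UInt32Col(pos=1)     # integer value to search on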

h5_file.createTable(group,'test1',testTable,expectedrows=5000)


def fxn1(group,id):
    """
    Retrieve rows from pytables table.
    """
    return [x['index'] for x in group.test1.where("id == %s" % id)]

#########################

Option 2:

import numpy

z = numpy.array([id1, id2, ...])
h5_file.createArray(group,'test2',z)


def fxn2(group,id):
    """
    Retrieve rows from pytables array.
    About 10x faster than selecting from table!
    """
    return numpy.where(group.test2.listarr == id)[0]
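
To be concrete about what option 2 computes once the values are in memory,
here is a minimal sketch of the numpy.where() step on its own. The ids and
the helper name positions_of are made up for illustration (they are not my
real data or code), and no PyTables file is involved:

```python
import numpy

# Hypothetical ids standing in for the contents of the 'test2' array.
ids = numpy.array([17, 42, 8, 42, 99], dtype=numpy.uint32)


def positions_of(values, target):
    """Return the positions at which `values` equals `target`."""
    # values == target builds a boolean mask; numpy.where returns a tuple
    # with one index array per dimension, so [0] picks the only one here.
    return numpy.where(values == target)[0]


hits = positions_of(ids, 42)
print(hits.tolist())  # [1, 3]
```

The positions returned play the same role as the 'index' column in
option 1, which is why I can drop that column when using an array.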


Thanks,
Rich