[Pytables-users] A Few Questions About Iterating Through PyTables

Geoffrey Zhu Tue, 28 Jun 2011 13:51:33 -0700

Hi All,

I am trying to iterate through records in a PyTables table. The records
in the table are ordered by a key. I need to first divide the records
into groups as defined by the key, then iterate through each group,
and finally iterate through the records in each group. The code below
does exactly this:


import itertools

def iter0(tbl):
    print "***Iter0 - Iterate records by subgroup****"
    # groupby only merges adjacent rows, so the table must be ordered by key.
    for k1, m in itertools.groupby(tbl,lambda x: x['key']):
        for v in m:
            print v['key'], v['value'], type(v)
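
For reference, here is a minimal sketch of the same grouping pattern on plain in-memory data, with no PyTables involved. The `records` list and the `group_values` helper are hypothetical names used only for illustration. It relies on the records already being ordered by key, because `itertools.groupby` only merges adjacent items with equal keys.

```python
import itertools

# Hypothetical in-memory records, already ordered by 'key' like the table.
records = [{'key': k, 'value': v} for k in (1, 2) for v in (10, 11, 12)]


def group_values(rows):
    """Return [(key, [values...]), ...] for runs of adjacent equal keys."""
    return [(key, [r['value'] for r in grp])
            for key, grp in itertools.groupby(rows, key=lambda r: r['key'])]


grouped = group_values(records)
```

If the input were not ordered by key, the same key could show up in more than one group.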

The complexity comes in when I try to iterate through records in each
subgroup in a particular order, i.e., if I want to sort the records in
each group and then iterate through them. Let me generate some fake
data and then go through the four ways I tried. None of them are
ideal.


This code generates some fake data for our tests.

import tables

hf = tables.openFile('sample.h5','w')
# Generate some data
class SampleRecord(tables.IsDescription):
    key = tables.Int32Col()
    value = tables.Int32Col()


hf.createTable("/", "samples", SampleRecord, "samples")
for j in range(1, 3):
    for i in range(10,13):
        row = hf.root.samples.row
        row['key'] = j
        row['value'] = i
        row.append()
hf.root.samples.flush()
hf.flush()

The first method I tried is as follows. This looks exactly like the
previous code, but in the inner loop, I use "for v in
sorted(m,key=lambda x: -x['value'])" instead of "for v in m."

def iter1(tbl):
    print
    print "****Attempt 1**** - Iterate values by subgroup w/ records in subgroups sorted"
    print "THIS DOES NOT WORK"
    for k1, m in itertools.groupby(tbl,lambda x: x['key']):
        for v in sorted(m,key=lambda x: -x['value']):
            print v['key'], v['value'], type(v)

However, this gives the wrong results, as follows. I don't know why
it does not work.

****Attempt 1**** - Iterate values by subgroup w/ records in subgroups sorted
THIS DOES NOT WORK
2 10 <type 'tables.tableExtension.Row'>
2 10 <type 'tables.tableExtension.Row'>
2 10 <type 'tables.tableExtension.Row'>
2 12 <type 'tables.tableExtension.Row'>
2 12 <type 'tables.tableExtension.Row'>
2 12 <type 'tables.tableExtension.Row'>


The second method I tried is as follows. I try to copy what is in the
inner iterator into a list and then sort the list.

def iter2(tbl):
    print
    print "****Attempt 2**** - Iterate values by subgroup w/ records in subgroups sorted"
    print "THIS DOES NOT WORK, EITHER"
    for k1, m in itertools.groupby(tbl,lambda x: x['key']):
        temp_list = list(m)
        temp_list2 = sorted(temp_list, key=lambda x: -x['value'])
        for v in temp_list2:
            print v['key'], v['value'], type(v)

This does not work either. The output is the same as in the first attempt.

****Attempt 2**** - Iterate values by subgroup w/ records in subgroups sorted
THIS DOES NOT WORK, EITHER
2 10 <type 'tables.tableExtension.Row'>
2 10 <type 'tables.tableExtension.Row'>
2 10 <type 'tables.tableExtension.Row'>
2 12 <type 'tables.tableExtension.Row'>
2 12 <type 'tables.tableExtension.Row'>
2 12 <type 'tables.tableExtension.Row'>
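
One hypothesis that would explain both outputs above, stated here as an assumption rather than something verified in this post: iterating over the table may hand back the same Row object on every step, repositioned each time, so a list or `sorted()` call ends up holding several references to one cursor. The sketch below simulates that with a hypothetical `SharedRow` class and `iter_rows` generator (neither is part of PyTables) and shows that copying the field values out during the pass avoids the problem.

```python
import itertools


class SharedRow(object):
    """Hypothetical cursor-like row: one object, repositioned on each step."""

    def __init__(self):
        self._data = None

    def __getitem__(self, name):
        return self._data[name]


def iter_rows(records):
    """Yield the SAME SharedRow object for every record (the assumption)."""
    row = SharedRow()
    for rec in records:
        row._data = rec
        yield row


records = [{'key': k, 'value': v} for k in (1, 2) for v in (10, 11, 12)]


def sorted_keeping_rows(rows):
    """Pattern of Attempts 1 and 2: sort the row objects themselves."""
    out = []
    for _, grp in itertools.groupby(rows, key=lambda r: r['key']):
        for r in sorted(grp, key=lambda r: -r['value']):
            out.append((r['key'], r['value']))
    return out


def sorted_copying_values(rows):
    """Copy the fields out while the cursor still points at each record."""
    out = []
    for _, grp in itertools.groupby(rows, key=lambda r: r['key']):
        snapshot = [(r['key'], r['value']) for r in grp]
        out.extend(sorted(snapshot, key=lambda kv: -kv[1]))
    return out


broken = sorted_keeping_rows(iter_rows(records))
fixed = sorted_copying_values(iter_rows(records))
```

Under this assumption, `broken` reproduces the repeated `2 10` and `2 12` lines, while `fixed` comes out in descending order within each key. With a real table, the analogous step would be to copy the needed fields into a tuple inside the loop instead of keeping the row object.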


The other two methods I tried are as follows. In these methods, I try
to get the row index number from the inner iterator and then reference
the records with these index numbers.


def iter3(tbl):
    print
    print "****Attempt 3**** - Iterate values by subgroup w/ records in subgroups sorted"
    print "THIS WORKs, BUT TERRIBLY SLOW!"

    for k1, m in itertools.groupby(tbl,lambda x: x['key']):
        rows = [x.nrow for x in m]
        sorted_rows = sorted(rows, key = lambda x: -tbl[x]['value'])
        for i in sorted_rows:
            v = tbl[i]
            print v['key'], v['value'], type(v)

def iter4(tbl):
    print
    print "****Attempt 4**** - Iterate values by subgroup w/ records in subgroups sorted"
    print "THIS WORKs, BUT TERRIBLY SLOW, TOO!"

    for k1, m in itertools.groupby(tbl,lambda x: x['key']):
        rows = [x.nrow for x in m]
        sorted_rows = sorted(rows, key = lambda x: -tbl[x]['value'])
        for v in tbl.itersequence(sorted_rows):
            print v['key'], v['value'], type(v)


These two methods seem to give the correct results, but they are
terribly slow. They are about 10-20 times slower than the original
iterator version.
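
One factor that may contribute to the slowdown (a guess, not measured here) is that both methods index back into the table by row number, once inside the sort key and, in Attempt 3, once more in the loop. The sketch below uses a hypothetical `CountingTable` stand-in (not a PyTables class) to count those random-access reads, and compares them with capturing the sort field during the single sequential pass.

```python
import itertools


class CountingTable(object):
    """Hypothetical stand-in for a table that counts random-access reads."""

    def __init__(self, records):
        self._records = records
        self.random_reads = 0

    def __iter__(self):
        # Sequential pass: yields (row_number, record) pairs, not counted.
        return iter(enumerate(self._records))

    def __getitem__(self, index):
        # Random access by row number: the operation being counted.
        self.random_reads += 1
        return self._records[index]


records = [{'key': k, 'value': v} for k in (1, 2) for v in (10, 11, 12)]


def sort_by_lookup(tbl):
    """Pattern of Attempt 3: look each row up again inside the sort key."""
    out = []
    for _, grp in itertools.groupby(tbl, key=lambda item: item[1]['key']):
        rows = [nrow for nrow, _ in grp]
        for i in sorted(rows, key=lambda n: -tbl[n]['value']):
            rec = tbl[i]
            out.append((rec['key'], rec['value']))
    return out


def sort_by_captured_value(tbl):
    """Alternative: capture the sort field during the sequential pass."""
    out = []
    for key, grp in itertools.groupby(tbl, key=lambda item: item[1]['key']):
        values = sorted((rec['value'] for _, rec in grp), reverse=True)
        out.extend((key, value) for value in values)
    return out


slow_tbl = CountingTable(records)
fast_tbl = CountingTable(records)
slow_result = sort_by_lookup(slow_tbl)
fast_result = sort_by_captured_value(fast_tbl)
```

Both functions return the same ordering, but the first performs two random reads per row and the second performs none. Whether this accounts for the 10-20x difference on a real table would need to be measured.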


****Attempt 3**** - Iterate values by subgroup w/ records in subgroups sorted
THIS WORKs, BUT TERRIBLY SLOW!
1 12 <type 'numpy.void'>
1 11 <type 'numpy.void'>
1 10 <type 'numpy.void'>
2 12 <type 'numpy.void'>
2 11 <type 'numpy.void'>
2 10 <type 'numpy.void'>

****Attempt 4**** - Iterate values by subgroup w/ records in subgroups sorted
THIS WORKs, BUT TERRIBLY SLOW, TOO!
1 12 <type 'tables.tableExtension.Row'>
1 11 <type 'tables.tableExtension.Row'>
1 10 <type 'tables.tableExtension.Row'>
2 12 <type 'tables.tableExtension.Row'>
2 11 <type 'tables.tableExtension.Row'>
2 10 <type 'tables.tableExtension.Row'>



My questions are:

1. Is there any better way to do this?
2. Why do methods 1 and 2 fail?
3. In the last two methods, notice that the types of v are different.
One is numpy.void and the other is 'tables.tableExtension.Row'. In
this example, they are used the same way, but when there are nested
structs, they are used differently -- with the former you will do
v['foo']['bar'] and with the latter, you will do v['foo/bar']. Why is
this the case?
4. If I want to copy part of the table into memory, what is the best
way of doing this?

Thanks,
Geoffrey

