Re: [Pytables-users] A Few Questions About Iterating Through PyTables

Josh Ayers Wed, 29 Jun 2011 07:51:56 -0700

Here's an alternative method that uses the built-in search capabilities in
PyTables in place of the itertools library.


readWhere, as used below, returns a NumPy ndarray holding the rows that
match the query, so the result lives in memory.  I think that answers your
question #4.  Two similar methods may be more appropriate depending on your
use case: where returns an iterator over the matching rows, and
getWhereList returns a list of the matching row indices.

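def iter5(tbl):
   # Collect the distinct keys; tbl.col() reads the whole column into memory.
   keys = set(tbl.col('key'))
   # Visit the groups in ascending key order (a bare set has no defined order).
   for _key in sorted(keys):
      # '_key' in the condition string is looked up in the local namespace.
      rows = tbl.readWhere('key == _key')
      # In-place ascending sort on 'value'; iterate rows[::-1] for descending.
      rows.sort(order='value')
      for row in rows:
         print(row['key'], row['value'])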

Hope this helps,
Josh


On Tue, Jun 28, 2011 at 4:51 PM, Geoffrey Zhu <zyzhu2...@gmail.com> wrote:

> Hi All,
>
> I am trying to iterate through records in a pytable. The records in
> the table are ordered by a key. I need to first divide the records
> into groups as defined by the key, then iterate through each group,
> and finally iterate through records in each group. The below code does
> exactly this:
>
> def iter0(tbl):
>    print "***Iter0 - Iterate records by subgroup****"
>    for k1, m in itertools.groupby(tbl,lambda x: x['key']):
>        for v in m:
>            print v['key'], v['value'], type(v)
>
> The complexity comes in when I try to iterate through records in each
> subgroup in a particular order, i.e., if I want to sort the records in
> each group and then iterate through them. Let me generate some fake
> data and then go through the four ways I tried. None of them are
> ideal.
>
>
> This code generates some fake data for our tests.
>
> import itertools
> import tables
>
> hf = tables.openFile('sample.h5','w')
> # Generate some data
> class SampleRecord(tables.IsDescription):
>    key = tables.Int32Col()
>    value = tables.Int32Col()
>
>
> hf.createTable("/", "samples", SampleRecord, "samples")
> for j in range(1, 3):
>    for i in range(10,13):
>        row = hf.root.samples.row
>        row['key'] = j
>        row['value'] = i
>        row.append()
> hf.root.samples.flush()
> hf.flush()
>
> The first method I tried is as follows. This looks exactly like the
> previous code, but in the inner loop, I use "for v in
> sorted(m,key=lambda x: -x['value'])" instead of "for v in m."
>
> def iter1(tbl):
>    print
>    print "****Attempt 1**** - Iterate values by subgroup w/ records in subgroups sorted"
>    print "THIS DOES NOT WORK"
>    for k1, m in itertools.groupby(tbl,lambda x: x['key']):
>        for v in sorted(m,key=lambda x: -x['value']):
>            print v['key'], v['value'], type(v)
>
> However, this gives the wrong results, as follows. I don't know why
> it does not work.
>
> ****Attempt 1**** - Iterate values by subgroup w/ records in subgroups
> sorted
> THIS DOES NOT WORK
> 2 10 <type 'tables.tableExtension.Row'>
> 2 10 <type 'tables.tableExtension.Row'>
> 2 10 <type 'tables.tableExtension.Row'>
> 2 12 <type 'tables.tableExtension.Row'>
> 2 12 <type 'tables.tableExtension.Row'>
> 2 12 <type 'tables.tableExtension.Row'>
>
>
> The second method I tried is as follows. I try to copy what is in the
> inner iterator into a list and then sort the list.
>
> def iter2(tbl):
>    print
>    print "****Attempt 2**** - Iterate values by subgroup w/ records in subgroups sorted"
>    print "THIS DOES NOT WORK, EITHER"
>    for k1, m in itertools.groupby(tbl,lambda x: x['key']):
>        temp_list = list(m)
>        temp_list2 = sorted(temp_list, key=lambda x: -x['value'])
>        for v in temp_list2:
>            print v['key'], v['value'], type(v)
>
> This does not work either. The results are similar to the last one.
>
> ****Attempt 2**** - Iterate values by subgroup w/ records in subgroups
> sorted
> THIS DOES NOT WORK, EITHER
> 2 10 <type 'tables.tableExtension.Row'>
> 2 10 <type 'tables.tableExtension.Row'>
> 2 10 <type 'tables.tableExtension.Row'>
> 2 12 <type 'tables.tableExtension.Row'>
> 2 12 <type 'tables.tableExtension.Row'>
> 2 12 <type 'tables.tableExtension.Row'>
>
>
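On question 2, a likely explanation (my understanding of how table
iteration works, not something shown in this thread): iterating a table
hands back the same Row object on every step and only repositions it, so
list(m) or sorted(m) ends up holding several references to one object that
points at whatever row was read last. The sketch below reproduces the
output above in plain Python. Cursor and iterate are hypothetical stand-ins
for Row and table iteration, not PyTables API. The fix shown is to copy the
fields out while the cursor is still on the row.

```python
import itertools

# Same fake data as in the thread: (key, value) pairs ordered by key.
DATA = [(1, 10), (1, 11), (1, 12), (2, 10), (2, 11), (2, 12)]


class Cursor:
    """Hypothetical stand-in for a table Row: one object, moved per step."""

    def __init__(self, data):
        self._data = data
        self._pos = -1

    def __getitem__(self, name):
        key, value = self._data[self._pos]
        return {'key': key, 'value': value}[name]


def iterate(data):
    """Yield the SAME Cursor object for every row, repositioned each time."""
    cur = Cursor(data)
    for i in range(len(data)):
        cur._pos = i
        yield cur


def broken(data):
    """Attempt 1 style: sort the cursor objects themselves."""
    out = []
    for _, group in itertools.groupby(iterate(data), lambda r: r['key']):
        # sorted() stores references to one shared cursor, which has
        # already moved on by the time the sort keys are evaluated.
        for r in sorted(group, key=lambda r: -r['value']):
            out.append((r['key'], r['value']))
    return out


def fixed(data):
    """Copy the fields out of the cursor before grouping and sorting."""
    out = []
    copies = ((r['key'], r['value']) for r in iterate(data))
    for _, group in itertools.groupby(copies, lambda t: t[0]):
        out.extend(sorted(group, key=lambda t: -t[1]))
    return out
```

broken(DATA) gives (2, 10) three times followed by (2, 12) three times,
the same pattern as attempts 1 and 2. fixed(DATA) gives each group sorted
by descending value.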
> The other two methods I tried are as follows. In these methods, I try
> to get the row index number from the inner iterator and then reference
> the records with these index numbers.
>
>
> def iter3(tbl):
>    print
>    print "****Attempt 3**** - Iterate values by subgroup w/ records in subgroups sorted"
>    print "THIS WORKs, BUT TERRIBLY SLOW!"
>
>    for k1, m in itertools.groupby(tbl,lambda x: x['key']):
>        rows = [x.nrow for x in m]
>        sorted_rows = sorted(rows, key = lambda x: -tbl[x]['value'])
>        for i in sorted_rows:
>            v = tbl[i]
>            print v['key'], v['value'], type(v)
>
> def iter4(tbl):
>    print
>    print "****Attempt 4**** - Iterate values by subgroup w/ records in subgroups sorted"
>    print "THIS WORKs, BUT TERRIBLY SLOW, TOO!"
>
>    for k1, m in itertools.groupby(tbl,lambda x: x['key']):
>        rows = [x.nrow for x in m]
>        sorted_rows = sorted(rows, key = lambda x: -tbl[x]['value'])
>        for v in tbl.itersequence(sorted_rows):
>            print v['key'], v['value'], type(v)
>
>
> These two methods seem to give the correct results, but they are
> terribly slow. They are about 10-20 times slower than the original
> iterator version.
>
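A plausible reason for the slowdown, offered as an assumption rather than
a measurement: tbl[x] inside the sort key is a separate read for every
row, and in attempt 3 each row is then read a second time to print it.
Because the records are ordered by key, each group is a contiguous run of
rows, so it could instead be fetched with a single slice read and sorted
in memory. The sketch below counts reads with CountingTable, a
hypothetical stand-in for a table (not PyTables API), to show the
difference.

```python
class CountingTable:
    """Hypothetical stand-in for a table that counts how often it is read."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.reads = 0

    def __getitem__(self, index):
        # One read per call, whether a single row or a whole slice.
        self.reads += 1
        return self._rows[index]


def per_row(tbl, indices):
    """Attempt 3 style: one read per row for the key, one more to fetch it."""
    ordered = sorted(indices, key=lambda i: -tbl[i][1])
    return [tbl[i] for i in ordered]


def per_group(tbl, start, stop):
    """Read the contiguous group once, then sort the copy in memory."""
    return sorted(tbl[start:stop], key=lambda r: -r[1])


rows = [(1, 10), (1, 11), (1, 12)]   # (key, value) pairs for one group

slow = CountingTable(rows)
slow_result = per_row(slow, range(3))

fast = CountingTable(rows)
fast_result = per_group(fast, 0, 3)
```

Both calls return the group in descending value order, but slow.reads is 6
(two per row) while fast.reads is 1.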
>
> ****Attempt 3**** - Iterate values by subgroup w/ records in subgroups
> sorted
> THIS WORKs, BUT TERRIBLY SLOW!
> 1 12 <type 'numpy.void'>
> 1 11 <type 'numpy.void'>
> 1 10 <type 'numpy.void'>
> 2 12 <type 'numpy.void'>
> 2 11 <type 'numpy.void'>
> 2 10 <type 'numpy.void'>
>
> ****Attempt 4**** - Iterate values by subgroup w/ records in subgroups
> sorted
> THIS WORKs, BUT TERRIBLY SLOW, TOO!
> 1 12 <type 'tables.tableExtension.Row'>
> 1 11 <type 'tables.tableExtension.Row'>
> 1 10 <type 'tables.tableExtension.Row'>
> 2 12 <type 'tables.tableExtension.Row'>
> 2 11 <type 'tables.tableExtension.Row'>
> 2 10 <type 'tables.tableExtension.Row'>
>
>
>
> My questions are:
>
> 1. Is there any better way to do this?
> 2. Why do methods 1 and 2 fail?
> 3. In the last two methods, notice that the types of v are different.
> One is numpy.void and the other is 'tables.tableExtension.Row'. In
> this example, they are used the same way, but when there are nested
> structs, they are used differently -- with the former you will do
> v['foo']['bar'] and with the latter, you will do v['foo/bar']. Why is
> this the case?
> 4. If I want to copy part of the table into memory, what is the best
> way of doing this?
>
> Thanks,
> Geoffrey
>
>
> _______________________________________________
> Pytables-users mailing list
> Pytables-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/pytables-users
>
