Re: [Pytables-users] A Few Questions About Iterating Through PyTables

Josh Ayers Wed, 29 Jun 2011 11:30:37 -0700

Geoffrey,

I think the difference is that your iter0 function accesses the records in
the order in which they appear in the table, while all the other methods
sort them.
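
One possible way to keep the single in-order pass of iter0 and still sort
within each group is to copy the field values out of each row before
sorting. This is only a sketch under assumptions: the table is already
ordered by key, and, as far as I understand, the table iterator hands back
the same Row object on every step, so a list of Row objects cannot be sorted
meaningfully. The name iter_sorted_groups is made up for illustration, and a
plain list of dicts stands in for the table.

```python
import itertools

def iter_sorted_groups(records):
    """Yield (key, values) per group, values sorted in descending order.

    `records` is any iterable whose items support item['key'] and
    item['value'] and that is already ordered by key. The list used
    below is a stand-in; a table iterated row by row is assumed to
    behave the same way.
    """
    # Copy plain values out of each record as it is read, so nothing
    # holds on to a row object that may be reused by the iterator.
    pairs = ((rec['key'], rec['value']) for rec in records)
    for key, group in itertools.groupby(pairs, key=lambda p: p[0]):
        values = sorted((v for _, v in group), reverse=True)
        yield key, values

# Stand-in for the sample table: keys 1..2, values 10..12.
sample = [{'key': j, 'value': i} for j in range(1, 3) for i in range(10, 13)]
result = list(iter_sorted_groups(sample))
```

Each record is read once, and only one group's values are held in memory at
a time, so the overall cost stays close to that of the plain iteration plus
the per-group sorts.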


If you only need to read the records associated with a single key at a time,
you could take advantage of the fact that the keys are sorted by storing the
beginning and ending record indices for each key in a separate data
structure.  You could use a separate table for this, or a dictionary stored
as an attribute on the original table.  That would make retrieving one key's
worth of data an O(m) operation, where m is the number of records using that
key.
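
A rough sketch of that index, assuming the key column is sorted so that
equal keys are contiguous. The function name build_key_index and the plain
list of keys are made up for illustration; with a real table the keys could
come from tbl.col('key') and the slice could be read with something like
tbl.read(start, stop).

```python
import itertools

def build_key_index(keys):
    """Map each key to (start, stop) row indices, with stop exclusive.

    `keys` must already be sorted (or at least grouped) so that all
    rows sharing a key sit next to each other.
    """
    index = {}
    pos = 0
    for key, group in itertools.groupby(keys):
        count = sum(1 for _ in group)  # number of rows using this key
        index[key] = (pos, pos + count)
        pos += count
    return index

# Stand-in for the key column of the sample table.
keys = [1, 1, 1, 2, 2, 2]
index = build_key_index(keys)

# Looking up one key is a dictionary access; reading its rows then
# touches only the m rows in [start, stop).
start, stop = index[2]
rows_for_key = keys[start:stop]
```

Building the index is a single O(n) pass that only has to be redone when the
table changes.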

Thanks,
Josh



On Wed, Jun 29, 2011 at 12:19 PM, Geoffrey Zhu <zyzhu2...@gmail.com> wrote:

> Hi Josh,
>
> Thanks for your response.  The problem with readWhere() is that it does
> not fully take advantage of the fact that my table is sorted and the
> time to iterate over it should be no greater than O(n).
>
> The strange thing is that my iter0() is really fast but all other
> versions are really slow. Maybe iter0() is only reading the fields I
> access whereas the other versions read the whole records into memory.
>
> Thanks,
> Geoffrey
>
> On Wed, Jun 29, 2011 at 9:51 AM, Josh Ayers <josh.ay...@gmail.com> wrote:
> > Here's an alternative method that uses the built-in search capabilities in
> > PyTables in place of the itertools library.
> >
> > Using readWhere as shown below will return a NumPy ndarray of the data that
> > matches the query.  I think that answers your question #4.  There are
> > similar methods - where and getWhereList - that return an iterator over the
> > matching rows and a list of the matching row indices, respectively.  They
> > may be more appropriate depending on your use case.
> >
> > def iter5(tbl):
> >
> >    keys = set(tbl.col('key'))
> >    for _key in keys:
> >       rows = tbl.readWhere('key == _key')
> >       rows.sort(order = ['value'])
> >       for row in rows:
> >          print(row['key'], row['value'])
> >
> > Hope this helps,
> > Josh
> >
> >
> > On Tue, Jun 28, 2011 at 4:51 PM, Geoffrey Zhu <zyzhu2...@gmail.com> wrote:
> >>
> >> Hi All,
> >>
> >> I am trying to iterate through records in a pytable. The records in
> >> the table are ordered by a key. I need to first divide the records
> >> into groups as defined by the key, then iterate through each group,
> >> and finally iterate through records in each group. The code below does
> >> exactly this:
> >>
> >> def iter0(tbl):
> >>    print "***Iter0 - Iterate records by subgroup****"
> >>    for k1, m in itertools.groupby(tbl,lambda x: x['key']):
> >>        for v in m:
> >>            print v['key'], v['value'], type(v)
> >>
> >> The complexity comes in when I try to iterate through records in each
> >> subgroup in a particular order, i.e., if I want to sort the records in
> >> each group and then iterate through them. Let me generate some fake
> >> data and then go through the four ways I tried. None of them are
> >> ideal.
> >>
> >>
> >> This code generates some fake data for our tests.
> >>
> >> import itertools
> >> import tables
> >>
> >> hf = tables.openFile('sample.h5','w')
> >> # Generate some data
> >> class SampleRecord(tables.IsDescription):
> >>    key = tables.Int32Col()
> >>    value = tables.Int32Col()
> >>
> >>
> >> hf.createTable("/", "samples", SampleRecord, "samples")
> >> for j in range(1, 3):
> >>    for i in range(10,13):
> >>        row = hf.root.samples.row
> >>        row['key'] = j
> >>        row['value'] = i
> >>        row.append()
> >> hf.root.samples.flush()
> >> hf.flush()
> >>
> >> The first method I tried is as follows. This looks exactly like the
> >> previous code, but in the inner loop, I use "for v in
> >> sorted(m,key=lambda x: -x['value'])" instead of "for v in m."
> >>
> >> def iter1(tbl):
> >>    print
> >>    print "****Attempt 1**** - Iterate values by subgroup w/ records in subgroups sorted"
> >>    print "THIS DOES NOT WORK"
> >>    for k1, m in itertools.groupby(tbl,lambda x: x['key']):
> >>        for v in sorted(m,key=lambda x: -x['value']):
> >>            print v['key'], v['value'], type(v)
> >>
> >> However, this gives the wrong results, as follows. I don't know why
> >> it does not work.
> >>
> >> ****Attempt 1**** - Iterate values by subgroup w/ records in subgroups sorted
> >> THIS DOES NOT WORK
> >> 2 10 <type 'tables.tableExtension.Row'>
> >> 2 10 <type 'tables.tableExtension.Row'>
> >> 2 10 <type 'tables.tableExtension.Row'>
> >> 2 12 <type 'tables.tableExtension.Row'>
> >> 2 12 <type 'tables.tableExtension.Row'>
> >> 2 12 <type 'tables.tableExtension.Row'>
> >>
> >>
> >> The second method I tried is as follows. I try to copy what is in the
> >> inner iterator into a list and then sort the list.
> >>
> >> def iter2(tbl):
> >>    print
> >>    print "****Attempt 2**** - Iterate values by subgroup w/ records in subgroups sorted"
> >>    print "THIS DOES NOT WORK, EITHER"
> >>    for k1, m in itertools.groupby(tbl,lambda x: x['key']):
> >>        temp_list = list(m)
> >>        temp_list2 = sorted(temp_list, key=lambda x: -x['value'])
> >>        for v in temp_list2:
> >>            print v['key'], v['value'], type(v)
> >>
> >> This does not work either. The results are similar to the last one.
> >>
> >> ****Attempt 2**** - Iterate values by subgroup w/ records in subgroups sorted
> >> THIS DOES NOT WORK, EITHER
> >> 2 10 <type 'tables.tableExtension.Row'>
> >> 2 10 <type 'tables.tableExtension.Row'>
> >> 2 10 <type 'tables.tableExtension.Row'>
> >> 2 12 <type 'tables.tableExtension.Row'>
> >> 2 12 <type 'tables.tableExtension.Row'>
> >> 2 12 <type 'tables.tableExtension.Row'>
> >>
> >>
> >> The other two methods I tried are as follows. In these methods, I try
> >> to get the row index number from the inner iterator and then reference
> >> the records with these index numbers.
> >>
> >>
> >> def iter3(tbl):
> >>    print
> >>    print "****Attempt 3**** - Iterate values by subgroup w/ records in subgroups sorted"
> >>    print "THIS WORKs, BUT TERRIBLY SLOW!"
> >>
> >>    for k1, m in itertools.groupby(tbl,lambda x: x['key']):
> >>        rows = [x.nrow for x in m]
> >>        sorted_rows = sorted(rows, key = lambda x: -tbl[x]['value'])
> >>        for i in sorted_rows:
> >>            v = tbl[i]
> >>            print v['key'], v['value'], type(v)
> >>
> >> def iter4(tbl):
> >>    print
> >>    print "****Attempt 4**** - Iterate values by subgroup w/ records in subgroups sorted"
> >>    print "THIS WORKs, BUT TERRIBLY SLOW, TOO!"
> >>
> >>    for k1, m in itertools.groupby(tbl,lambda x: x['key']):
> >>        rows = [x.nrow for x in m]
> >>        sorted_rows = sorted(rows, key = lambda x: -tbl[x]['value'])
> >>        for v in tbl.itersequence(sorted_rows):
> >>            print v['key'], v['value'], type(v)
> >>
> >>
> >> These two methods seem to give the correct results, but they are
> >> terribly slow. They are about 10-20 times slower than the original
> >> iterator version.
> >>
> >>
> >> ****Attempt 3**** - Iterate values by subgroup w/ records in subgroups sorted
> >> THIS WORKs, BUT TERRIBLY SLOW!
> >> 1 12 <type 'numpy.void'>
> >> 1 11 <type 'numpy.void'>
> >> 1 10 <type 'numpy.void'>
> >> 2 12 <type 'numpy.void'>
> >> 2 11 <type 'numpy.void'>
> >> 2 10 <type 'numpy.void'>
> >>
> >> ****Attempt 4**** - Iterate values by subgroup w/ records in subgroups sorted
> >> THIS WORKs, BUT TERRIBLY SLOW, TOO!
> >> 1 12 <type 'tables.tableExtension.Row'>
> >> 1 11 <type 'tables.tableExtension.Row'>
> >> 1 10 <type 'tables.tableExtension.Row'>
> >> 2 12 <type 'tables.tableExtension.Row'>
> >> 2 11 <type 'tables.tableExtension.Row'>
> >> 2 10 <type 'tables.tableExtension.Row'>
> >>
> >>
> >>
> >> My questions are:
> >>
> >> 1. Is there any better way to do this?
> >> 2. Why do methods 1 and 2 fail?
> >> 3. In the last two methods, notice that the types of v are different.
> >> One is numpy.void and the other is 'tables.tableExtension.Row'. In
> >> this example, they are used the same way, but when there are nested
> >> structs, they are used differently -- with the former you will do
> >> v['foo']['bar'] and with the latter, you will do v['foo/bar']. Why is
> >> this the case?
> >> 4. If I want to copy part of the table into memory, what is the best
> >> way of doing this?
> >>
> >> Thanks,
> >> Geoffrey
> >>
> >>
> >>
> >
> >
> >
> >
> >
>
>
>

------------------------------------------------------------------------------
All of the data generated in your IT infrastructure is seriously valuable.
Why? It contains a definitive record of application performance, security 
threats, fraudulent activity, and more. Splunk takes this data and makes 
sense of it. IT sense. And common sense.
http://p.sf.net/sfu/splunk-d2d-c2

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
