Hi All,

I am trying to iterate through records in a PyTables table. The records in the table are ordered by a key. I need to first divide the records into groups as defined by the key, then iterate through each group, and finally iterate through the records in each group. The code below does exactly this:

    import itertools
    import tables

    def iter0(tbl):
        print "***Iter0 - Iterate records by subgroup****"
        for k1, m in itertools.groupby(tbl, lambda x: x['key']):
            for v in m:
                print v['key'], v['value'], type(v)

The complexity comes in when I try to iterate through the records in each subgroup in a particular order, i.e., when I want to sort the records in each group and then iterate through them. Let me generate some fake data and then go through the four ways I tried. None of them is ideal.

This code generates some fake data for our tests.

    hf = tables.openFile('sample.h5', 'w')

    # Generate some data
    class SampleRecord(tables.IsDescription):
        key = tables.Int32Col()
        value = tables.Int32Col()

    hf.createTable("/", "samples", SampleRecord, "samples")
    for j in range(1, 3):
        for i in range(10, 13):
            row = hf.root.samples.row
            row['key'] = j
            row['value'] = i
            row.append()
    hf.root.samples.flush()
    hf.flush()

The first method I tried is as follows. It looks exactly like the previous code, but in the inner loop I use "for v in sorted(m, key=lambda x: -x['value'])" instead of "for v in m".

    def iter1(tbl):
        print
        print "****Attempt 1**** - Iterate values by subgroup w/ records in subgroups sorted"
        print "THIS DOES NOT WORK"
        for k1, m in itertools.groupby(tbl, lambda x: x['key']):
            for v in sorted(m, key=lambda x: -x['value']):
                print v['key'], v['value'], type(v)

However, this gives the wrong results, as follows. I don't know why it does not work.

    ****Attempt 1**** - Iterate values by subgroup w/ records in subgroups sorted
    THIS DOES NOT WORK
    2 10 <type 'tables.tableExtension.Row'>
    2 10 <type 'tables.tableExtension.Row'>
    2 10 <type 'tables.tableExtension.Row'>
    2 12 <type 'tables.tableExtension.Row'>
    2 12 <type 'tables.tableExtension.Row'>
    2 12 <type 'tables.tableExtension.Row'>

The second method I tried is as follows. I try to copy what is in the inner iterator into a list and then sort the list.

    def iter2(tbl):
        print
        print "****Attempt 2**** - Iterate values by subgroup w/ records in subgroups sorted"
        print "THIS DOES NOT WORK, EITHER"
        for k1, m in itertools.groupby(tbl, lambda x: x['key']):
            temp_list = list(m)
            temp_list2 = sorted(temp_list, key=lambda x: -x['value'])
            for v in temp_list2:
                print v['key'], v['value'], type(v)

This does not work either. The results are the same as in the first attempt.

    ****Attempt 2**** - Iterate values by subgroup w/ records in subgroups sorted
    THIS DOES NOT WORK, EITHER
    2 10 <type 'tables.tableExtension.Row'>
    2 10 <type 'tables.tableExtension.Row'>
    2 10 <type 'tables.tableExtension.Row'>
    2 12 <type 'tables.tableExtension.Row'>
    2 12 <type 'tables.tableExtension.Row'>
    2 12 <type 'tables.tableExtension.Row'>

The other two methods I tried are as follows. In these methods, I get the row index numbers from the inner iterator and then reference the records by these index numbers.

    def iter3(tbl):
        print
        print "****Attempt 3**** - Iterate values by subgroup w/ records in subgroups sorted"
        print "THIS WORKs, BUT TERRIBLY SLOW!"
        for k1, m in itertools.groupby(tbl, lambda x: x['key']):
            rows = [x.nrow for x in m]
            sorted_rows = sorted(rows, key=lambda x: -tbl[x]['value'])
            for i in sorted_rows:
                v = tbl[i]
                print v['key'], v['value'], type(v)

    def iter4(tbl):
        print
        print "****Attempt 4**** - Iterate values by subgroup w/ records in subgroups sorted"
        print "THIS WORKs, BUT TERRIBLY SLOW, TOO!"
        for k1, m in itertools.groupby(tbl, lambda x: x['key']):
            rows = [x.nrow for x in m]
            sorted_rows = sorted(rows, key=lambda x: -tbl[x]['value'])
            for v in tbl.itersequence(sorted_rows):
                print v['key'], v['value'], type(v)

These two methods seem to give the correct results, but they are terribly slow. They are about 10-20 times slower than the original iterator version.

    ****Attempt 3**** - Iterate values by subgroup w/ records in subgroups sorted
    THIS WORKs, BUT TERRIBLY SLOW!
    1 12 <type 'numpy.void'>
    1 11 <type 'numpy.void'>
    1 10 <type 'numpy.void'>
    2 12 <type 'numpy.void'>
    2 11 <type 'numpy.void'>
    2 10 <type 'numpy.void'>

    ****Attempt 4**** - Iterate values by subgroup w/ records in subgroups sorted
    THIS WORKs, BUT TERRIBLY SLOW, TOO!
    1 12 <type 'tables.tableExtension.Row'>
    1 11 <type 'tables.tableExtension.Row'>
    1 10 <type 'tables.tableExtension.Row'>
    2 12 <type 'tables.tableExtension.Row'>
    2 11 <type 'tables.tableExtension.Row'>
    2 10 <type 'tables.tableExtension.Row'>

My questions are:

1. Is there any better way to do this?
2. Why do methods 1 and 2 fail?
3. In the last two methods, notice that the types of v are different. One is numpy.void and the other is 'tables.tableExtension.Row'. In this example they are used the same way, but when there are nested structs they are used differently -- with the former you write v['foo']['bar'] and with the latter you write v['foo/bar']. Why is this the case?
4. If I want to copy part of the table into memory, what is the best way of doing this?

Thanks,
Geoffrey

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
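
P.S. A note on question 2, offered as a guess rather than a confirmed answer: the output of attempts 1 and 2 is what I would expect if the table iterator hands back one and the same Row object on every step and merely repositions it, so that a list of "rows" is really a list of references to a single object. The sketch below reproduces that pattern without PyTables. ReusedRow, make_rows and the two sorted_groups_* functions are hypothetical names made up for this illustration; ReusedRow is a stand-in I wrote to mimic the suspected behaviour, not the real Row class.

```python
import itertools


class ReusedRow(object):
    """Hypothetical stand-in for a row accessor: one object, many records."""

    def __init__(self, records):
        self._records = records
        self._pos = -1

    def __getitem__(self, name):
        # Always reads from whichever record the iterator points at right now.
        return self._records[self._pos][name]

    def __iter__(self):
        for i in range(len(self._records)):
            self._pos = i
            yield self  # the very same object on every step


def make_rows():
    # Same fake data as in the post: keys 1..2, values 10..12.
    return ReusedRow([{'key': j, 'value': i}
                      for j in range(1, 3) for i in range(10, 13)])


def sorted_groups_naive(rows):
    """Same shape as attempts 1 and 2: sort the row objects themselves."""
    out = []
    for _, group in itertools.groupby(rows, lambda r: r['key']):
        for r in sorted(group, key=lambda r: -r['value']):
            out.append((r['key'], r['value']))
    return out


def sorted_groups_snapshot(rows):
    """Copy the fields out of each row while it is current, then sort."""
    out = []
    for _, group in itertools.groupby(rows, lambda r: r['key']):
        snapshot = [(r['key'], r['value']) for r in group]
        out.extend(sorted(snapshot, key=lambda kv: -kv[1]))
    return out
```

With this stand-in, sorted_groups_naive(make_rows()) gives (2, 10) three times followed by (2, 12) three times, the same pattern as the failing attempts, while sorted_groups_snapshot(make_rows()) gives the groups in descending order of value. If the guess is right, the equivalent on the real table would be to copy the fields out of each row, or to read the group's slice with tbl[start:stop], before sorting. I have not timed either option.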