Hi, Does anyone have any suggestions for summarising data in numpy?
The quick description is that I want to do something like the SQL statement: SELECT sum(field1), sum(field2) FROM table GROUP BY field3; The more accurate description is that my data is stored in PyTables HDF format, with 24 monthly files, each with 4m records describing how customers performed that month. Each record looks something like this: ('200604', 651404500000L, '800', 'K', 12L, 162.0, 2000.0, 0.054581, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.80, 0.86, 7.80 17.46, 0.0, 70.0, 0.0, 70.0, -142.93, 0.0, 2000.0, 2063.93, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -9.71, 7.75, 87.46, 77.75, -3.45, 0.22, -0.45, -0.57, 73.95) The first 5 fields are status fields (month_string, account_number, product_code, account_status, months_since_customer_joined). The remaining 48 fields represent different aspects of the customer's performance during that month. I read 100,000 of these records at a time and turn them into a numpy recarray with: dat = hdf_table.read(start=pos, stop=pos+block_size) dat = numpy.asarray(dat._flatArray, dtype=dat.array_descr) I'd like to reduce these 96m records x 53 fields down to monthly averages for each tuple (month_string, months_since_customer_joined) which in the case above is ('200604', 12L). This will let me compare the performance of newly acquired customers at the same point in their lifecycle as customers acquired 1 or 2 years ago. The end result should be a dataset something like res[month_index, months_since_customer_joined] = array([ num_records, sum_field_5, sum_field_6, sum_field_7, ... sum_field_52 ]) with a shape of (24, 24, 49). I've played around with lexsort(), take(), sum(), etc, but get very confused and end up feeling that I'm making things more complicated than they need to be. So any advice from numpy veterans on how best to proceed would be very welcome! Cheers Stephen ------------------------------------------------------------------------- Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV _______________________________________________ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion