I suggest lexsort itertools.groupby of the indices take
I think it would be really great if numpy had the first two as a function or something like that. It is really useful to be able to take an array and bucket it and apply further numpy operations like accumulation functions. On 7/18/06, Stephen Simmons <[EMAIL PROTECTED]> wrote: > Hi, > > Does anyone have any suggestions for summarising data in numpy? > > The quick description is that I want to do something like the SQL statement: > SELECT sum(field1), sum(field2) FROM table GROUP BY field3; > > The more accurate description is that my data is stored in PyTables HDF > format, with 24 monthly files, each with 4m records describing how > customers performed that month. Each record looks something like this: > ('200604', 651404500000L, '800', 'K', 12L, 162.0, 2000.0, 0.054581, 0.0, > 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, > 8.80, 0.86, 7.80 17.46, 0.0, 70.0, 0.0, 70.0, -142.93, 0.0, 2000.0, > 2063.93, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -9.71, 7.75, > 87.46, 77.75, -3.45, 0.22, -0.45, -0.57, 73.95) > The first 5 fields are status fields (month_string, account_number, > product_code, account_status, months_since_customer_joined). The > remaining 48 fields represent different aspects of the customer's > performance during that month. I read 100,000 of these records at a time > and turn them into a numpy recarray with: > dat = hdf_table.read(start=pos, stop=pos+block_size) > dat = numpy.asarray(dat._flatArray, dtype=dat.array_descr) > > I'd like to reduce these 96m records x 53 fields down to monthly > averages for each tuple (month_string, months_since_customer_joined) > which in the case above is ('200604', 12L). This will let me compare the > performance of newly acquired customers at the same point in their > lifecycle as customers acquired 1 or 2 years ago. > > The end result should be a dataset something like > res[month_index, months_since_customer_joined] > = array([ num_records, sum_field_5, sum_field_6, sum_field_7, ... > sum_field_52 ]) > with a shape of (24, 24, 49). > > I've played around with lexsort(), take(), sum(), etc, but get very > confused and end up feeling that I'm making things more complicated than > they need to be. So any advice from numpy veterans on how best to > proceed would be very welcome! > > Cheers > > Stephen > > ------------------------------------------------------------------------- > Take Surveys. Earn Cash. Influence the Future of IT > Join SourceForge.net's Techsay panel and you'll get the chance to share your > opinions on IT & business topics through brief surveys -- and earn cash > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV > _______________________________________________ > Numpy-discussion mailing list > Numpy-discussion@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/numpy-discussion > ------------------------------------------------------------------------- Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT & business topics through brief surveys -- and earn cash http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV _______________________________________________ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion