Someone inquired about this one today and I wanted to clarify there is now a better way to do this that I didn't know about when I posted the original:
>>> ind = numpy.array([0,0,0,0,1,1,1,2,2,2,]) >>> data = numpy.arange(10) >>> borders = numpy.arange(len(ind)).compress(numpy.hstack([[1], >>> ind[1:]!=ind[:-1]])) >>> numpy.add.reduceat(data, borders) array([ 6, 15, 24]) On Tue, Jul 18, 2006 at 8:49 AM, Tom Denniston <[email protected]> wrote: > I suggest > > lexsort > itertools.groupby of the indices > take > > I think it would be really great if numpy had the first two as a > function or something like that. It is really useful to be able to > take an array and bucket it and apply further numpy operations like > accumulation functions. > > On 7/18/06, Stephen Simmons <[email protected]> wrote: >> >> Hi, >> >> Does anyone have any suggestions for summarising data in numpy? >> >> The quick description is that I want to do something like the SQL >> statement: >> SELECT sum(field1), sum(field2) FROM table GROUP BY field3; >> >> The more accurate description is that my data is stored in PyTables HDF >> format, with 24 monthly files, each with 4m records describing how >> customers performed that month. Each record looks something like this: >> ('200604', 651404500000L, '800', 'K', 12L, 162.0, 2000.0, 0.054581, 0.0, >> 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, >> 8.80, 0.86, 7.80 17.46, 0.0, 70.0, 0.0, 70.0, -142.93, 0.0, 2000.0, >> 2063.93, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -9.71, 7.75, >> 87.46, 77.75, -3.45, 0.22, -0.45, -0.57, 73.95) >> The first 5 fields are status fields (month_string, account_number, >> product_code, account_status, months_since_customer_joined). The >> remaining 48 fields represent different aspects of the customer's >> performance during that month. I read 100,000 of these records at a time >> and turn them into a numpy recarray with: >> dat = hdf_table.read(start=pos, stop=pos+block_size) >> dat = numpy.asarray(dat._flatArray, dtype=dat.array_descr) >> >> I'd like to reduce these 96m records x 53 fields down to monthly >> averages for each tuple (month_string, months_since_customer_joined) >> which in the case above is ('200604', 12L). This will let me compare the >> performance of newly acquired customers at the same point in their >> lifecycle as customers acquired 1 or 2 years ago. >> >> The end result should be a dataset something like >> res[month_index, months_since_customer_joined] >> = array([ num_records, sum_field_5, sum_field_6, sum_field_7, ... >> sum_field_52 ]) >> with a shape of (24, 24, 49). >> >> I've played around with lexsort(), take(), sum(), etc, but get very >> confused and end up feeling that I'm making things more complicated than >> they need to be. So any advice from numpy veterans on how best to >> proceed would be very welcome! >> >> Cheers >> >> Stephen >> >> ------------------------------------------------------------------------- >> Take Surveys. Earn Cash. Influence the Future of IT >> Join SourceForge.net's Techsay panel and you'll get the chance to share >> your >> opinions on IT & business topics through brief surveys -- and earn cash >> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV >> _______________________________________________ >> Numpy-discussion mailing list >> [email protected] >> https://lists.sourceforge.net/lists/listinfo/numpy-discussion >> > _______________________________________________ NumPy-Discussion mailing list [email protected] http://mail.scipy.org/mailman/listinfo/numpy-discussion
