On Tue, Jun 1, 2010 at 1:51 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> On Tue, Jun 1, 2010 at 4:49 PM, Zachary Pincus <zachary.pin...@yale.edu>
> wrote:
>>> Hi
>>> Can anyone think of a clever (non-looping) solution to the following?
>>>
>>> I have a list of latitudes, a list of longitudes, and a list of data
>>> values. All lists are the same length.
>>>
>>> I want to compute an average of the data values for each lat/lon pair,
>>> e.g. if lat[1001], lon[1001] == lat[2001], lon[2001] then
>>> data[1001] = (data[1001] + data[2001]) / 2
>>>
>>> Looping is going to take way too long.
>>
>> As a start, are the "equal" lat/lon pairs exactly equal (i.e. either
>> not floating-point, or floats that will always compare equal, that is,
>> the floating-point bit patterns are guaranteed to be identical) or
>> approximately equal to float tolerance?
>>
>> If you're in the approx-equal case, then look at the KD-tree in scipy
>> for doing near-neighbors queries.
>>
>> If you're in the exact-equal case, you could consider hashing the
>> lat/lon pairs or something. At least then the looping is O(N) and not
>> O(N^2):
>>
>> import collections
>> import numpy
>>
>> grouped = collections.defaultdict(list)
>> for lt, ln, da in zip(lat, lon, data):
>>     grouped[(lt, ln)].append(da)
>>
>> averaged = dict((ltln, numpy.mean(da)) for ltln, da in grouped.items())
>>
>> Is that fast enough?
>>
>> Zach
>
> This is a pretty good example of the "group-by" problem that will
> hopefully work its way into a future edition of NumPy. Given that, a
> good approach would be to produce a unique key from the lat and lon
> vectors and pass it off to the groupby routine (when it exists).
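For the exact-equal case, Wes's unique-key idea can already be done today
without a Python loop using np.unique and np.bincount. A minimal sketch
(the names key, inv, etc. are just for illustration, and it assumes the
lat/lon floats really do compare exactly equal):

import numpy as np

# label each distinct lat and each distinct lon with a small integer
ulat, ilat = np.unique(lats, return_inverse=True)
ulon, ilon = np.unique(lons, return_inverse=True)

# combine the two labels into one integer key per (lat, lon) pair
key = ilat * len(ulon) + ilon
ukey, inv = np.unique(key, return_inverse=True)

# group means: per-key sums of the data divided by per-key counts
means = np.bincount(inv, weights=vals) / np.bincount(inv)

# recover the coordinates for each group, should they be needed
group_lats = ulat[ukey // len(ulon)]
group_lons = ulon[ukey % len(ulon)]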
meanwhile groupby from itertools will work, but it might be a bit slower
since it has to convert every row to a tuple and collect each group in a
list.

import numpy as np
import itertools

# fake data
N = 10000
lats = np.repeat(180 * (np.random.ranf(N / 250) - 0.5), 250)
lons = np.repeat(360 * (np.random.ranf(N / 250) - 0.5), 250)
np.random.shuffle(lats)
np.random.shuffle(lons)
vals = np.arange(N)

#####################################

# sort so that rows with equal (lat, lon) end up adjacent
inds = np.lexsort((lons, lats))
sorted_lats = lats[inds]
sorted_lons = lons[inds]
sorted_vals = vals[inds]

llv = np.array((sorted_lats, sorted_lons, sorted_vals)).T

# groupby only merges adjacent rows with equal keys, hence the lexsort
for (lat, lon), group in itertools.groupby(llv, lambda row: tuple(row[:2])):
    group_vals = [g[-1] for g in group]
    print lat, lon, np.mean(group_vals)

# make sure the mean for the last lat/lon from the loop matches the mean
# for that lat/lon from the original data.
tests_idx, = np.where((lats == lat) & (lons == lon))
assert np.mean(vals[tests_idx]) == np.mean(group_vals)
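For the approximately-equal case Zach mentions, here is a minimal sketch of
the KD-tree route, assuming scipy.spatial.KDTree and a hypothetical
tolerance tol. Note this averages, for each point, all values within tol of
that point, which only matches a true group-by when the clusters are well
separated relative to tol:

import numpy as np
from scipy.spatial import KDTree

pts = np.column_stack((lats, lons))
tree = KDTree(pts)

tol = 1e-6  # hypothetical tolerance; pick one that fits the data

# query_ball_point with an array of points returns, for each point, the
# indices of all points within tol of it; average the data values over
# each such neighborhood
neighborhoods = tree.query_ball_point(pts, tol)
approx_means = np.array([vals[idx].mean() for idx in neighborhoods])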