Re: [Numpy-discussion] 2D binning

Mathew Yeates Wed, 02 Jun 2010 11:09:54 -0700

I'm on Windows, using a precompiled binary. I never built numpy/scipy on
Windows.


On Wed, Jun 2, 2010 at 10:45 AM, Wes McKinney <wesmck...@gmail.com> wrote:

> On Wed, Jun 2, 2010 at 1:23 PM, Mathew Yeates <mat.yea...@gmail.com>
> wrote:
> > thanks. I am also getting an error in ndi.mean
> > Were you getting the error
> > "RuntimeError: data type not supported"?
> >
> > -Mathew
> >
> > On Wed, Jun 2, 2010 at 9:40 AM, Wes McKinney <wesmck...@gmail.com>
> wrote:
> >>
> >> On Wed, Jun 2, 2010 at 3:41 AM, Vincent Schut <sc...@sarvision.nl>
> wrote:
> >> > On 06/02/2010 04:52 AM, josef.p...@gmail.com wrote:
> >> >> On Tue, Jun 1, 2010 at 9:57 PM, Zachary Pincus<
> zachary.pin...@yale.edu>
> >> >>  wrote:
> >> >>>> I guess it's as fast as I'm going to get. I don't really see any
> >> >>>> other way. BTW, the lat/lons are integers)
> >> >>>
> >> >>> You could (in c or cython) try a brain-dead "hashtable" with no
> >> >>> collision detection:
> >> >>>
> >> >>> for lat, long, data in dataset:
> >> >>>    bin = (lat ^ long) % num_bins
> >> >>>    hashtable[bin] = update_incremental_mean(hashtable[bin], data)
> >> >>>
> >> >>> you'll of course want to do some experiments to see if your data are
> >> >>> sufficiently sparse and/or you can afford a large enough hashtable
> >> >>> array that you won't get spurious hash collisions. Adding error-
> >> >>> checking to ensure that there are no collisions would be pretty
> >> >>> trivial (just keep a table of the lat/long for each hash value,
> which
> >> >>> you'll need anyway, and check that different lat/long pairs don't
> get
> >> >>> assigned the same bin).
> >> >>>
> >> >>> Zach
> >> >>>
> >> >>>
> >> >>>
> >> >>>> -Mathew
> >> >>>>
> >> >>>> On Tue, Jun 1, 2010 at 1:49 PM, Zachary
> >> >>>> Pincus<zachary.pin...@yale.edu
> >> >>>>> wrote:
> >> >>>>> Hi
> >> >>>>> Can anyone think of a clever (non-lopping) solution to the
> >> >>>> following?
> >> >>>>>
> >> >>>>> A have a list of latitudes, a list of longitudes, and list of data
> >> >>>>> values. All lists are the same length.
> >> >>>>>
> >> >>>>> I want to compute an average  of data values for each lat/lon
> pair.
> >> >>>>> e.g. if lat[1001] lon[1001] = lat[2001] [lon [2001] then
> >> >>>>> data[1001] = (data[1001] + data[2001])/2
> >> >>>>>
> >> >>>>> Looping is going to take wayyyy to long.
> >> >>>>
> >> >>>> As a start, are the "equal" lat/lon pairs exactly equal (i.e.
> either
> >> >>>> not floating-point, or floats that will always compare equal, that
> >> >>>> is,
> >> >>>> the floating-point bit-patterns will be guaranteed to be identical)
> >> >>>> or
> >> >>>> approximately equal to float tolerance?
> >> >>>>
> >> >>>> If you're in the approx-equal case, then look at the KD-tree in
> scipy
> >> >>>> for doing near-neighbors queries.
> >> >>>>
> >> >>>> If you're in the exact-equal case, you could consider hashing the
> >> >>>> lat/
> >> >>>> lon pairs or something. At least then the looping is O(N) and not
> >> >>>> O(N^2):
> >> >>>>
> >> >>>> import collections
> >> >>>> grouped = collections.defaultdict(list)
> >> >>>> for lt, ln, da in zip(lat, lon, data):
> >> >>>>    grouped[(lt, ln)].append(da)
> >> >>>>
> >> >>>> averaged = dict((ltln, numpy.mean(da)) for ltln, da in
> >> >>>> grouped.items())
> >> >>>>
> >> >>>> Is that fast enough?
> >> >>
> >> >> If the lat lon can be converted to a 1d label as Wes suggested, then
> >> >> in a similar timing exercise ndimage was the fastest.
> >> >> http://mail.scipy.org/pipermail/scipy-user/2009-February/019850.html
> >> >
> >> > And as you said your lats and lons are integers, you could simply do
> >> >
> >> > ll = lat*1000 + lon
> >> >
> >> > to get unique 'hashes' or '1d labels' for you latlon pairs, as a lat
> or
> >> > lon will never exceed 360 (degrees).
> >> >
> >> > After that, either use the ndimage approach, or you could use
> >> > histogramming with weighting by data values and divide by histogram
> >> > withouth weighting, or just loop.
> >> >
> >> > Vincent
> >> >
> >> >>
> >> >> (this was for python 2.4, also later I found np.bincount which
> >> >> requires that the labels are consecutive integers, but is as fast as
> >> >> ndimage)
> >> >>
> >> >> I don't know how it would compare to the new suggestions.
> >> >>
> >> >> Josef
> >> >>
> >> >>
> >> >>
> >> >>>>
> >> >>>> Zach
> >> >>>> _______________________________________________
> >> >>>> NumPy-Discussion mailing list
> >> >>>> NumPy-Discussion@scipy.org
> >> >>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
> >> >>>>
> >> >>>> _______________________________________________
> >> >>>> NumPy-Discussion mailing list
> >> >>>> NumPy-Discussion@scipy.org
> >> >>>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
> >> >>>
> >> >>> _______________________________________________
> >> >>> NumPy-Discussion mailing list
> >> >>> NumPy-Discussion@scipy.org
> >> >>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
> >> >>>
> >> >
> >> > _______________________________________________
> >> > NumPy-Discussion mailing list
> >> > NumPy-Discussion@scipy.org
> >> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
> >> >
> >>
> >> I was curious about how fast ndimage was for this operation so here's
> >> the complete function.
> >>
> >> import scipy.ndimage as ndi
> >>
> >> N = 10000
> >>
> >> lat = np.random.randint(0, 360, N)
> >> lon = np.random.randint(0, 360, N)
> >> data = np.random.randn(N)
> >>
> >> def group_mean(lat, lon, data):
> >>    indexer = np.lexsort((lon, lat))
> >>    lat = lat.take(indexer)
> >>    lon = lon.take(indexer)
> >>    sorted_data = data.take(indexer)
> >>
> >>    keys = 1000 * lat + lon
> >>    unique_keys = np.unique(keys)
> >>
> >>    result = ndi.mean(sorted_data, labels=keys, index=unique_keys)
> >>    decoder = keys.searchsorted(unique_keys)
> >>
> >>    return dict(zip(zip(lat.take(decoder), lon.take(decoder)), result))
> >>
> >> Appears to be about 13x faster (and could be made faster still) than
> >> the naive version on my machine:
> >>
> >> def group_mean_naive(lat, lon, data):
> >>    grouped = collections.defaultdict(list)
> >>    for lt, ln, da in zip(lat, lon, data):
> >>      grouped[(lt, ln)].append(da)
> >>
> >>    averaged = dict((ltln, np.mean(da)) for ltln, da in grouped.items())
> >>
> >>    return averaged
> >>
> >> I had to get the latest scipy trunk to not get an error from
> ndimage.mean
> >> _______________________________________________
> >> NumPy-Discussion mailing list
> >> NumPy-Discussion@scipy.org
> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion
> >
> >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion@scipy.org
> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
> >
> >
>
> That's the error I was getting. Depending on your OS upgrading to the
> scipy trunk should be the easiest fix.
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Re: [Numpy-discussion] 2D binning

Reply via email to