On Mon, Jan 25, 2010 at 5:16 PM, Keith Goodman <kwgood...@gmail.com> wrote:
> On Mon, Jan 25, 2010 at 1:38 PM, Jan Strube <curious...@gmail.com> wrote:
>> Dear List,
>>
>> I'm trying to speed up a piece of code that selects a subsample based on
>> some criteria.
>>
>> Setup: I have two samples, raw and cut. cut is a pure subset of raw: every
>> element of cut is also in raw, and cut is derived from raw by applying
>> some cuts. Now I would like to select a random subsample of raw and find
>> out how many of its elements are also in cut. In other words, some of the
>> random events pass the cuts and others don't.
>>
>> So in principle I have
>>
>> randomSample = np.random.random_integers(0, len(raw)-1, size=sampleSize)
>> random_that_pass1 = [r for r in raw[randomSample] if r in cut]
>>
>> This is fine (I hope), but slow.
>
> You could construct raw2 and cut2 so that each element placed in cut2 is
> removed from raw2:
>
> idx = np.random.rand(len(raw)) > 0.5  # for example; a boolean mask over raw
> raw2 = raw[~idx]
> cut2 = raw[idx]
>
> If you concatenate raw2 and cut2 you get raw back (but reordered):
>
> raw3 = np.concatenate((raw2, cut2), axis=0)
>
> Any element of the subsample with an index of len(raw2) or greater is in
> cut. That makes counting fast.
>
> There is a setup cost, so I guess it all depends on how many subsamples
> you need from one cut.
>
> Not sure any of this works, just an idea.
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
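[Editor's note: Keith's reorder-and-count idea can be sketched end to end as below. This is a minimal, hypothetical example: raw is a toy array, the "cut" is simulated by a random boolean mask, and all names and sizes are illustrative rather than taken from the original code.]

```python
import numpy as np

np.random.seed(0)
raw = np.arange(1000)                 # toy data standing in for the real sample

# Split raw into the events that fail (raw2) and pass (cut2) the cuts.
# Here the cut is simulated by a random boolean mask over raw.
idx = np.random.rand(len(raw)) > 0.5
raw2 = raw[~idx]
cut2 = raw[idx]

# Reorder raw so that all passing events sit at the end.
raw3 = np.concatenate((raw2, cut2), axis=0)

# Draw a random subsample by index into raw3. Any index >= len(raw2)
# points at an element of cut2, so counting the passes is a single
# vectorized comparison instead of a per-element membership test.
sample_size = 100
sample_idx = np.random.randint(0, len(raw3), size=sample_size)
n_pass = int(np.sum(sample_idx >= len(raw2)))
```

The one-time cost of building raw3 is amortized over however many subsamples are drawn from the same cut, which is the trade-off Keith mentions.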
np.in1d or np.intersect1d in arraysetops should also work; they are pure
Python, but well constructed and tested for performance.

Josef
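[Editor's note: Josef's suggestion maps onto the original snippet roughly as below. np.in1d returns a boolean mask marking which subsample elements are in cut, so the slow per-element `if r in cut` loop becomes one vectorized call; the data and sizes here are made up for illustration. Newer NumPy spells this function np.isin.]

```python
import numpy as np

np.random.seed(1)
raw = np.arange(1000)            # toy data
cut = raw[raw % 3 == 0]          # toy cut: multiples of 3 pass

sample_size = 100
random_sample = np.random.randint(0, len(raw), size=sample_size)

# Test every subsample element for membership in cut in one call,
# instead of looping with `if r in cut`.
mask = np.in1d(raw[random_sample], cut)
n_pass = int(mask.sum())
random_that_pass = raw[random_sample][mask]
```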