On Sat, 04 Jul 2009 07:19:48 -0700, Scott David Daniels wrote: > Actually the next step is to maintain a min-heap as you run down the > sorted array. Something like:
Not bad. I did some tests on it, using the following sample data: arr = np.array([xrange(i, i+7000) for i in xrange(143)] + [[750]*7000] + [xrange(3*i, 3*i+7000) for i in xrange(142)]) and compared your code against the following simple function: def count(arr, N): D = {} for v in arr: for x in v: D[x] = D.get(x, 0) + 1 freq = [] for el, f in D.iteritems(): freq.append((f, el)) return sorted(freq, reverse=True)[:N] As a rough figure, your min-heap code is approximately twice as fast as mine. To the OP: I think you should start profiling your code and find out exactly *where* it is slow and concentrate on that. I think that trying a heuristic to estimate the most frequent elements by taking a random sample is likely to be a mistake -- whatever you're trying to accomplish with the frequency counts, the use of such a heuristic will mean that you're only approximately accomplishing it. -- Steven -- http://mail.python.org/mailman/listinfo/python-list