Re: finding most common elements between thousands of multiple arrays.

Steven D'Aprano Sat, 04 Jul 2009 18:44:55 -0700

On Sat, 04 Jul 2009 07:19:48 -0700, Scott David Daniels wrote:

> Actually the next step is to maintain a min-heap as you run down the
> sorted array.  Something like:


Not bad.

I did some tests on it, using the following sample data:

arr = np.array([xrange(i, i+7000) for i in xrange(143)] + 
      [[750]*7000] + [xrange(3*i, 3*i+7000) for i in xrange(142)])


and compared your code against the following simple function:


def count(arr, N):
    D = {}
    for v in arr:
        for x in v:
            D[x] = D.get(x, 0) + 1
    freq = []
    for el, f in D.iteritems():
        freq.append((f, el))
    return sorted(freq, reverse=True)[:N]


As a rough figure, your min-heap code is approximately twice as fast as 
mine.

To the OP: I think you should start profiling your code and find out 
exactly *where* it is slow and concentrate on that. I think that trying a 
heuristic to estimate the most frequent elements by taking a random 
sample is likely to be a mistake -- whatever you're trying to accomplish 
with the frequency counts, the use of such a heuristic will mean that 
you're only approximately accomplishing it.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: finding most common elements between thousands of multiple arrays.

Reply via email to