mclovin wrote:
On Jul 4, 12:51 pm, Scott David Daniels <scott.dani...@acm.org> wrote:
mclovin wrote:
OK then. I will try some of the strategies here, but I guess things
aren't looking too good. I need to run this over a dataset that someone
pickled. I need to run this 480,000 times, so you can see my
frustration. It doesn't need to be "real time", but it would be nice
if it were done sorting this month.
Is there a "best guess" strategy where it is not 100% accurate but much
faster?

Well, I timed a run of a version of mine, and the scan takes approx 5X
longer than the copy-and-sort.  Time arr_of_arr.flatten().sort() to
see how quickly the copy and sort happens.  So you could try a variant
exploiting the following property:
     If you know the minimum length of a run that will be in the top 25,
then the value for each of the most-frequent run entries must show up at
positions n * stride and (n + 1) * stride (for some n).  That should
drastically reduce the scan cost, as long as stride is reasonably large....

sum(flattened[:-stride:stride] == flattened[stride::stride]) == 1000
So there are only 1000 points to investigate.
With any distribution other than uniform, that should go _way_ down.
So just pull out those points, use bisect to get their frequencies, and
feed those results into the heap accumulation.
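
To make that concrete, here is a toy sketch of the idea (untested, and
the names are just for illustration): values whose runs are long enough
must show up at consecutive sampled positions, and bisect then gives the
exact counts.

import bisect
import heapq
import numpy as np

data = np.sort(np.array([7] * 9 + [5] * 6 + [3] * 2 + [1]))
stride = 3
sampled = data[::stride]                       # still sorted
matches = sampled[:-1] == sampled[1:]          # stride-separated equal pairs
candidates = np.unique(sampled[:-1][matches])  # only values worth checking

freqs = []
for value in candidates:
    start = bisect.bisect_left(data, value)
    stop = bisect.bisect_right(data, value, start)
    freqs.append((stop - start, int(value)))

print(heapq.nlargest(2, freqs))  # [(9, 7), (6, 5)] -- the short run of 3s is skipped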

--Scott David Daniels

I don't quite understand what you are saying, but I know this: the number
of times the most common element appears varies greatly. Sometimes it
appears over 1000 times, and sometimes it appears fewer than 50 times. It
all depends on the density of the arrays I am analyzing.

Here's a heuristic replacement for my previous frequency code:
I've tried to mark where you could fudge numbers if the run time
is at all close.

import bisect

def frequency(arr_of_arr, N, stride=100):
    '''produce (freq, value) pairs for data in arr_of_arr.

    Tries to produce > N pairs.  stride is a guess at half
    the length of the shortest run in the top N runs.
    '''
    # if the next two lines are too slow, this whole approach is toast
    data = arr_of_arr.flatten()  # big allocation
    data.sort() # a couple of seconds for 25 million ints

    # stride is a length forcing examination of a run.
    sampled = data[::stride]
    # Note this is a view into data, and is still sorted.
    # We know that any run of length at least 2 * stride in data _must_
    # have consecutive entries in sampled.  Compare them "in parallel".
    matches = sampled[:-1] == sampled[1:]
    # matches is True or False for stride-separated values from sampled
    candidates = sum(matches) # count identified matches

    # while candidates is huge, keep trying with a larger stride
    while candidates > N * 10:  # 10 -- heuristic
        stride *= 2  # heuristic increase
        sampled = data[::stride]
        matches = sampled[:-1] == sampled[1:]
        candidates = sum(matches)

    # if we got too few, move stride down:
    while candidates < N * 3: # heuristic slop for long runs
        stride //= 2 # heuristic decrease
        sampled = data[::stride]
        matches = sampled[:-1] == sampled[1:]
        candidates = sum(matches)

    # Here we have a "nice" list of candidates that is likely
    # to include every run we actually want.  sampled[:-1][matches] is
    # the sorted list of candidate values.  It may have duplicates.
    former = None
    past = 0
    # In the loop we only use sampled to pick the values we then
    # go find in data.  We avoid checking the same value twice.
    for value in sampled[:-1][matches]:
        if value == former:
            continue # A long run: multiple matches in sampled
        former = value # Make sure we only try this one once
        # find the beginning of the run
        start = bisect.bisect_left(data, value, past)
        # find the end of the run (we know it is at least stride long)
        past = bisect.bisect_right(data, value, start + stride)
        yield past - start, value # produce frequency, value data
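
Driving it would look something like this (untested sketch; heapq.nlargest
does the "heap accumulation" step, with N=25 for your top-25 case):

import heapq

top_25 = heapq.nlargest(25, frequency(arr_of_arr, 25))
for count, value in top_25:
    print(value, count)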

--Scott David Daniels
scott.dani...@acm.org
--
http://mail.python.org/mailman/listinfo/python-list
