mclovin wrote:
On Jul 4, 12:51 pm, Scott David Daniels <scott.dani...@acm.org> wrote:
mclovin wrote:
OK then. I will try some of the strategies here, but I guess things
aren't looking too good. I need to run this over a dataset that someone
pickled. I need to run this 480,000 times, so you can see my
frustration. It doesn't need to be "real time", but it would be nice
if it were done sorting this month.
Is there a "best guess" strategy where it is not 100% accurate but much
faster?
Well, I timed a run of a version of mine, and the scan is approx 5X
longer than the copy-and-sort. Time arr_of_arr.flatten().sort() to
see how quickly the copy and sort happens. So you could try a variant
exploiting the following property:
If you know the minimum length of a run that will be in the top 25,
then the value for each of the most-frequent run entries must show up at
positions n * stride and (n + 1) * stride (for some n). That should
drastically reduce the scan cost, as long as stride is reasonably large....
Suppose sum(flattened[:-stride:stride] == flattened[stride::stride]) == 1000.
Then there are only 1000 points to investigate.
With any distribution other than uniform, that should go _way_ down.
So just pull out those points, use bisect to get their frequencies, and
feed those results into the heap accumulation.
--Scott David Daniels
I don't quite understand what you are saying, but I know this: the number
of times the most common element appears varies greatly. Sometimes it
appears over 1000 times, and sometimes it appears fewer than 50. It all
depends on the density of the arrays I am analyzing.
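Here is a toy, self-contained sketch of the strided test from my last
note. The array below is made-up uniform data (nothing like your real
arrays); it is only there to show the mechanics of finding candidates:

import numpy as np

# stand-in for the flattened, sorted data (made-up values)
flattened = np.sort(np.random.randint(0, 20000, size=1000000))
stride = 100

# a value whose run is long enough must appear at two consecutive
# strided positions, so this comparison flags every such candidate
matches = flattened[:-stride:stride] == flattened[stride::stride]
print("%d sample positions matched" % sum(matches))

# the distinct candidate values worth counting exactly
candidates = np.unique(flattened[:-stride:stride][matches])
print("%d distinct candidate values" % len(candidates))

Only those candidate values ever need an exact count; everything else
is too rare to make the top of the list.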
Here's a heuristic replacement for my previous frequency code:
I've tried to mark where you could fudge numbers if the run time
is at all close.
import bisect

def frequency(arr_of_arr, N, stride=100):
    '''Produce (freq, value) pairs for data in arr_of_arr.

    Tries to produce > N pairs.  stride is a guess at half
    the length of the shortest run in the top N runs.
    '''
    # If the next two lines are too slow, this whole approach is toast.
    data = arr_of_arr.flatten()   # big allocation
    data.sort()                   # a couple of seconds for 25 million ints
    # stride is a length forcing examination of a run.
    sampled = data[::stride]
    # Note this is a view into data, and is still sorted.
    # We know that any run of length 2 * stride - 1 in data _must_ have
    # consecutive entries in sampled.  Compare them "in parallel".
    matches = sampled[:-1] == sampled[1:]
    # matches is True or False for stride-separated values from sampled.
    candidates = sum(matches)     # count identified matches
    # While candidates is huge, keep trying with a larger stride.
    while candidates > N * 10:    # 10 -- heuristic
        stride *= 2               # heuristic increase
        sampled = data[::stride]
        matches = sampled[:-1] == sampled[1:]
        candidates = sum(matches)
    # If we got too few, move stride down:
    while candidates < N * 3:     # heuristic slop for long runs
        stride //= 2              # heuristic decrease
        sampled = data[::stride]
        matches = sampled[:-1] == sampled[1:]
        candidates = sum(matches)
    # Here we have a "nice" set of candidates that is likely to include
    # every run we actually want.  sampled[:-1][matches] is the sorted
    # list of candidate values (it may contain duplicates).
    former = None
    past = 0
    # In the loop here we only use sampled to pick the values we then
    # go find in data.  We avoid counting the same value twice.
    for value in sampled[:-1][matches]:
        if value == former:
            continue              # a long run: multiple matches in sampled
        former = value            # make sure we only try this one once
        # Find the beginning of the run.
        start = bisect.bisect_left(data, value, past)
        # Find the end of the run (we know it is at least stride long).
        past = bisect.bisect_right(data, value, start + stride)
        yield past - start, value # produce (frequency, value) pairs
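And a rough sketch of how I'd expect you to drive it -- heapq.nlargest
and N=25 here are my guesses at what you want, and arr_of_arr stands in
for one of your pickled arrays:

import heapq

# take the 25 largest (count, value) pairs from the candidate runs
top = heapq.nlargest(25, frequency(arr_of_arr, 25))
for count, value in top:
    print("%s appears %d times" % (value, count))

Since frequency() is a generator, nlargest only keeps 25 pairs around
at a time, so the heap accumulation stays cheap.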
--Scott David Daniels
scott.dani...@acm.org