mclovin wrote:
OK then. I will try some of the strategies here, but I guess things
aren't looking too good. I need to run this over a dataset that someone
pickled. I need to run this 480,000 times, so you can see my
frustration. So it doesn't need to be "real time", but it would be nice
if it were done sorting this month.
Is there a "best guess" strategy where it is not 100% accurate but much
faster?
Well, I timed a run of a version of mine, and the scan is approx 5X
longer than the copy-and-sort. Time arr_of_arr.flatten().sort() to
see how quickly the copy and sort happens. So you could try a variant
exploiting the following property:
If you know the minimum length of a run that will make it into the top 25
(call that length the stride), then the value for each of the most-frequent
run entries must show up at positions n * stride and (n + 1) * stride (for
some n). That should drastically reduce the scan cost, as long as the
stride is reasonably large.
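Roughly, that scan might look like this (just a sketch, not the version I
timed; it assumes arr_of_arr is the 2-D numpy array and stride is a known
lower bound on the run length needed to reach the top 25):

    import numpy as np

    flattened = arr_of_arr.flatten()
    flattened.sort()   # the cheap copy-and-sort step

    # Any value occurring >= stride times in the sorted data must be
    # equal at two consecutive sampled positions n*stride, (n+1)*stride.
    hits = flattened[:-stride:stride] == flattened[stride::stride]
    candidates = np.unique(flattened[:-stride:stride][hits])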
For my uniformly distributed 0..1024 values in a 5M x 5M array:
About 2.5 sec to flatten and sort.
About 15 sec to run one of my heapish thingies.
The least frequency encountered was 24716, so with stride at 24716 (that
least frequency):
    sum(flattened[:-stride:stride] == flattened[stride::stride]) == 1000
So there are only 1000 points to investigate.
With any distribution other than uniform, that should go _way_ down.
So just pull out those points, use bisect to get their frequencies, and
feed those results into the heap accumulation.
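Roughly, that last step might look like this (again just a sketch; it
assumes the sorted flattened array and the candidates from the stride scan
above, with numpy.searchsorted standing in for bisect):

    import heapq
    import numpy as np

    # searchsorted gives the first and one-past-last positions of each
    # candidate in the sorted data, so the difference is its frequency.
    left = np.searchsorted(flattened, candidates, side='left')
    right = np.searchsorted(flattened, candidates, side='right')
    counts = right - left

    # Feed (count, value) pairs into a heap-based top-k selection.
    top25 = heapq.nlargest(25, zip(counts, candidates))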
--Scott David Daniels