mclovin wrote:
OK then. I will try some of the strategies here, but I guess things
aren't looking too good. I need to run this over a dataset that someone
pickled. I need to run this 480,000 times, so you can see my
frustration. So it doesn't need to be "real time", but it would be nice
if it were done sorting this month.
Is there a "best guess" strategy where it is not 100% accurate but much
faster?
Well, I timed a run of a version of mine, and the scan is approx 5X
longer than the copy-and-sort. Time arr_of_arr.flatten().sort() to
see how quickly the copy and sort happens. So you could try a variant
exploiting the following property:
If you know the minimum length of a run that will make it into the top 25
(call that length the stride), then the value for each of the most-frequent
run entries must show up at positions n * stride and (n + 1) * stride (for
some n). That should drastically reduce the scan cost, as long as the
stride is reasonably large.
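Roughly, that scan might look like this (just a sketch, not the version I
timed; it assumes arr_of_arr is the 2-D numpy array and stride is a known
lower bound on the run length needed to reach the top 25):

    import numpy as np

    flattened = arr_of_arr.flatten()
    flattened.sort()   # the cheap copy-and-sort step

    # Any value occurring >= stride times in the sorted data must be
    # equal at two consecutive sampled positions n*stride, (n+1)*stride.
    hits = flattened[:-stride:stride] == flattened[stride::stride]
    candidates = np.unique(flattened[:-stride:stride][hits])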
For my uniformly distributed 0..1024 values in a 5M x 5M array:
About 2.5 sec to flatten and sort.
About 15 sec to run one of my heapish thingies.
The least frequency encountered was 24716, so with stride at 24716 (that
least frequency):
    sum(flattened[:-stride:stride] == flattened[stride::stride]) == 1000
So there are only 1000 points to investigate.
With any distribution other than uniform, that should go _way_ down.
So just pull out those points, use bisect to get their frequencies, and
feed those results into the heap accumulation.
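Roughly, that last step might look like this (again just a sketch; it
assumes the sorted flattened array and the candidates from the stride scan
above, with numpy.searchsorted standing in for bisect):

    import heapq
    import numpy as np

    # searchsorted gives the first and one-past-last positions of each
    # candidate in the sorted data, so the difference is its frequency.
    left = np.searchsorted(flattened, candidates, side='left')
    right = np.searchsorted(flattened, candidates, side='right')
    counts = right - left

    # Feed (count, value) pairs into a heap-based top-k selection.
    top25 = heapq.nlargest(25, zip(counts, candidates))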
--Scott David Daniels