Scott David Daniels wrote:
... Here's a heuristic replacement for my previous frequency code:
I've tried to mark where you could fudge numbers if the run time
is at all close.

Boy, I cannot let go.  I did a bit of a test checking the cost of
calculating the number of discovered samples, and found, after:
    import timeit
    import numpy
    original = numpy.random.uniform(0, 100, (1000, 1000)).astype(int)
    data = original.flatten()
    data.sort()
    part = data[::100]
    t = timeit.Timer('sum(part[:-1]==part[1:])',
                     'from __main__ import part')
    v = timeit.Timer('len(part[part[:-1]==part[1:]])',
                     'from __main__ import part')

I got:
>>> t.repeat(3, 10)
[0.58319842326318394, 0.57617574300638807, 0.57831819407238072]
>>> v.repeat(3, 1000)
[0.93933027801040225, 0.93704535073584339, 0.94096260837613954]
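
Note that the repeat counts differ (10 loops per repeat for t, 1000
for v), so the raw numbers need normalizing before comparing.  A
quick sketch, plugging in the rounded times from the run above:

    t_per_call = min([0.583, 0.576, 0.578]) / 10    # ~0.058 s per sum() call
    v_per_call = min([0.939, 0.937, 0.941]) / 1000  # ~0.00094 s per len() call
    print(t_per_call / v_per_call)                  # roughly 60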

So, len(part[mask]) is roughly 60X faster per call!  I checked
that both forms give the same count:
>>> sum(part[:-1]==part[1:])
9393
>>> len(part[part[:-1]==part[1:]])
9393

That's an awful lot of matches, so I re-timed with high selectivity:
    data = original.flatten()  # no sort, so runs of equal values are missing
    part = data[::100]

>>> t.repeat(3, 10)
[0.58641335700485797, 0.58458854407490435, 0.58872594142576418]
>>> v.repeat(3, 1000)
[0.27352554584422251, 0.27375686015921019, 0.27433291102624935]

about 200X faster per call this time.  The len() form got cheaper
because the fancy indexing now builds a much smaller result array,
while the Python-level sum still walks every sampled element.  Both
still agree:

>>> len(part[part[:-1]==part[1:]])
39
>>> sum(part[:-1]==part[1:])
39
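
The counts differ so much because sorting leaves equal values
adjacent, so adjacent samples of the sorted data often match, while
adjacent samples of the raw data rarely do.  A small sketch, reusing
the original array from above, showing the two selectivities side by
side:

    sorted_part = numpy.sort(original, axis=None)[::100]
    raw_part = original.flatten()[::100]
    print(len(sorted_part[sorted_part[:-1] == sorted_part[1:]]))  # thousands
    print(len(raw_part[raw_part[:-1] == raw_part[1:]]))           # a few dozen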

So my new version of this (compressed) code:
    ...
    sampled = data[::stride]
    matches = sampled[:-1] == sampled[1:]
    candidates = sum(matches) # count identified matches
    while candidates > N * 10: # 10 -- heuristic
        stride *= 2 # heuristic increase
        sampled = data[::stride]
        matches = sampled[:-1] == sampled[1:]
        candidates = sum(matches)
    while candidates < N * 3 and stride > 1: # heuristic slop for long runs
        stride //= 2 # heuristic decrease
        sampled = data[::stride]
        matches = sampled[:-1] == sampled[1:]
        candidates = sum(matches)
    former = None
    past = 0
    for value in sampled[matches]:
        ...
is:
      ...
      sampled = data[::stride]
      candidates = sampled[sampled[:-1] == sampled[1:]]
      while len(candidates) > N * 10: # 10 -- heuristic
          stride *= 2 # heuristic increase
          sampled = data[::stride]
          candidates = sampled[sampled[:-1] == sampled[1:]]
      while len(candidates) < N * 3 and stride > 1: # heuristic slop for long runs
          stride //= 2 # heuristic decrease
          sampled = data[::stride]
          candidates = sampled[sampled[:-1] == sampled[1:]]
      former = None
      past = 0
      for value in candidates:
          ...
This change is important because we try several strides before
settling on one, so the speedup is applied repeatedly.  It also
means we could be pickier when choosing strides (try more values),
since each check is now much cheaper than before.
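
For completeness, here is a minimal runnable sketch of that fragment
as a whole function; the name find_candidates, the default stride,
the sort inside, and the stride > 1 floor are my additions, not part
of the original code:

    import numpy

    def find_candidates(data, N, stride=100):
        # Sort (flattening) so equal values sit next to each other, then
        # sample every stride-th element and keep sampled values that
        # repeat in adjacent positions.
        data = numpy.sort(data, axis=None)
        sampled = data[::stride]
        candidates = sampled[sampled[:-1] == sampled[1:]]
        while len(candidates) > N * 10:        # 10 -- heuristic
            stride *= 2                        # heuristic increase
            sampled = data[::stride]
            candidates = sampled[sampled[:-1] == sampled[1:]]
        while len(candidates) < N * 3 and stride > 1:  # heuristic slop
            stride //= 2                       # heuristic decrease
            sampled = data[::stride]
            candidates = sampled[sampled[:-1] == sampled[1:]]
        return candidates

Something like candidates = find_candidates(original, N=100) would
then feed the former/past loop that follows in the real code.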

Summary: when dealing with numpy (or any bulk <-> individual-value
transition), try several ways that you think are equivalent and
_measure_.  In the OODB work I did we called this "impedance mismatch,"
and it is likely some boundary transitions are _much_ faster than
others.  The sum case is one of them; Python's sum gets handed numpy
booleans rather than plain Python values, converting one element at a
time, so nothing goes through a fast path.
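
As a footnote to that: the comparison already yields an ndarray, so
the count can also stay entirely on the numpy side of the boundary.
A small illustration (not timed above), using ndarray.sum():

    matches = part[:-1] == part[1:]   # boolean ndarray
    count_a = sum(matches)            # Python's sum: converts element by element
    count_b = matches.sum()           # ndarray reduction: stays inside numpy
    count_c = len(part[matches])      # the len() form timed above
    assert count_a == count_b == count_c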

--Scott David Daniels
scott.dani...@acm.org