So the idea here is that if you do a nearest neighbor search for k items,
then you want a few things to be true:

1) you want sufficient overlap between the k approximately nearest
neighbors and the k truly nearest neighbors.  Sufficient overlap in
different applications differs, but 50% or more seems like a good idea.  A
lower overlap might be OK if the second property holds with a tight bound

2) you want the average, median and maximum distances for the approximate
nearest set to be not vastly larger than the corresponding values for the
exact nearest set.  If the average and maximum distances are about the
same, then you probably don't care much about property (1) since you have a
set that is about as good.  You might even only care about maximum
distances.

So the question is how to test these properties given that both are only
going to hold in probability rather than absolutely.  We want to pick
thresholds that are tight so that misbehaving searchers will fail, but
loose so that good searchers won't fail too often.  Probably we should run
several tests and assert something about the distribution of results to
sharpen this test.  In particular, if the measurement in question has a
long tail, then averages might give us much sharper test criteria.

On Thu, Nov 15, 2012 at 11:22 AM, Dan Filimon
<[email protected]>wrote:

> Hi!
>
> Ted, the old fast projection search was failing in the StreamingKMeans
> test so I rewrote and simplified it.
> It now works with StreamingKMeans and passes its own
> FastProjectionSearch tests - most of the time.
>
> The problem is testEpsilon which compares the distances obtained by
> doing a BruteSearch and a FastProjection search on a set of random
> vectors (sampled from LumpyDaya).
>
> This test sometimes fails and sometimes succeeds. The relevant
> assertion is the one checking whether bigRatio < 2 [1] (line 115).
> This represents the number of vectors for which the difference between
> FPS and BS are over 1.4, so the test will pass as long as there is at
> most 1.
>
> However there are somtimes 2 or 3 of these and it fails (sometimes
> there are 0 or 1 too).
> Also, if I comment this assert out, the averageOverlap assertion passes.
>
> This entire test looks fishy to me. Where are all the numbers coming from?
> :)
>
> [1]
> https://github.com/dfilimon/knn/blob/master/src/test/java/org/apache/mahout/knn/search/FastProjectionSearchTest.java
>

Reply via email to