So the idea here is that if you do a nearest neighbor search for k items, then you want a few things to be true:
1) you want sufficient overlap between the k approximately nearest neighbors and the k truly nearest neighbors. Sufficient overlap in different applications differs, but 50% or more seems like a good idea. A lower overlap might be OK if the second property holds with a tight bound 2) you want the average, median and maximum distances for the approximate nearest set to be not vastly larger than the corresponding values for the exact nearest set. If the average and maximum distances are about the same, then you probably don't care much about property (1) since you have a set that is about as good. You might even only care about maximum distances. So the question is how to test these properties given that both are only going to hold in probability rather than absolutely. We want to pick thresholds that are tight so that misbehaving searchers will fail, but loose so that good searchers won't fail too often. Probably we should run several tests and assert something about the distribution of results to sharpen this test. In particular, if the measurement in question has a long tail, then averages might give us much sharper test criteria. On Thu, Nov 15, 2012 at 11:22 AM, Dan Filimon <[email protected]>wrote: > Hi! > > Ted, the old fast projection search was failing in the StreamingKMeans > test so I rewrote and simplified it. > It now works with StreamingKMeans and passes its own > FastProjectionSearch tests - most of the time. > > The problem is testEpsilon which compares the distances obtained by > doing a BruteSearch and a FastProjection search on a set of random > vectors (sampled from LumpyDaya). > > This test sometimes fails and sometimes succeeds. The relevant > assertion is the one checking whether bigRatio < 2 [1] (line 115). > This represents the number of vectors for which the difference between > FPS and BS are over 1.4, so the test will pass as long as there is at > most 1. > > However there are somtimes 2 or 3 of these and it fails (sometimes > there are 0 or 1 too). > Also, if I comment this assert out, the averageOverlap assertion passes. > > This entire test looks fishy to me. Where are all the numbers coming from? > :) > > [1] > https://github.com/dfilimon/knn/blob/master/src/test/java/org/apache/mahout/knn/search/FastProjectionSearchTest.java >
