: That being said, I could see maybe determining a delta value such that if the : distance between any two scores is more than the delta, you cut off the rest : of the docs. This takes into account the relative state of scores and is not : some arbitrary value (although, the delta is, of course)
I read an interesting paper a while back that suggested a similar strategy for a related problem... http://www.isi.edu/integration/people/michelso/paps/ijdar2007.pdf ...while the whole paper might be interesting to some, the relevant parts to this discussion are Section!2.1 and Table#1 . the goal there is to identify which refrence set(s) are relevant to an input set -- they compute a similarty score for each set, sort them, and then compute the percentage difference for each successive pair. they consider any set with a score above the average score for all sets *and* with a score percentage diff (relative the next highest scoring set) greater then some arbitrary delta to be a match. (the theory being that an arbitrary percentage delta is better then an arbitrary score cutoff, and that you only want things scoring better then average, because as scores taper off on the lower end, they can taper off quickly and show very high percentage differneces. I have no idea how well this approach would work for general search (with a large set of documents and a large number of matches) To keep in mind just how diverse the appraoches to this type of problem can be depending on the nitty gritty specifics of your use case, consider the "GuardianComponent" example from my BTB talk at apachecon last year (slides 32-25)... http://people.apache.org/~hossman/apachecon2008us/btb/apache-solr-beyond-the-box.pdf ...either of the approaches mention there tackle the "sacrifice recall to achieve greater precision" aspect of your problem in the specific domain of short documents where you want to eliminate matches that are significantly longer then the input (even if they score well using traditional tf/idf metrics) -Hoss --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
