Re: Removing similar documents from search results

Miles Barr Tue, 15 Mar 2005 02:34:25 -0800

On Mon, 2005-03-14 at 10:24 -0800, David Spencer wrote:
> Yes, in theory the "similarity" package in the sandbox can help.
> The code generates a query for a source document to find documents that 
> are similar to it - the MoreLikeThis class uses the heuristic that 2 
> docs are similar if they share "interesting" words. "Interesting" words 
> are words that are common in a source doc but not too common in the 
> corpus. If you were to do this you'd do something like this:
> 
> [1] Do your normal query
> [2] As you loop thru the results, for every doc
> [2a]  generate a similarity query
> [2b]  requery the index for similar docs
> [2c]  then, maybe, for every doc from [2b] with a score above some 
> threshold, if it's also high up in the results from [2] then "hide" the 
> doc a la Google et al.
> 
> Could be tricky coding. Another way is to only show 1 doc from any given 
> domain. Note that instead of 1 query you'll have "1+n" queries for the 
> display of "n" search results.


That sounds like an interesting approach. But I'll probably wait until
Chuck's patch is included. I'm also a bit worried about the performance
of this approach. It might add too much time to each query.
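
To make the cost concrete, here is a minimal sketch of the idea in plain
Java. It does not use Lucene or the sandbox MoreLikeThis class: the class
name SimilarResultFilter, its methods, and the two thresholds are all
hypothetical, the tokenizer is naive, and the "corpus" is assumed to be
just the result list. It also ignores how often a word occurs inside a
doc, and it compares each result against the results already kept rather
than issuing one similarity query per hit as in [2a]-[2b]:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene or the sandbox "similarity" package.
final class SimilarResultFilter {

    // Naive tokenizer (assumption): lower-case, split on anything that is
    // not an ASCII letter or digit.
    static Set<String> words(String text) {
        Set<String> out = new HashSet<>();
        for (String w : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    // Jaccard overlap of two word sets; 0.0 when both sets are empty.
    static double overlap(Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return (double) common.size() / union.size();
    }

    // Returns the indexes (in rank order) of the results to show.
    static List<Integer> keep(List<String> rankedDocs,
                              double maxDocFraction, double threshold) {
        // Count in how many results each word appears.
        List<Set<String>> docWords = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : rankedDocs) {
            Set<String> ws = words(doc);
            docWords.add(ws);
            for (String w : ws) {
                docFreq.merge(w, 1, Integer::sum);
            }
        }

        // "Interesting" words: in the doc, but in at most maxDocFraction
        // of all results, so very common words drop out.
        double maxDocs = maxDocFraction * rankedDocs.size();
        List<Set<String>> interesting = new ArrayList<>();
        for (Set<String> ws : docWords) {
            Set<String> rare = new HashSet<>();
            for (String w : ws) {
                if (docFreq.get(w) <= maxDocs) {
                    rare.add(w);
                }
            }
            interesting.add(rare);
        }

        // Walk the results in rank order; hide a result when it overlaps
        // enough with a higher-ranked result that is already shown.
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < rankedDocs.size(); i++) {
            boolean hidden = false;
            for (int k : kept) {
                if (overlap(interesting.get(i), interesting.get(k)) >= threshold) {
                    hidden = true;
                    break;
                }
            }
            if (!hidden) {
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> results = Arrays.asList(
            "the lucene index stores the terms of the document",
            "the lucene index stores the terms of each document",
            "the weather in london is wet in the spring",
            "the train to paris leaves in the morning");
        System.out.println(keep(results, 0.5, 0.5));
    }
}
```

With the sample data the second result is hidden as a near copy of the
first. Even in this toy form the work grows with the square of the number
of results on a page, which is the same worry as the "1+n" queries above.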



-- 
Miles Barr <[EMAIL PROTECTED]>
Runtime Collective Ltd.


