Re: Document similarity

Aleksey Serba Fri, 20 Jan 2006 10:32:31 -0800

Yonik, Klaus, thanks for your quick response.

Let me rephrase: I can't compare the document currently being processed
with every document in my collection using the angle between documents
in term-vector space, because of performance issues. As far as I can
see, I can avoid the unnecessary operations. First, I can build a query
from the document's terms, fetch the top N results, and compute the
angle only for those. Is that OK?
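
To make the second stage concrete, here is a minimal sketch of computing the
angle (as a cosine) only for the top N candidates. It assumes the term
frequencies of the new document and of each candidate are already available as
plain maps; how they are obtained (stored term vectors, or re-running the
analyzer) is left out. The class name, method names, and the threshold
parameter are hypothetical, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CosineRerank {

    /** Cosine of the angle between two term-frequency vectors (1.0 = same direction). */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int fa = e.getValue();
            normA += (double) fa * fa;
            Integer fb = b.get(e.getKey());
            if (fb != null) {
                dot += (double) fa * fb;
            }
        }
        for (int fb : b.values()) {
            normB += (double) fb * fb;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty vector is treated as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /**
     * Returns the ids of candidates whose cosine with doc is at least
     * threshold, most similar first. The candidates map is assumed to hold
     * only the top N hits of the term query, not the whole collection.
     */
    static List<String> nearDuplicates(Map<String, Integer> doc,
                                       Map<String, Map<String, Integer>> candidates,
                                       double threshold) {
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, Map<String, Integer>> c : candidates.entrySet()) {
            double s = cosine(doc, c.getValue());
            if (s >= threshold) {
                scored.add(Map.entry(c.getKey(), s));
            }
        }
        scored.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Double> e : scored) {
            ids.add(e.getKey());
        }
        return ids;
    }

    public static void main(String[] args) {
        Map<String, Integer> doc = Map.of("lucene", 2, "index", 1, "similarity", 1);
        // Hypothetical top N candidates, keyed by document id.
        Map<String, Map<String, Integer>> topN = new LinkedHashMap<>();
        topN.put("doc-1", Map.of("lucene", 2, "index", 1, "similarity", 1));
        topN.put("doc-2", Map.of("cooking", 4, "index", 1));
        System.out.println(nearDuplicates(doc, topN, 0.8));
    }
}
```

The cost is then N cosine computations per indexed document instead of one per
document in the collection; the price is that a similar document missing from
the top N is never compared.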


My second question: how can I generate some information about document
similarity to store in the Lucene index? For example, a hash that has
the same value for similar documents, or something like that. That
would make it easy to filter "supplemental" results.
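
One possible shape for such a value is a SimHash-style fingerprint. This is a
technique I am assuming here for illustration, nobody in this thread has
proposed it. Each term votes on the 64 bits of the fingerprint, weighted by its
frequency, so documents with mostly the same terms tend to end up with
fingerprints that differ in only a few bits. Note that this gives "close"
hashes rather than guaranteed equal ones. The class and method names below are
hypothetical, and FNV-1a is just one convenient 64-bit term hash.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

final class SimHashSketch {

    /** 64-bit FNV-1a hash of a term's UTF-8 bytes. */
    static long fnv1a64(String term) {
        long h = 0xcbf29ce484222325L; // FNV offset basis
        for (byte b : term.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L; // FNV prime
        }
        return h;
    }

    /** SimHash-style fingerprint: each term votes on every bit, weighted by frequency. */
    static long simhash(Map<String, Integer> termFreqs) {
        int[] votes = new int[64];
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            long h = fnv1a64(e.getKey());
            int w = e.getValue();
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? w : -w;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= 1L << bit;
            }
        }
        return fingerprint;
    }

    /** Number of bit positions in which two fingerprints differ. */
    static int hamming(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        long x = simhash(Map.of("lucene", 2, "index", 1, "similarity", 1));
        long y = simhash(Map.of("lucene", 2, "index", 1, "similar", 1));
        System.out.println("differing bits: " + hamming(x, y));
    }
}
```

The fingerprint could be stored as an ordinary field at indexing time. An
exact-match filter on it would only catch documents with an identical
fingerprint; catching merely similar ones means comparing fingerprints by
Hamming distance against some cutoff, which would have to be tuned on real
data.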


On 1/20/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> If you didn't want to store term vectors you could also run the
> document fields through the analyzer yourself and collect the Tokens
> (you should still have the fields you just indexed... no need to
> retrieve it again).
>
> -Yonik
>
> On 1/20/06, Klaus <[EMAIL PROTECTED]> wrote:
> >
> > >In my case, I need to filter similar documents in search results and
> > >therefore determine document similarity during the indexing process
> > >using term vectors. Obviously, I can't compare the document currently
> > >being indexed with all documents in my collection.
> >
> > Yes you can. Right after indexing the new document, fetch the term vector
> > for this document from the index. Compute some kind of weight for each term,
> > and construct a Boolean query from all terms. You can use the term weights to
> > boost the term queries. The hits will be scored, and this score is a measure
> > of the similarity between the documents.
> >
> > peace
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

