RE: Similar Document Search

Gregor Heinrich Tue, 26 Aug 2003 15:39:16 +0000

Hi Terry,

the suggestion of Haystack's Lucene was a hint to give you an additional
alternative to reach your goal.

Depending on the definition of your notion "similar document", this solution
does or does not make sense. My definition of similar document (and term) is
maybe more general than yours: It supports rather generic similarity metrics
and needs to cover cosine similarity according to vector-space model (VSM;
can be achieved using unmodified Lucene code), semantic similarity according
to a generative model like latent semantic indexing or Bayesian approaches
etc. and even semantic similarity according to a taxonomy. If you want such
a flexibility (like I do for my research), you should consider this approach
because you can relatively easily work on the forward document vectors.

If all you need is vanilla VSM cosine similarity, you are probably best off
with the suggestion that was sent in this list, to submit the document
content in the query and throw it through the same Analyzer that was used to
create the index, thus finding best matches using Lucene's standard matching
scheme.

Good luck,

Gregor

-----Original Message-----
From: Terry Steichen [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 21, 2003 2:54 PM
To: Lucene Users List
Subject: Re: Similar Document Search

Hi Peter,

I took a look at Mark's thesis and briefly at some of his code.  It appears
to me that what he's done with the so-called forward indexing is to (a)
include a unique id with each document (allowing retrieval by id rather than
by a standard query), and to (b) include a frequency map class with each
document (allowing easier retrieval of term frequency information).

Now I may be missing something very obvious, but it seems to me that both of
these functions can be done rather easily with the standard (unmodified)
version of Lucene.  Moreover, I don't understand how use of these functions
will facilitate retrieval of documents that are "similar" to a selected
document, as outlined in my original question on this topic.

Could you (or anyone else, of course) perhaps elaborate just a bit on how
using this approach will help achieve that end?

Regards,

Terry

----- Original Message -----
From: "Peter Becker" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Thursday, August 21, 2003 1:37 AM
Subject: Re: Similar Document Search

> Hi all,
>
> it seems there are quite a few people looking for similar features, i.e.
> (a) document identity and (b) forward indexing. So far we hacked (a) by
> using a wrapper implementing equals/hashcode based on a unique field,
> but of course that assumes maintaining a unique field in the index. (b)
> is something we haven't tackled yet, but plan to.
>
> The source code for Mark's thesis seems to be part of the Haystack
> distribution. The comments in the files put it under Apche-license. This
> seems to make it a good candidate to be included at least in the Lucene
> sandbox -- although I haven't tried it myself yet. But it sounds like a
> good candidate for us to use.
>
> Since the haystack source is a bit larger and I actually couldn't get
> the download at the moment, here is a copy of the relevant bit grabbed
> from one of my colleague's machines:
>
>   http://www.itee.uq.edu.au/~pbecker/luceneHaystack.tar.gz (22kb)
>
> Note that this is just a tarball of src/org/apache/lucene out of some
> Haystack source. Untested, unmodified.
>
> I'd love to see something like this supported in the Lucene context were
> people might actually find it :-)
>
>   Peter
>
>
> Gregor Heinrich wrote:
>
> >Hello Terry,
> >
> >Lucene can do forward indexing, as Mark Rosen outlines in his Master's
> >thesis: http://citeseer.nj.nec.com/rosen03email.html.
> >
> >We use a similar approach for (probabilistic) latent semantic analysis
and
> >vector space searches. However, the solution is not really completely
fixed
> >yet, therefore no code at this time...
> >
> >Best regards,
> >
> >Gregor
> >
> >
> >
> >
> >-----Original Message-----
> >From: Peter Becker [mailto:[EMAIL PROTECTED]
> >Sent: Tuesday, August 19, 2003 3:06 AM
> >To: Lucene Users List
> >Subject: Re: Similar Document Search
> >
> >
> >Hi Terry,
> >
> >we have been thinking about the same problem and in the end we decided
> >that most likely the only good solution to this is to keep a
> >non-inverted index, i.e. a map from the documents to the terms. Then you
> >can query the most terms for the documents and query other documents
> >matching parts of this (where you get the usual question of what is
> >actually interesting: high frequency, low frequency or the mid range).
> >
> >Indexing would probably be quite expensive since Lucene doesn't seem to
> >support changes in the index, and the index for the terms would change
> >all the time. We haven't implemented it yet, but it shouldn't be hard to
> >code. I just wouldn't expect good performance when indexing large
> >collections.
> >
> >  Peter
> >
> >
> >Terry Steichen wrote:
> >
> >
> >
> >>Is it possible without extensive additional coding to use Lucene to
conduct
> >>
> >>
> >a search based on a document rather than a query?  (One use of this would
be
> >to refine a search by selecting one of the hits returned from the initial

> >query and subsequently retrieving other documents "like" the selected
one.)
> >
> >
> >>Regards,
> >>
> >>Terry
> >>
> >>
> >>
> >>
> >>
> >
> >
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: [EMAIL PROTECTED]
> >For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: [EMAIL PROTECTED]
> >For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: Similar Document Search

Reply via email to