Hi Peter, I took a look at Mark's thesis and briefly at some of his code. It appears to me that what he's done with the so-called forward indexing is to (a) include a unique id with each document (allowing retrieval by id rather than by a standard query), and to (b) include a frequency map class with each document (allowing easier retrieval of term frequency information).
Now I may be missing something very obvious, but it seems to me that both of these functions can be done rather easily with the standard (unmodified) version of Lucene. Moreover, I don't understand how use of these functions will facilitate retrieval of documents that are "similar" to a selected document, as outlined in my original question on this topic. Could you (or anyone else, of course) perhaps elaborate just a bit on how using this approach will help achieve that end? Regards, Terry ----- Original Message ----- From: "Peter Becker" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Thursday, August 21, 2003 1:37 AM Subject: Re: Similar Document Search > Hi all, > > it seems there are quite a few people looking for similar features, i.e. > (a) document identity and (b) forward indexing. So far we hacked (a) by > using a wrapper implementing equals/hashcode based on a unique field, > but of course that assumes maintaining a unique field in the index. (b) > is something we haven't tackled yet, but plan to. > > The source code for Mark's thesis seems to be part of the Haystack > distribution. The comments in the files put it under Apche-license. This > seems to make it a good candidate to be included at least in the Lucene > sandbox -- although I haven't tried it myself yet. But it sounds like a > good candidate for us to use. > > Since the haystack source is a bit larger and I actually couldn't get > the download at the moment, here is a copy of the relevant bit grabbed > from one of my colleague's machines: > > http://www.itee.uq.edu.au/~pbecker/luceneHaystack.tar.gz (22kb) > > Note that this is just a tarball of src/org/apache/lucene out of some > Haystack source. Untested, unmodified. > > I'd love to see something like this supported in the Lucene context were > people might actually find it :-) > > Peter > > > Gregor Heinrich wrote: > > >Hello Terry, > > > >Lucene can do forward indexing, as Mark Rosen outlines in his Master's > >thesis: http://citeseer.nj.nec.com/rosen03email.html. > > > >We use a similar approach for (probabilistic) latent semantic analysis and > >vector space searches. However, the solution is not really completely fixed > >yet, therefore no code at this time... > > > >Best regards, > > > >Gregor > > > > > > > > > >-----Original Message----- > >From: Peter Becker [mailto:[EMAIL PROTECTED] > >Sent: Tuesday, August 19, 2003 3:06 AM > >To: Lucene Users List > >Subject: Re: Similar Document Search > > > > > >Hi Terry, > > > >we have been thinking about the same problem and in the end we decided > >that most likely the only good solution to this is to keep a > >non-inverted index, i.e. a map from the documents to the terms. Then you > >can query the most terms for the documents and query other documents > >matching parts of this (where you get the usual question of what is > >actually interesting: high frequency, low frequency or the mid range). > > > >Indexing would probably be quite expensive since Lucene doesn't seem to > >support changes in the index, and the index for the terms would change > >all the time. We haven't implemented it yet, but it shouldn't be hard to > >code. I just wouldn't expect good performance when indexing large > >collections. > > > > Peter > > > > > >Terry Steichen wrote: > > > > > > > >>Is it possible without extensive additional coding to use Lucene to conduct > >> > >> > >a search based on a document rather than a query? (One use of this would be > >to refine a search by selecting one of the hits returned from the initial > >query and subsequently retrieving other documents "like" the selected one.) > > > > > >>Regards, > >> > >>Terry > >> > >> > >> > >> > >> > > > > > > > >--------------------------------------------------------------------- > >To unsubscribe, e-mail: [EMAIL PROTECTED] > >For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > >--------------------------------------------------------------------- > >To unsubscribe, e-mail: [EMAIL PROTECTED] > >For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
