> From: Brian Goetz [mailto:[EMAIL PROTECTED]]
> 
> > Another case, which does not seem to be supported
> > is when a token is replaced with a sequence of tokens, each
> > representing an *alternative* meaning. Here is an example:
> > 
> >  'dog' -> 'dog',  'pet'
> >  'cat' -> 'cat',  'pet'
> >  'pet' -> 'pet'
> > 
> > When you search for 'pet' you want to match also documents with 'dog' and
> > 'cat' but when you search for 'dog' you don't want to match 'cat' or 'pet'.
> 
> This strategy can be accommodated in the current design as follows:
> have two analyzers, one which expands tokens to include their synonyms,
> and one that doesn't.  Use the first for document tokenization and the
> second for query tokenization.  Voila, everyone's happy.

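Except that phrase searches would then be broken: each injected synonym
takes up a position of its own, so words that were adjacent in the text are
no longer adjacent in the index.  So if you use phrase search at all,
thesaurus expansion is probably better done at query time.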
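
To illustrate the phrase problem, here is a minimal sketch in plain Java,
with no Lucene classes involved.  It assumes that every token the expanding
analyzer emits is indexed at the next position, which is what the remark
about DocumentWriter below suggests.  The names PhraseBreakSketch, expand
and containsPhrase are made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class PhraseBreakSketch {
    // Hypothetical index-time expansions, taken from the example in the thread.
    static final Map<String, List<String>> EXPANSIONS = Map.of(
        "dog", List.of("dog", "pet"),
        "cat", List.of("cat", "pet"));

    // Replaces each token by its expansion; every emitted token gets the
    // next position in the stream (assumed behaviour, see the lead-in).
    static List<String> expand(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.addAll(EXPANSIONS.getOrDefault(token, List.of(token)));
        }
        return out;
    }

    // True if the phrase words occur at consecutive positions in the stream.
    static boolean containsPhrase(List<String> stream, List<String> phrase) {
        return Collections.indexOfSubList(stream, phrase) >= 0;
    }

    public static void main(String[] args) {
        List<String> doc = List.of("dog", "food");
        List<String> phrase = List.of("dog", "food");
        List<String> indexed = expand(doc);

        System.out.println(indexed);                          // [dog, pet, food]
        System.out.println(containsPhrase(doc, phrase));      // true
        System.out.println(containsPhrase(indexed, phrase));  // false
    }
}
```

The inserted 'pet' pushes 'food' one position away from 'dog', so the
phrase "dog food" no longer matches the document it came from.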

(In theory, it is possible to modify the indexing code so that thesaurus
expansion while indexing would work.  At root, you'd need to alter
DocumentWriter line 126 so that multiple terms could be added at the same
position.  Perhaps one could add a subclass of Token that had multiple
values, or something.  However, in practice, I don't think this is worth it,
since thesauri aren't really very useful in search.  There are too few true
synonyms.)

It would be fairly easy to extend the query parser to better support
thesaurus expansion.  Just add a method:
  protected Query termQuery(Term term) {
    return new TermQuery(term);
  }
Subclasses could override this to look up the term in a thesaurus and build
a BooleanQuery for the synonyms.  Thesaurus expansion in phrases would be a
little hairier, but possible, and of even more questionable utility.
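
As a rough sketch of what such a subclass might look like: the Term, Query,
TermQuery and BooleanQuery classes below are tiny stand-ins written for this
example, not the real Lucene classes, and QueryParserSketch, parseWord and
ThesaurusQueryParser are hypothetical names.  Only the termQuery hook itself
comes from the proposal above.  The thesaurus is assumed to map a query word
to the narrower words it should also match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ThesaurusExpansionSketch {
    // Stand-in types, NOT the real Lucene API.
    static class Term {
        final String field, text;
        Term(String field, String text) { this.field = field; this.text = text; }
    }

    interface Query { String describe(); }

    static class TermQuery implements Query {
        final Term term;
        TermQuery(Term term) { this.term = term; }
        public String describe() { return term.field + ":" + term.text; }
    }

    // In this sketch every clause is optional, i.e. the clauses are OR-ed.
    static class BooleanQuery implements Query {
        final List<Query> clauses = new ArrayList<>();
        void add(Query q) { clauses.add(q); }
        public String describe() {
            List<String> parts = new ArrayList<>();
            for (Query q : clauses) parts.add(q.describe());
            return "(" + String.join(" ", parts) + ")";
        }
    }

    static class QueryParserSketch {
        // The proposed hook: one place where a parsed term becomes a query.
        protected Query termQuery(Term term) { return new TermQuery(term); }

        // Hypothetical entry point standing in for the real parsing code.
        public Query parseWord(String field, String word) {
            return termQuery(new Term(field, word));
        }
    }

    static class ThesaurusQueryParser extends QueryParserSketch {
        private final Map<String, List<String>> thesaurus;
        ThesaurusQueryParser(Map<String, List<String>> thesaurus) {
            this.thesaurus = thesaurus;
        }

        @Override
        protected Query termQuery(Term term) {
            List<String> synonyms = thesaurus.get(term.text);
            if (synonyms == null || synonyms.isEmpty()) {
                return super.termQuery(term);  // no expansion for this word
            }
            BooleanQuery any = new BooleanQuery();
            any.add(super.termQuery(term));    // the word itself
            for (String s : synonyms) {
                any.add(new TermQuery(new Term(term.field, s)));
            }
            return any;
        }
    }

    public static void main(String[] args) {
        QueryParserSketch parser =
            new ThesaurusQueryParser(Map.of("pet", List.of("dog", "cat")));
        System.out.println(parser.parseWord("body", "pet").describe());
        // prints (body:pet body:dog body:cat)
        System.out.println(parser.parseWord("body", "dog").describe());
        // prints body:dog
    }
}
```

With this arrangement a search for 'pet' also matches documents containing
'dog' or 'cat', while a search for 'dog' matches only 'dog', and the index
itself is left untouched, so phrase positions are not disturbed.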

Doug

_______________________________________________
Lucene-dev mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/lucene-dev
