Mathieu,

Have you thought about incorporating a standard thesaurus format, and thus a
standard basis for query/index expansion? Here is the recommendation from NISO:
http://www.niso.org/committees/MT-info.html

Beyond synonyms, having the capability to specify the use of BT (broader
terms, or hypernyms) and NT (narrower terms, or hyponyms) is very useful for
giving the query more general or more specific context.
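
To make the BT/NT idea concrete, here is a rough sketch of narrower-term
expansion. The Thesaurus interface is made up for the example and the 0.5
boost is arbitrary; the query classes are the classic (pre-5.x) Lucene API:

import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

/**
 * Illustrative only: expands a single query term with narrower terms (NT)
 * from some thesaurus lookup, at a reduced boost.
 */
public class NarrowerTermExpander {

    /** Hypothetical thesaurus lookup, not an existing Lucene interface. */
    public interface Thesaurus {
        List<String> narrowerTerms(String term);   // NT / hyponym lookup
    }

    public static Query expand(String field, String term, Thesaurus thesaurus) {
        BooleanQuery query = new BooleanQuery();
        // the original term at full weight
        query.add(new TermQuery(new Term(field, term)), BooleanClause.Occur.SHOULD);
        // narrower terms widen the match but should count for less
        for (String nt : thesaurus.narrowerTerms(term)) {
            TermQuery tq = new TermQuery(new Term(field, nt));
            tq.setBoost(0.5f);
            query.add(tq, BooleanClause.Occur.SHOULD);
        }
        return query;
    }
}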

There are other tricks as well, such as weighting terms from a thesaurus based
on their number of occurrences in the index, or extracting potential "used
for" (UF) terms by looking for patterns such as a word followed by a
parenthesized group of a few tokens (e.g. "term (<alternate term>)").
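
As a rough illustration of the parenthesis trick, something along these lines
could scan text for candidate pairs (the pattern and class are just an
example, not part of any contrib):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: finds "term (alternate term)" pairs in raw text. */
public class UsedForExtractor {

    // a word, then a parenthesized group of one to three words
    private static final Pattern TERM_WITH_ALTERNATE =
        Pattern.compile("(\\w+)\\s*\\((\\w+(?:\\s+\\w+){0,2})\\)");

    public static List<String[]> extract(String text) {
        List<String[]> pairs = new ArrayList<String[]>();
        Matcher m = TERM_WITH_ALTERNATE.matcher(text);
        while (m.find()) {
            // pair of {term, candidate "used for" alternate}
            pairs.add(new String[] { m.group(1), m.group(2) });
        }
        return pairs;
    }
}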

J.D.


On Thu, Mar 13, 2008 at 2:52 AM, Mathieu Lecarme <[EMAIL PROTECTED]>
wrote:

> I'll slice my contrib into small parts.
>
> Synonyms
> 1) Synonym (a Token plus a weight)
> 2) A Synonym provider built from the OO.o thesaurus
> 3) SynonymTokenFilter
> 4) A query expander which applies a filter (and a boost) to each of its
> TermQuery clauses
> 5) A Synonym filter for the query expander
> 6) To be efficient, a Synonym can be excluded if it doesn't exist in the index.
> 7) Stemming can be used as a dynamic source of Synonyms
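
For illustration only, here is a minimal sketch of the kind of filter item 3
describes, injecting synonyms from an in-memory map at the same position as
the original token. It is not the JIRA patch, it leaves out the per-synonym
weight from item 1, and it is written against the old Token-based TokenStream
API of Lucene 2.x:

import java.io.IOException;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/** Illustrative sketch only: injects synonyms at the same position. */
public class SimpleSynonymFilter extends TokenFilter {

    private final Map<String, List<String>> synonyms;
    private final LinkedList<Token> pending = new LinkedList<Token>();

    public SimpleSynonymFilter(TokenStream input, Map<String, List<String>> synonyms) {
        super(input);
        this.synonyms = synonyms;
    }

    public Token next() throws IOException {
        if (!pending.isEmpty()) {
            // emit a buffered synonym before pulling the next real token
            return pending.removeFirst();
        }
        Token token = input.next();
        if (token == null) {
            return null;
        }
        List<String> syns = synonyms.get(token.termText());
        if (syns != null) {
            for (String syn : syns) {
                Token injected = new Token(syn, token.startOffset(), token.endOffset());
                injected.setPositionIncrement(0);  // same position as the original token
                pending.add(injected);
            }
        }
        return token;
    }
}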
>
> Spell checking, or the "did you mean?" pattern
> 1) The main concept is in the SpellCheck contrib, but in a
> non-extensible implementation.
> 2) In some languages, like French, homophony is very important in
> misspelling: "there is more than one way to write it".
> 3) Homophony rules are provided by Aspell in a language-neutral format (just
> like Snowball for stemming). I implemented a translator to build Java
> classes from Aspell files (it's the same format in Aspell's descendants,
> MySpell and Hunspell, which are used in the OO.o and Firefox families).
> https://issues.apache.org/jira/browse/LUCENE-956
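
The Aspell-rule translator itself is in the issue above; as a stand-in to
illustrate the general idea of keying words by how they sound, here is a
sketch using Apache Commons Codec's DoubleMetaphone (English-oriented, so only
a rough substitute for language-specific Aspell rules):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.codec.language.DoubleMetaphone;

/**
 * Stand-in illustration only: groups words that share a phonetic key, so a
 * misspelled query term can be matched against words that sound the same.
 */
public class HomophoneIndex {

    private final DoubleMetaphone encoder = new DoubleMetaphone();
    private final Map<String, List<String>> byKey = new HashMap<String, List<String>>();

    public void add(String word) {
        String key = encoder.doubleMetaphone(word);
        List<String> words = byKey.get(key);
        if (words == null) {
            words = new ArrayList<String>();
            byKey.put(key, words);
        }
        words.add(word);
    }

    /** Words that sound like the (possibly misspelled) input. */
    public List<String> soundsLike(String word) {
        List<String> words = byKey.get(encoder.doubleMetaphone(word));
        return words != null ? words : new ArrayList<String>();
    }
}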
>
> Storing information about the words found in an index
> 1) It's the Dictionary used in the SpellCheck contrib, but in a more open
> form: a lexicon. It's a plain old Lucene index; each word becomes a Document,
> and Fields store computed information such as length, n-gram tokens and
> homophony. Everything reuses filters taken from TokenFilter, so code
> duplication is avoided.
> 2) This information may be out of sync with the index, so as not to slow
> down the indexing process; some of it therefore needs to be checked lazily
> (does this synonym actually exist in the index?), and the lexicon can be
> corrected on the fly (if the synonym doesn't exist, write it to the lexicon
> for next time). There is some work to do here to find the best and fastest
> way to keep the index and the lexicon synchronized (hard links, a log for
> nightly replay, a complete iteration over the index to find deleted and new
> entries ...)
> 3) Similar (more than just Synonym) and Near (misspelled) words use the
> Lexicon.
> https://issues.apache.org/jira/browse/LUCENE-1190
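
As a rough illustration of the Document layout described in item 1 (not the
LUCENE-1190 code; the field names, the trigram choice and the classic Field
API are assumptions for the example), one word of the lexicon might become:

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

/** Illustrative only: one lexicon entry as a Lucene Document. */
public class LexiconDocumentFactory {

    public static Document toDocument(String word, String phoneticKey) {
        Document doc = new Document();
        doc.add(new Field("word", word, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("length", Integer.toString(word.length()),
                          Field.Store.YES, Field.Index.NO));
        doc.add(new Field("phonetic", phoneticKey,
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        // one field instance per n-gram; Lucene handles multivalued fields
        for (String gram : ngrams(word, 3)) {
            doc.add(new Field("gram3", gram, Field.Store.NO, Field.Index.UN_TOKENIZED));
        }
        return doc;
    }

    private static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }
}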
>
> Extending it
> 1) The Lexicon can be used to store nouns, i.e. words that work better
> together, like "New York", "Apple II" or "Alexander the Great".
> Extracting nouns from a thesaurus is very hard, but the Wikipedia people
> have done part of the work: article titles can be a good starting point for
> building a noun list. And it works in many languages.
> Nouns can be used as an intuitive PhraseQuery, or as a suggestion for
> refining results.
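
For example, a multi-word noun from the lexicon could be turned directly into
a PhraseQuery (a sketch only; it assumes the field was indexed lowercased and
split on whitespace, and uses the classic PhraseQuery API):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;

/** Illustrative only: "New York" as a phrase query on the given field. */
public class NounQueryFactory {

    public static Query phraseFor(String field, String noun) {
        PhraseQuery query = new PhraseQuery();
        // assumes the field was indexed lowercased and whitespace-tokenized
        for (String word : noun.toLowerCase().split("\\s+")) {
            query.add(new Term(field, word));
        }
        return query;
    }
}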
>
> Implementing it well in Lucene
> The SpellCheck and WordNet contribs do part of this, but in a specific and
> non-extensible way. I think it's better when the foundation is reviewed by
> the Lucene maintainers, and contribs are then built on top of that foundation.
>
> M.
>
>
> Otis Gospodnetic wrote:
> > Grant, I think Mathieu is hinting at his JIRA contribution (I looked at
> it briefly the other day, but haven't had the chance to really understand
> it).
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> > ----- Original Message ----
> > From: Mathieu Lecarme <[EMAIL PROTECTED]>
> > To: java-dev@lucene.apache.org
> > Sent: Wednesday, March 12, 2008 5:47:40 AM
> > Subject: an API for synonym in Lucene-core
> >
> > Why doesn't Lucene have a clean synonym API?
> > The WordNet contrib is not an answer; it provides an interface for its own
> > needs, and most of the world doesn't speak English.
> > Compass provides a tool, just like Solr. Lucene is the framework for
> > applications like Solr, Nutch and Compass, so why not backport low-level
> > features from those projects?
> > A synonym API should provide a TokenFilter, an abstract storage that maps
> > a token -> similar tokens with weights, and a tool for expanding
> > queries.
> > The OpenOffice dictionary project can provide data in different
> > languages, with compatible licences, I presume.
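
As a sketch of what that abstract storage could look like (the names are made
up for illustration, not an existing or proposed Lucene interface):

import java.util.List;

/** Illustrative only: maps a token to weighted similar tokens. */
public interface SynonymSource {

    /** A similar token and how strongly it should count (e.g. as a boost). */
    class WeightedTerm {
        public final String term;
        public final float weight;

        public WeightedTerm(String term, float weight) {
            this.term = term;
            this.weight = weight;
        }
    }

    /** Similar tokens for the given token; possibly empty, never null. */
    List<WeightedTerm> similar(String token);
}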
> >
> > M.
> >
>
>
>
>
