[ https://issues.apache.org/jira/browse/LUCENE-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574182#action_12574182 ]
Otis Gospodnetic commented on LUCENE-1190:
------------------------------------------

This sounds like something that might be interesting, but honestly I don't follow the initial description, and the 300KB+ patch is a big one. For example, I don't know what you mean by "Some Lucene features need a list of referring word". Do you mean "a list of associated words"?

{quote}
Lexicon uses a Lucene Directory, each Word is a Document, each meta is a Field (word, ngram, phonetic, fields, anagram, size ...).
{quote}

Each meta is a Field... what do you mean by that? Could you please give an example?

{quote}
Above a minimum size, number of differents words used in an index can be considered as stable. So, a standard Lexicon (built from wikipedia by example) can be used.
{quote}

Hm, not sure I know what you mean. Are you saying that once you create a sufficiently large lexicon/dictionary/index, the number of new terms starts decreasing? (Heaps' Law? http://en.wikipedia.org/wiki/Heaps'_law )

> a lexicon object for merging spellchecker and synonyms from stemming
> --------------------------------------------------------------------
>
>                 Key: LUCENE-1190
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1190
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*, Search
>    Affects Versions: 2.3
>            Reporter: Mathieu Lecarme
>         Attachments: aphone+lexicon.patch, aphone+lexicon.patch
>
> Some Lucene features need a list of referring word. Spellchecking is the
> basic example, but synonyms is an other use. Other tools can be used
> smoothlier with a list of words, without disturbing the main index : stemming
> and other simplification of word (anagram, phonetic ...).
> For that, I suggest a Lexicon object, wich contains words (Term + frequency),
> wich can be built from Lucene Directory, or plain text files.
> Classical TokenFilter can be used with Lexicon (LowerCaseFilter and
> ISOLatin1AccentFilter should be the most useful).
> Lexicon uses a Lucene Directory, each Word is a Document, each meta is a
> Field (word, ngram, phonetic, fields, anagram, size ...).
> Above a minimum size, number of differents words used in an index can be
> considered as stable. So, a standard Lexicon (built from wikipedia by
> example) can be used.
> A similarTokenFilter is provided.
> A spellchecker will come soon.
> A fuzzySearch implementation, a neutral synonym TokenFilter can be done.
> Unused words can be remove on demand (lazy delete?)
> Any criticism or suggestions?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
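[Editorial note] One concrete reading of "each Word is a Document, each meta is a Field": every meta is a value derivable from the word itself, stored alongside it so it can be searched directly. The sketch below is hypothetical (plain Java with no Lucene dependency; the class and method names are invented here, not taken from the patch) and shows how two of the listed metas, character n-grams and an anagram key, plus the size, could be computed before being added as Fields of a per-word Document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of per-word metadata for a Lexicon entry.
// In the proposed design, each of these values would become a Field
// on the Lucene Document representing the word.
public class WordMeta {
    // Character n-grams of the word, e.g. "silent" -> [sil, ile, len, ent] for n=3.
    static List<String> ngrams(String word, int n) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    // Anagram key: the word's characters in sorted order, so that
    // anagrams such as "listen" and "silent" map to the same key.
    static String anagramKey(String word) {
        char[] c = word.toCharArray();
        Arrays.sort(c);
        return new String(c);
    }

    public static void main(String[] args) {
        String word = "silent";
        System.out.println("word=" + word
                + " size=" + word.length()
                + " anagram=" + anagramKey(word)
                + " 3grams=" + ngrams(word, 3));
    }
}
```

A phonetic meta (the "aphone" part of the patch name) would presumably be another such derived field, computed by a phonetic encoder instead of sorting or slicing characters.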
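[Editorial note] On the Heaps' Law question raised in the comment: the law models the number of distinct terms as V(n) = K * n^beta with beta < 1, so new terms never stop appearing, but the rate falls off sharply, which is presumably what the description means by the vocabulary being "considered as stable". A minimal illustration; the constants K and beta below are ballpark values often cited for English text, not measured from any real index:

```java
// Illustrative sketch of Heaps' law: V(n) = K * n^beta, beta < 1.
// K and beta are example constants, not derived from the patch or any corpus.
public class HeapsLaw {
    static long vocabularySize(long tokens, double k, double beta) {
        return Math.round(k * Math.pow(tokens, beta));
    }

    public static void main(String[] args) {
        double k = 44.0, beta = 0.49; // ballpark values for English text
        for (long n = 1_000; n <= 100_000_000L; n *= 100) {
            System.out.println(n + " tokens -> ~"
                    + vocabularySize(n, k, beta) + " distinct terms");
        }
    }
}
```

Multiplying the token count by 100 multiplies the vocabulary by only about 100^0.49, roughly 9.5x, so a lexicon built from a large general corpus such as Wikipedia would indeed cover most terms of a comparable index.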