[jira] Commented: (LUCENE-2507) automaton spellchecker

Robert Muir (JIRA) Thu, 30 Sep 2010 21:49:01 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916779#action_12916779
 ]


Robert Muir commented on LUCENE-2507:
-------------------------------------

bq. That is a very good idea yes, but I don't think it's necessary to do that 
before this is committed.

Here are some *very rough* numbers from that batch0.tab, against the FIRE English 
corpus (sorry, I'm still downloading Wikipedia; it's quite large!).
Note that this is only relative: I don't even know whether these terms all exist in 
that corpus.
Additionally, some terms contain punctuation etc.; I only lowercased them for 
consistency.

For reference, there are 547 incorrect/correct term pairs in this aspell 
spelling correction test.
My corpus has ~150,000 docs, with 304,000 unique terms in the body field.
For both spellcheckers I used all defaults, e.g. 
spellchecker.suggestSimilar(words[1].toLowerCase(), 1, reader, "body", true);

||impl||Number correct[1] (out of 547)||Number correct, inverted[2] (out of 547)||Avg time in ms[3]||
|SpellChecker|214|218|1.47|
|DirectSpellChecker|242|303|4.53|

1. using the misspelling as a query term, does the spellchecker return the 
correct spelling?
2. using the correct spelling as a query term, does the spellchecker return 
nothing at all?
3. this is the average time to perform an actual correction; both spellcheckers 
have some way to do no work at all for the common (correctly spelled) case.
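
A minimal sketch of how the two accuracy columns above could be counted, in case it helps to reproduce them. This is not the actual harness used for these numbers: the class name SpellEval, the Counts holder, and the Function-based suggester (standing in for a call like suggestSimilar) are all hypothetical, and it assumes each test pair is given as {misspelling, correct spelling}.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

public class SpellEval {
    /** Holder for the two accuracy counts (hypothetical, for illustration). */
    public static final class Counts {
        public final int correct;          // column [1]
        public final int correctInverted;  // column [2]

        Counts(int correct, int correctInverted) {
            this.correct = correct;
            this.correctInverted = correctInverted;
        }
    }

    /**
     * pairs: each element is assumed to be {misspelling, correct spelling}.
     * suggester: stand-in for the spellchecker under test; returns suggestions
     * best-first, or an empty array when it has nothing to suggest.
     */
    public static Counts evaluate(List<String[]> pairs,
                                  Function<String, String[]> suggester) {
        int correct = 0;
        int inverted = 0;
        for (String[] pair : pairs) {
            // Only lowercasing is applied, as in the comment above.
            String wrong = pair[0].toLowerCase(Locale.ROOT);
            String right = pair[1].toLowerCase(Locale.ROOT);

            // [1] Misspelling as the query term: is the top suggestion
            // the correct spelling?
            String[] suggestions = suggester.apply(wrong);
            if (suggestions.length > 0 && suggestions[0].equals(right)) {
                correct++;
            }

            // [2] Correct spelling as the query term: does the spellchecker
            // return nothing at all?
            if (suggester.apply(right).length == 0) {
                inverted++;
            }
        }
        return new Counts(correct, inverted);
    }
}
```

Timing [3] is left out of the sketch; it would only need a clock around the first suggester call.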

So although the benchmark itself isn't meant for search engine benchmarking (e.g. 
it contains stopwords/punctuation), this basically shows what I've been seeing: 
I think this spellchecker is more accurate than the existing one, and the perf cost 
is reasonable.


> automaton spellchecker
> ----------------------
>
>                 Key: LUCENE-2507
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2507
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 4.0
>
>         Attachments: LUCENE-2507.patch, LUCENE-2507.patch, LUCENE-2507.patch, 
> LUCENE-2507.patch
>
>
> The current spellchecker makes an n-gram index of your terms, and queries 
> this for spellchecking.
> The terms that come back from the n-gram query are then re-ranked by an 
> algorithm such as Levenshtein.
> Alternatively, we could just do a Levenshtein query directly against the 
> index; then we wouldn't need
> a separate index to rebuild.
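
To make the idea in the description concrete, here is a rough sketch of ranking terms directly by Levenshtein distance, with no separate n-gram index. It is an illustration only: the brute-force scan over an in-memory term list stands in for the automaton-based query against the index and is not how the patch does it. The names EditDistanceSketch, suggest, and maxEdits are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class EditDistanceSketch {
    /** Classic two-row dynamic-programming Levenshtein distance. */
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // substitute or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Returns the terms within maxEdits of word, closest first.
     * The word itself is skipped, since it is not a correction.
     */
    public static List<String> suggest(String word, List<String> terms, int maxEdits) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (!term.equals(word) && levenshtein(word, term) <= maxEdits) {
                hits.add(term);
            }
        }
        hits.sort(Comparator.comparingInt((String t) -> levenshtein(word, t)));
        return hits;
    }
}
```

A linear scan like this would be far too slow over hundreds of thousands of unique terms; it only shows what the query is meant to return.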

