[
https://issues.apache.org/jira/browse/LUCENE-2507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916779#action_12916779
]
Robert Muir commented on LUCENE-2507:
-------------------------------------
bq. That is a very good idea yes, but I don't think its necessary to do that
before this is committed.
Here are some *very rough* numbers from that batch0.tab, run against the FIRE
English corpus (sorry, I'm still downloading Wikipedia; it's quite large!).
Note that this is only relative: I don't even know whether all of these terms
exist in that corpus.
Additionally, some terms contain punctuation etc.; I only lowercased them for
consistency.
For reference, there are 547 incorrect/correct term pairs in this aspell
spelling correction test.
My corpus has ~150,000 docs, with 304,000 unique terms in the body field.
For both spellcheckers I used all defaults, e.g.:
spellchecker.suggestSimilar(words[1].toLowerCase(), 1, reader, "body", true);
||impl||Number correct [1] (out of 547)||Number correct, inverted [2] (out of 547)||Avg time in ms [3]||
|SpellChecker|214|218|1.47|
|DirectSpellChecker|242|303|4.53|
1. using the misspelling as a query term, does the spellchecker return the
correct spelling?
2. using the correct spelling as a query term, does the spellchecker return
nothing at all?
3. this is the average time to perform an actual correction; both spellcheckers
have some way to do no work at all for the common (correctly spelled) case.
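To make notes 1 and 2 concrete, here is a minimal sketch of how the two counts could be computed. It is not the harness actually used for the numbers above: {{SpellEval}} and the {{suggest}} function are hypothetical stand-ins for a spellchecker that returns its top suggestion, or null when it suggests nothing, and the column order (misspelling first, correct spelling second) is an assumption.

```java
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical harness, not part of Lucene. 'suggest' maps a query term to the
// top suggestion, or to null when the spellchecker returns nothing at all.
final class SpellEval {
    // Assumed row layout: pair[0] = misspelling, pair[1] = correct spelling.
    // Returns {number correct (note 1), number correct inverted (note 2)}.
    static int[] score(List<String[]> pairs, Function<String, String> suggest) {
        int correct = 0;
        int inverted = 0;
        for (String[] pair : pairs) {
            // Terms are only lowercased, as in the run described above.
            String wrong = pair[0].toLowerCase(Locale.ROOT);
            String right = pair[1].toLowerCase(Locale.ROOT);
            // Note 1: the misspelling should be corrected to the right term.
            if (right.equals(suggest.apply(wrong))) {
                correct++;
            }
            // Note 2: the correct spelling should produce no suggestion.
            if (suggest.apply(right) == null) {
                inverted++;
            }
        }
        return new int[] {correct, inverted};
    }
}
```

Timing (note 3) is left out on purpose; it would only wrap the calls that actually perform a correction.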
So although the benchmark itself isn't meant for search engine benchmarking
(e.g. it contains stopwords/punctuation), this basically shows what I've been
seeing: I think this spellchecker is more accurate than the existing one
(242 vs. 214 correct, 303 vs. 218 inverted), and the perf cost (4.53ms vs.
1.47ms per correction, roughly 3x) is reasonable.
> automaton spellchecker
> ----------------------
>
> Key: LUCENE-2507
> URL: https://issues.apache.org/jira/browse/LUCENE-2507
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/spellchecker
> Reporter: Robert Muir
> Assignee: Robert Muir
> Fix For: 4.0
>
> Attachments: LUCENE-2507.patch, LUCENE-2507.patch, LUCENE-2507.patch,
> LUCENE-2507.patch
>
>
> The current spellchecker makes an n-gram index of your terms, and queries
> this for spellchecking.
> The terms that come back from the n-gram query are then re-ranked by an
> algorithm such as Levenshtein.
> Alternatively, we could just do a Levenshtein query directly against the
> index; then we wouldn't need
> a separate index to rebuild.
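As an illustration of the re-ranking step described in the issue, here is a minimal sketch that orders candidate terms by Levenshtein distance to the query term. The class and method names are hypothetical, and this is plain dynamic programming over strings; it is not the automaton-based query the patch proposes, nor the existing SpellChecker code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only; not Lucene code.
final class LevenshteinRerank {
    // Classic two-row dynamic-programming edit distance
    // (insertions, deletions, substitutions, each costing 1).
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                    Math.min(curr[j - 1] + 1, prev[j] + 1),
                    prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns a copy of the candidates, closest to the query first.
    // The sort is stable, so ties keep their original order.
    static List<String> rerank(String query, List<String> candidates) {
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt((String c) -> distance(query, c)));
        return sorted;
    }
}
```

In the n-gram design the candidate list would come from the separate n-gram index; querying the main index directly removes the need to build and maintain that second index.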