On Dec 12, 2008, at 5:36 AM, Lucene User no 1981 wrote:


Grant,

It's definitely a dictionary-based spell checker. To flesh that out a bit:
currently the document gets indexed and then it's analysed (bad words,
repetitions, etc.); the spell check (no corrections) would be yet another
step in the process. It's all read-only: the document content is
not modified, it's just tagged accordingly.
That said, I kind of like your idea; a token filter looks like a
good candidate. As for Jazzy, is it any different from the Lucene
SpellChecker (ngram-based)?

Yes, Jazzy is actually a dictionary of correctly spelled words. Lucene's approach (at least the index-based one) is merely a dictionary of words that occur in your corpus, misspellings and all. So, if your goal is to tag words that are really, truly spelled incorrectly, then I'd say Jazzy or some other dictionary tool is the way to go.
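
To make the difference concrete, here is a minimal sketch in plain Java. It uses neither Jazzy nor Lucene; the class name DictionarySourceSketch, the method fromCorpus and the sample words are all made up for illustration. It only shows why a word list built from your own documents will not flag a misspelling that occurs in them.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not the Jazzy or Lucene API.
final class DictionarySourceSketch {

    // Corpus-derived word list: every term that occurs in the documents,
    // whether or not it is spelled correctly.
    static Set<String> fromCorpus(List<String> docs) {
        Set<String> words = new HashSet<>();
        for (String doc : docs) {
            for (String w : doc.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!w.isEmpty()) {
                    words.add(w);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Assumed curated dictionary of correctly spelled words.
        Set<String> curated = Set.of("the", "quick", "brown", "fox");
        Set<String> corpus = fromCorpus(List.of("the quikc brown fox"));

        System.out.println(curated.contains("quikc")); // false, so it would be flagged
        System.out.println(corpus.contains("quikc"));  // true, so it would slip through
    }
}
```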


What really matters here is not the
accuracy (decent but not exceptional - there is a manual double-check
of tagged docs anyway), what matters most is performance and ease of
integration. Any grammar check is absolutely immaterial.
About that payload idea, I can only work with a token in a filter. I
could attach something and spit it out, but what would that
something be? It would have to be searchable I assume, otherwise I could
perform the check without a filter, outside the index. If it's searchable
then, apart from querying, I could perhaps make highlighter work with
it nicely.

Payloads live on Tokens. See the Token.setPayload() method. It would then be searchable by using the BoostingTermQuery (BTQ), but you may need to write some other type of query. For instance, the BTQ will allow you to say, I believe, "give me all documents where a particular term is misspelled", and you can specify that term. However, you may also want "give me all documents that have misspellings", and that is not something the BTQ can do. You probably could hack up the MatchAllDocsQuery to do it though. Or you could maybe write a QueryFilter that turns on all docs that have a payload present. This is totally out there at this point, so take it with a grain of salt. I think you can achieve what you want, but it will take some lifting.
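
A rough sketch of the tagging idea, in plain Java rather than against the Lucene API. TaggedToken stands in for a Lucene Token, and the one-byte MISSPELLED marker is an assumption about what the payload could contain; in a real TokenFilter you would attach it with Token.setPayload() instead. The hasMisspelling() helper shows the "any payload present" test for a single document's tokens. It is not a Lucene query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Lucene.
final class SpellTagSketch {

    // Assumed one-byte marker meaning "term not found in the dictionary".
    static final byte MISSPELLED = 1;

    // Stand-in for a Lucene Token: a term plus an optional payload.
    static final class TaggedToken {
        final String term;
        final byte[] payload; // null when the token carries no tag

        TaggedToken(String term, byte[] payload) {
            this.term = term;
            this.payload = payload;
        }

        boolean isTagged() {
            return payload != null;
        }
    }

    // Mimics what a token filter would do: every token passes through
    // unchanged, and only terms missing from the dictionary get a payload.
    static List<TaggedToken> tag(List<String> terms, Set<String> dictionary) {
        List<TaggedToken> out = new ArrayList<>();
        for (String term : terms) {
            boolean known = dictionary.contains(term.toLowerCase(Locale.ROOT));
            out.add(new TaggedToken(term, known ? null : new byte[] {MISSPELLED}));
        }
        return out;
    }

    // "Does this document have any misspellings?" for one document's tokens.
    static boolean hasMisspelling(List<TaggedToken> tokens) {
        for (TaggedToken t : tokens) {
            if (t.isTagged()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("the", "quick", "brown", "fox");
        List<TaggedToken> tokens = tag(List.of("The", "quikc", "brown", "fox"), dictionary);
        for (TaggedToken t : tokens) {
            System.out.println(t.term + (t.isTagged() ? "  <- tagged" : ""));
        }
    }
}
```

The document text itself is never changed here, only annotated, which matches the read-only requirement.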

I have no clue on the performance, but I think the indexing approach could be pretty fast, especially if you can keep a cache of commonly misspelled terms, but I would test that first.
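
For what it's worth, one way such a cache might look: a small LRU map in front of the dictionary check, so repeated terms skip the lookup. This is only a sketch under assumptions. CachedSpellCheck and its fields are invented names, and the Set stands in for whatever (presumably slower) dictionary lookup you end up using. Whether it helps at all is exactly the thing to measure.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; the Set is a stand-in for a real,
// more expensive dictionary lookup.
final class CachedSpellCheck {
    private final Set<String> dictionary;
    private final Map<String, Boolean> cache;

    // Counts how often the underlying dictionary was actually consulted.
    int dictionaryLookups = 0;

    CachedSpellCheck(Set<String> dictionary, final int maxEntries) {
        this.dictionary = dictionary;
        // Access-ordered LinkedHashMap that evicts the least recently
        // used entry once it grows past maxEntries.
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxEntries;
            }
        };
    }

    boolean isMisspelled(String term) {
        Boolean cached = cache.get(term);
        if (cached != null) {
            return cached; // answered from the cache
        }
        dictionaryLookups++;
        boolean misspelled = !dictionary.contains(term);
        cache.put(term, misspelled);
        return misspelled;
    }
}
```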

Cheers,
Grant
