Re: Spell check of a large text

Grant Ingersoll Fri, 12 Dec 2008 09:58:03 -0800


On Dec 12, 2008, at 5:36 AM, Lucene User no 1981 wrote:


Grant,

It's definitely a dictionary-based spell checker. To flesh it out a bit:
currently the document gets indexed and then it's analysed (bad words,
repetitions, etc.). Spell check, with no corrections, would be yet another
step in the process. It's all read-only stuff: the document content is
not modified, it's just tagged accordingly.

That said, I kind of like your idea; I mean, a token filter looks like a
good candidate. As for Jazzy, is it any different from the Lucene
SpellChecker (ngram based)?

Yes, Jazzy is actually a dictionary of correctly spelled words. Lucene's approach (at least the index-based one) is merely a dictionary of words that occur in your corpus, misspellings and all. So, if your goal is to tag words that are really, truly spelled incorrectly, then I'd say Jazzy or some other dictionary tool is the way to go.
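
To make the distinction concrete, here is a minimal plain-Java sketch. It does not use Jazzy or Lucene; DictionaryTagger and both word lists are hypothetical stand-ins. It only shows why a word list built from the corpus itself cannot flag a misspelling that occurs in that corpus, while a curated word list can.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of Jazzy or Lucene.
class DictionaryTagger {
    private final Set<String> knownWords;

    DictionaryTagger(Set<String> knownWords) {
        this.knownWords = knownWords;
    }

    // Returns the tokens that are absent from the word list, in document order.
    // The document itself is not modified; tokens are only reported.
    List<String> tagUnknown(List<String> tokens) {
        List<String> tagged = new ArrayList<>();
        for (String token : tokens) {
            if (!knownWords.contains(token.toLowerCase(Locale.ROOT))) {
                tagged.add(token);
            }
        }
        return tagged;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("the", "documnet", "is", "tagged");

        // Curated list of correctly spelled words (the Jazzy-style approach).
        Set<String> curated = new HashSet<>(Arrays.asList("the", "document", "is", "tagged"));
        // Word list derived from the corpus, misspellings and all.
        Set<String> fromCorpus = new HashSet<>(doc);

        System.out.println(new DictionaryTagger(curated).tagUnknown(doc));    // prints [documnet]
        System.out.println(new DictionaryTagger(fromCorpus).tagUnknown(doc)); // prints []
    }
}
```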

What really matters here is not the
accuracy (decent but not exceptional; there is a manual double-check
of tagged docs anyway). What matters most is performance and ease of
integration. Any grammar check is absolutely immaterial.
About that payload idea, I can only work with a token in a filter. I
could attach something and spit it out, but what would be that
something? It would have to be searchable I assume, otherwise I could
perform the check without filter, out of index. If it's searchable
then, apart from querying, I could perhaps make highlighter work with
it nicely.

Payloads live on Tokens. See the Token.setPayload() method. It would then be searchable by using the BoostingTermQuery (BTQ), but you may need to write some other type of query. For instance, the BTQ will allow you to say, I believe, "give me all documents where a particular term is misspelled", and you can specify that term. However, you may also want "give me all documents that have misspellings", and that is not something the BTQ can do. You probably could hack up the MatchAllDocsQuery to do it, though. Or you could maybe write a QueryFilter that turns on all docs that have a payload present. This is totally out there at this point, so take it with a grain of salt. I think you can achieve what you want, but it will take some lifting.
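
The two kinds of lookup described above can be modelled in plain Java, which may help when deciding what the filter should attach. This is only a sketch of the idea and deliberately uses no Lucene classes: FlaggedToken, MisspellingIndex, and the one-byte marker are hypothetical, and a real implementation would set the marker through Token.setPayload() inside a token filter and query it through Lucene rather than by scanning lists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a token carrying an optional payload.
class FlaggedToken {
    final String term;
    final byte[] payload; // null when the term is in the dictionary

    FlaggedToken(String term, byte[] payload) {
        this.term = term;
        this.payload = payload;
    }
}

// Hypothetical in-memory model; not a Lucene index.
class MisspellingIndex {
    // Assumed marker value; any non-null payload means "misspelled" here.
    static final byte[] MISSPELLED = new byte[] {1};

    private final Set<String> dictionary;
    private final List<List<FlaggedToken>> docs = new ArrayList<>();

    MisspellingIndex(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    // Plays the role of the token filter: every term passes through unchanged,
    // and terms missing from the dictionary get the marker attached.
    // Returns the id assigned to the document.
    int addDocument(List<String> terms) {
        List<FlaggedToken> tokens = new ArrayList<>();
        for (String term : terms) {
            byte[] payload = dictionary.contains(term) ? null : MISSPELLED;
            tokens.add(new FlaggedToken(term, payload));
        }
        docs.add(tokens);
        return docs.size() - 1;
    }

    // "Give me all documents where this particular term is misspelled."
    List<Integer> docsWithFlaggedTerm(String term) {
        List<Integer> hits = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            for (FlaggedToken t : docs.get(id)) {
                if (t.payload != null && t.term.equals(term)) {
                    hits.add(id);
                    break;
                }
            }
        }
        return hits;
    }

    // "Give me all documents that have misspellings."
    List<Integer> docsWithAnyFlag() {
        List<Integer> hits = new ArrayList<>();
        for (int id = 0; id < docs.size(); id++) {
            for (FlaggedToken t : docs.get(id)) {
                if (t.payload != null) {
                    hits.add(id);
                    break;
                }
            }
        }
        return hits;
    }
}
```

The first lookup needs a term to start from, which is the shape of query the BTQ offers; the second has no term at all, which is why it would need a different query or a filter over all documents.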

I have no clue on the performance, but I think the indexing approach could be pretty fast, especially if you can perhaps test a cache of commonly misspelled terms, but I would test that first.
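
One possible shape for such a cache, as a sketch only: a small least-recently-used map in front of the dictionary lookup, built on java.util.LinkedHashMap. CachedSpellLookup is a hypothetical name, the Set stands in for whatever slower checker is actually used, and this version caches the verdict for every term it sees rather than only the misspelled ones. Whether it helps at all would need measuring.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of Jazzy or Lucene.
class CachedSpellLookup {
    private final Set<String> dictionary; // stand-in for a slower dictionary check
    private final Map<String, Boolean> cache;
    private int dictionaryLookups = 0;

    CachedSpellLookup(Set<String> dictionary, final int maxEntries) {
        this.dictionary = dictionary;
        // Access-ordered LinkedHashMap that evicts the least recently used
        // entry once the cache grows beyond maxEntries.
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Returns true when the term is not in the dictionary.
    // Repeated terms are answered from the cache without a dictionary lookup.
    boolean isMisspelled(String term) {
        Boolean cached = cache.get(term);
        if (cached != null) {
            return cached.booleanValue();
        }
        dictionaryLookups++;
        boolean misspelled = !dictionary.contains(term);
        cache.put(term, Boolean.valueOf(misspelled));
        return misspelled;
    }

    // Number of times the underlying dictionary was consulted.
    int dictionaryLookups() {
        return dictionaryLookups;
    }
}
```

Comparing dictionaryLookups() against the total number of tokens processed gives a rough hit rate, which is one way to check whether the cache is worth keeping.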


Cheers,

Grant

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
