I'm using Lucene to spell check street names. Right now I'm running Double Metaphone on the street name (we have a sophisticated regex that parses out the NAME as opposed to the unit, number, street type, or suffix). I think Double Metaphone is probably overkill, or just the wrong tool, and that a spell-checking approach (n-gram based) would be better. Part of the reason shows up in some common mistakes for Commonwealth:

  - Communwealth
  - Comonwealth
  - Common wealth

Double Metaphone will catch the first two, but not the last.
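Here's the quick check I've been using to convince myself of that. It's only a sketch: it calls Commons Codec's DoubleMetaphone directly (my assumption about which encoder is the relevant one), and the point is that because the name gets tokenized before the phonetic encoding, "Common wealth" reaches the encoder as two separate tokens:

    import org.apache.commons.codec.language.DoubleMetaphone;

    public class MetaphoneCheck {
        public static void main(String[] args) {
            DoubleMetaphone dm = new DoubleMetaphone();

            // Codes for the correct name and the two "typo" misspellings.
            for (String name : new String[] {"Commonwealth", "Communwealth", "Comonwealth"}) {
                System.out.println(name + " -> " + dm.doubleMetaphone(name));
            }

            // After tokenization, "Common wealth" is two separate tokens, so
            // neither code has a chance of matching the code for "Commonwealth".
            for (String token : new String[] {"Common", "wealth"}) {
                System.out.println(token + " -> " + dm.doubleMetaphone(token));
            }
        }
    }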
Spell check (I think) would get all three. That last kind of mistake, splitting or joining words, is much more common with street names than in typical generic text search (Fairoaks vs. Fair Oaks, New Market vs. Newmarket, etc.). However, spell check will only catch the third if the n-gram input is untokenized (right?).

Conceptually, I feel like people will most often misspell or mistype rather than completely omit words from the street name, so running the n-grams on the untokenized street name seems like a good thing. The problem is that I can't see how to do this: SpellChecker seems to always want to tokenize things, and I'm a bit confused about how to give it an analyzer that doesn't tokenize (rough sketch of what I mean in the P.S. below).

I feel like this might be a newbie question, so apologies if so. But:

1) Does an untokenized n-gram spell checker seem like a good approach for this app?
2) Which analyzer can I use for no tokenization at all?

--Max
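P.S. Here's roughly the setup I have in mind, in case it makes the question clearer. This is only a sketch against the Lucene 3.6 API (method signatures differ between versions), and street-names.txt, the index path, and the KeywordAnalyzer choice are my assumptions, not something I have working:

    import java.io.File;
    import java.io.FileReader;

    import org.apache.lucene.analysis.KeywordAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.spell.PlainTextDictionary;
    import org.apache.lucene.search.spell.SpellChecker;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class StreetNameSpell {
        public static void main(String[] args) throws Exception {
            Directory spellDir = FSDirectory.open(new File("street-spell-index"));
            SpellChecker spell = new SpellChecker(spellDir);

            // street-names.txt: one full street name per line ("Commonwealth",
            // "Fair Oaks", "New Market", ...). PlainTextDictionary hands each line
            // to SpellChecker as a single term, so the n-grams cover the whole name.
            PlainTextDictionary dict = new PlainTextDictionary(new FileReader("street-names.txt"));

            // SpellChecker builds its own n-gram fields from each dictionary entry;
            // as far as I can tell the analyzer here doesn't re-tokenize the entries,
            // so KeywordAnalyzer is mostly just stating the intent.
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36, new KeywordAnalyzer());
            spell.indexDictionary(dict, iwc, true);

            for (String suggestion : spell.suggestSimilar("Common wealth", 5)) {
                System.out.println(suggestion);
            }
            spell.close();
        }
    }

Is PlainTextDictionary with one full street name per line the right way to keep the input untokenized, or do I still need to worry about the analyzer somewhere else?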