I'm using Lucene to spell check street names.  Right now, I'm running
Double Metaphone on the street name (we have a sophisticated regex that
parses out the NAME as opposed to the unit, number, street type, or
suffix).  I think Double Metaphone is probably overkill/wrong, and a
spell checking approach (n-gram based) would be better.  Part of the
reason becomes clear if we look at some common mistakes:

For Commonwealth:

  Communwealth
  Comonwealth
  Common wealth

Double Metaphone will catch the first two, but not the last.  Spell
check (I think) would catch all three.  That last kind of mistake is
much more common with street names than in typical generic text search
(Fairoaks vs. Fair Oaks, New Market vs. Newmarket, etc.).  However,
spell check will only catch the third if the n-gram input is
untokenized (right?).
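
To make that concrete, here is the kind of thing I mean, hacked up
against commons-codec's DoubleMetaphone directly (just a standalone
illustration, not our actual analysis chain):

import org.apache.commons.codec.language.DoubleMetaphone;

public class MetaphoneIllustration {
    public static void main(String[] args) {
        DoubleMetaphone dm = new DoubleMetaphone();

        // Left untokenized, the whole name gets a single code.
        System.out.println("Commonwealth -> " + dm.doubleMetaphone("Commonwealth"));

        // Tokenized, "Common wealth" becomes two tokens, each with its
        // own code, and neither code matches the one for "Commonwealth".
        for (String token : "Common wealth".split("\\s+")) {
            System.out.println(token + " -> " + dm.doubleMetaphone(token));
        }
    }
}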

Conceptually, I feel like people will most often misspell or mistype
rather than completely omit words from the street name.  So running
the n-gram on the untokenized street name seems like a good thing.
The problem is that I can't see how to do this: SpellChecker seems to
always want to tokenize things, and I'm a bit confused about how to
give it an analyzer that doesn't tokenize.
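
Here is a rough sketch of what I'm picturing for the spell check side.
It's written against a 4.x-style API, so names may not match the
version we're running, and the StringField/KeywordAnalyzer parts are
just my guess at "don't tokenize", which is really question 2 below:

import java.io.IOException;

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class StreetNameSpellCheckSketch {
    public static void main(String[] args) throws IOException {
        // Index the parsed-out street NAMEs as single, untokenized terms.
        // StringField is never analyzed, so "Fair Oaks" stays one term,
        // spaces and all.
        Directory nameDir = new RAMDirectory();
        IndexWriterConfig iwc =
                new IndexWriterConfig(Version.LUCENE_47, new KeywordAnalyzer());
        try (IndexWriter writer = new IndexWriter(nameDir, iwc)) {
            for (String name : new String[] {"Commonwealth", "Fair Oaks", "Newmarket"}) {
                Document doc = new Document();
                doc.add(new StringField("name", name, Field.Store.YES));
                writer.addDocument(doc);
            }
        }

        // Build the n-gram spell index straight from that untokenized field,
        // so the grams are formed over whole street names.
        Directory spellDir = new RAMDirectory();
        SpellChecker spell = new SpellChecker(spellDir);
        try (DirectoryReader reader = DirectoryReader.open(nameDir)) {
            spell.indexDictionary(new LuceneDictionary(reader, "name"),
                    new IndexWriterConfig(Version.LUCENE_47, new KeywordAnalyzer()),
                    true);
        }

        // The query is treated as one string too, so "Common wealth" is
        // compared against whole names (subject to the accuracy threshold).
        for (String suggestion : spell.suggestSimilar("Common wealth", 5)) {
            System.out.println(suggestion);
        }
        spell.close();
    }
}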

I feel like this might be a newbie question, so apologies if so.  But:
1) does an untokenized n-gram spell checker seem like a good fit for
this app?  2) Which analyzer can I use for no tokenization at all?

--Max
