I am processing a batch of OCR output, i.e. machine-generated text that
contains errors such as garbage characters attached to words and letters
replaced with similar-looking characters (e.g. "I" with "1"). The
text is whitespace-tokenized and I am trying to match each token against an
index using a fuzzy match, so that small amounts of occasional garbage in the
tokens do not prevent a match.
Right now I am preprocessing each query as follows:
// term is a single whitespace-delimited token
Query queryF = parser.Parse(term.Replace("~", "") + "~");
However, searcher.Search still throws "can't parse" exceptions for queries that
contain brackets, quotes and other garbage characters.
So how should I fully preprocess a query to avoid these exceptions?
It looks like I just need to remove a certain set of characters, the same way
the tilde is removed above. What is the complete set of such characters? Do I
need to do any other preprocessing?
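
To make the idea concrete, here is a minimal sketch of the kind of stripping I have in mind. The class and method names are made up for illustration, and the character set is an assumption: it is the list of special characters given in the Lucene query syntax documentation (+ - && || ! ( ) { } [ ] ^ " ~ * ? : \), which may not be complete for every version.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of Lucene.
public static class OcrQueryCleaner
{
    // Assumption: special characters as listed in the Lucene query syntax docs.
    // '&' and '|' are only special when doubled, but they are removed
    // individually here to keep the sketch simple.
    private const string SpecialChars = "+-&|!(){}[]^\"~*?:\\";

    // Removes every special character from a token and appends "~" so the
    // parser treats the result as a fuzzy term. Returns an empty string when
    // nothing is left, so the caller can skip the token.
    public static string ToFuzzyQueryText(string token)
    {
        string cleaned = new string(
            token.Where(c => SpecialChars.IndexOf(c) < 0).ToArray());
        return cleaned.Length == 0 ? "" : cleaned + "~";
    }
}
```

The result would then go to parser.Parse in place of the Replace call above, skipping tokens for which the helper returns an empty string. Stripping rather than escaping seems reasonable here, since in OCR output these characters are most likely garbage anyway.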
Thanks,
Ilya Zavorin