You could try taking a large corpus of text (say Wikipedia) and use it to inform the likelihood of word sequences. Take the OCR output, produce fuzzy spelling variations for each word in a window of text (say 5 or 6 words), and then examine the likelihood of the different permutations using the corpus. That's a lot of combinations, edit-distance calculations and a lot of SpanQueries, so performance will suffer, but accuracy is likely to be better than anything based on single-word analysis. As mentioned before, if a "confidence level" were available from the OCR software then that would avoid a lot of unnecessary lookups, or the potential replacement of correctly OCRed words with alternative words deemed to be statistically more likely.
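Very roughly, the window scoring could look something like the sketch below (just an illustration, assuming a corpus index with a "text" field and a fairly recent Lucene; the class name and field name are made up, and the span query package has moved between Lucene versions):

import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.spans.SpanNearQuery; // org.apache.lucene.queries.spans in newer Lucene
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class CorpusWindowScorer {

  private final IndexSearcher corpusSearcher; // searcher over the corpus (e.g. Wikipedia) index
  private final String field;                 // assumed field name, e.g. "text"

  public CorpusWindowScorer(IndexSearcher corpusSearcher, String field) {
    this.corpusSearcher = corpusSearcher;
    this.field = field;
  }

  // Likelihood proxy for one candidate spelling of the window: how many
  // corpus documents contain the words near each other, in order.
  public long score(String[] candidateWindow) throws IOException {
    SpanQuery[] clauses = new SpanQuery[candidateWindow.length];
    for (int i = 0; i < candidateWindow.length; i++) {
      clauses[i] = new SpanTermQuery(new Term(field, candidateWindow[i]));
    }
    SpanNearQuery near = new SpanNearQuery(clauses, 1, true); // small slop, ordered
    // On older Lucene versions, search(near, 1).totalHits would serve the same purpose.
    return corpusSearcher.count(near);
  }
}

You'd call score() for every permutation of the fuzzy variations of a window and keep the best-scoring one, which is where the combinatorial cost mentioned above comes from.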
Cheers
Mark

----- Original Message ----
From: Paul Elschot <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, 29 January, 2008 8:00:56 AM
Subject: Re: Lucene to index OCR text

On Tuesday 29 January 2008 03:32:08 Daniel Noll wrote:
> On Friday 25 January 2008 19:26:44 Paul Elschot wrote:
> > There is no way to do exact phrase matching on OCR data, because no
> > correction of OCR data will be perfect. Otherwise the OCR would have made
> > the correction...
>
> <snip suggestion to use fuzzy query>
>
> The problem I see with a fuzzy query is that if you have the fuzziness set to
> 1, then "fat" will match "mat". But in reality, "f" and "m" don't get
> confused with OCR.
>
> What you really want is for a given term to expand to a boolean query of all
> possible misidentified alternatives. For that you would first need to figure
> out which characters are often misidentified as others, which can probably be
> achieved by going over a certain number of documents and manually checking
> which letters are wrong.
>
> This should provide slightly more comprehensive matching without matching
> terms which are obviously different to the naked eye.

It's also possible to select the fuzzy terms by their document frequency, and
reject all that have a (quite a bit) higher doc frequency than the given term.
Combined with query proximity to another similarly queried term, this can work
reasonably well.

For query search performance, selecting only low-frequency terms is nice, as it
avoids searching for high-frequency terms.

Btw, this use of a worse spelling is more or less the opposite of suggesting a
better spelling from terms with a higher doc frequency.

> What would be ideal is if an analyser could do this job (a "looks like"
> analyser, like how SoundEx is a "sounds like" analyser.) But I get the
> feeling that this would be very difficult. Shame the OCR software can't
> store this information, e.g. "80% odds that this character is a t but 20%
> odds that it's an f." If you had that for every character it would be very
> useful...

Ah yes, the ideal world. Is there OCR software that provides such details?

Regards,
Paul Elschot
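A rough Java sketch of the two ideas above, not taken from the original thread: expand each query term through a hand-built OCR confusion map into a BooleanQuery, and drop alternatives whose document frequency is much higher than the original term's. The confusion pairs, the field name, and the docFreq threshold are illustrative assumptions only; real pairs would come from manually checking a sample of OCRed documents, as Daniel suggests.

import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class OcrTermExpander {

  // Illustrative single-character confusion pairs only; not measured data.
  private static final Map<Character, char[]> CONFUSIONS = Map.of(
      'l', new char[] {'1', 'i'},
      'i', new char[] {'l'},
      'o', new char[] {'0'},
      'c', new char[] {'e'},
      'e', new char[] {'c'});

  private final IndexReader reader;
  private final String field; // assumed field name, e.g. "text"

  public OcrTermExpander(IndexReader reader, String field) {
    this.reader = reader;
    this.field = field;
  }

  // Expand a query term into a BooleanQuery over single-character
  // substitutions from the confusion map, keeping only alternatives whose
  // docFreq is not much higher than the original term's.
  public Query expand(String word) throws IOException {
    int baseDf = reader.docFreq(new Term(field, word));

    Set<String> variants = new LinkedHashSet<>();
    variants.add(word);
    for (int i = 0; i < word.length(); i++) {
      char[] subs = CONFUSIONS.get(word.charAt(i));
      if (subs == null) continue;
      for (char sub : subs) {
        variants.add(word.substring(0, i) + sub + word.substring(i + 1));
      }
    }

    BooleanQuery.Builder bq = new BooleanQuery.Builder();
    for (String variant : variants) {
      int df = reader.docFreq(new Term(field, variant));
      if (df == 0) continue;                      // variant not in the index at all
      if (df > 4 * Math.max(baseDf, 1)) continue; // "quite a bit" more frequent: reject (factor is arbitrary)
      bq.add(new TermQuery(new Term(field, variant)), BooleanClause.Occur.SHOULD);
    }
    return bq.build();
  }
}

Each expanded term could then be combined with its neighbours in a proximity query, along the lines Paul describes.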