Op Tuesday 29 January 2008 03:32:08 schreef Daniel Noll: > On Friday 25 January 2008 19:26:44 Paul Elschot wrote: > > There is no way to do exact phrase matching on OCR data, because no > > correction of OCR data will be perfect. Otherwise the OCR would have made > > the correction... > > <snip suggestion to use fuzzy query> > > The problem I see with a fuzzy query is that if you have the fuzziness set to > 1, then "fat" will match "mat". But in reality, "f" and "m" don't get > confused with OCR. > > What you really want is for a given term to expand to a boolean query of all > possible misidentified alternatives. For that you would first need to figure > out which characters are often misidentified as others, which can probably be > achieved by going over a certain number of documents and manually checking > which letters are wrong. > > This should provide slightly more comprehensive matching without matching > terms which are obviously different to the naked eye.
It's also possible to select the fuzzy terms by their document frequency, and reject all that have a ((quite) a bit) higher doc frequency than the given term. Combined with a query proximity to another similarly queried term this can work reasonably well. For query search performance selecting only low frequency terms is nice, as it avoids searching for high frequency terms. Btw, this use of a worse spelling is more or less the opposite of suggesting a better spelling from terms with a higher doc frequency. > > What would be ideal is if an analyser could do this job (a "looks like" > analyser, like how SoundEx is a "sounds like" analyser.) But I get the > feeling that this would be very difficult. Shame the OCR software can't > store this information, e.g. "80% odds that this character is a t but 20% > odds that it's an f." If you had that for every character it would be very > useful... Ah yes, the ideal world. Is there OCR software that provides such details? Regards, Paul Elschot --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]