On Friday 25 January 2008 19:26:44 Paul Elschot wrote:
> There is no way to do exact phrase matching on OCR data, because no
> correction of OCR data will be perfect. Otherwise the OCR would have made
> the correction...
> <snip suggestion to use fuzzy query>

The problem I see with a fuzzy query is that if you have the fuzziness set to 1, then "fat" will match "mat". But in reality, OCR doesn't confuse "f" with "m".

What you really want is for a given term to expand to a boolean query of all its plausible misidentified alternatives. For that you would first need to figure out which characters are often misidentified as which others. That can probably be done by going over a sample of documents and manually checking which letters came out wrong. The result should give slightly more comprehensive matching, without matching terms which are obviously different to the naked eye.

What would be ideal is if an analyser could do this job (a "looks like" analyser, in the way that Soundex is a "sounds like" analyser). But I get the feeling that this would be very difficult.

Shame the OCR software can't store this information, e.g. "80% odds that this character is a t but 20% odds that it's an f." If you had that for every character it would be very useful...

Daniel
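
As a rough illustration of the term-expansion idea in the message above, here is a minimal sketch in Python. The confusion table, the function names (`expand_term`, `to_boolean_query`) and the field name `body` are all hypothetical. The character pairs are made up for the example, and a real table would have to come from checking actual OCR output. It only handles one-for-one character substitutions, and it emits a plain "a OR b" query string rather than building a query object.

```python
from itertools import product

# Hypothetical confusion table: each character maps to the characters an OCR
# engine might plausibly output in its place. These pairs are illustrative
# only; a real table would be built by manually checking sample documents.
CONFUSABLE = {
    "f": {"t"},
    "t": {"f"},
    "l": {"1", "i"},
    "i": {"l", "1"},
    "o": {"0"},
    "e": {"c"},
    "c": {"e"},
}


def expand_term(term, max_errors=1):
    """Return the term plus every variant with at most max_errors substitutions."""
    # For each position: the original character first, then its look-alikes.
    options = [[ch] + sorted(CONFUSABLE.get(ch, ())) for ch in term]
    variants = set()
    for combo in product(*options):
        errors = sum(1 for a, b in zip(term, combo) if a != b)
        if errors <= max_errors:
            variants.add("".join(combo))
    return variants


def to_boolean_query(field, term, max_errors=1):
    """Render the variants as an OR query string of field:term clauses."""
    clauses = [f"{field}:{v}" for v in sorted(expand_term(term, max_errors))]
    return " OR ".join(clauses)


if __name__ == "__main__":
    print(to_boolean_query("body", "fat"))
```

With this table, "fat" expands to "fat", "faf" and "tat", but never to "mat", which is the behaviour a plain fuzzy query cannot give. Note that the number of candidates grows quickly with term length, so `max_errors` should stay small.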