Re: Lucene to index OCR text

Daniel Noll Mon, 28 Jan 2008 18:34:41 -0800

On Friday 25 January 2008 19:26:44 Paul Elschot wrote:
> There is no way to do exact phrase matching on OCR data, because no
> correction of OCR data will be perfect. Otherwise the OCR would have made
> the correction...
> <snip suggestion to use fuzzy query>


The problem I see with a fuzzy query is that if you have the fuzziness set to 
1, then "fat" will match "mat".  But in reality, "f" and "m" don't get 
confused with OCR.

What you really want is for a given term to expand to a boolean query of all 
possible misidentified alternatives.  For that you would first need to figure 
out which characters are often misidentified as others, which can probably be 
achieved by going over a certain number of documents and manually checking 
which letters are wrong.

This should provide slightly more comprehensive matching without matching 
terms which are obviously different to the naked eye.

What would be ideal is if an analyser could do this job (a "looks like" 
analyser, like how SoundEx is a "sounds like" analyser.)  But I get the 
feeling that this would be very difficult.  Shame the OCR software can't 
store this information, e.g. "80% odds that this character is a t but 20% 
odds that it's an f."  If you had that for every character it would be very 
useful...

Daniel

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Lucene to index OCR text

Reply via email to