In our (very) small project (several thousand pages), we scan what we can scan (and type what is not scannable), and then have someone proofread the OCR'd material. Precision matters in our case, and this seemed to be the only way. One thought I had on your case: maybe there's an OCR library better than the one you're using, which could yield better results? Better yet, can you run several libraries on your scanned pages and then compare the results, using a dictionary to decide between the candidates wherever the libraries return different readings, and marking the word for human review if the differences are too big?
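A rough sketch of that comparison, to make the idea concrete. It assumes the output of each OCR engine has already been tokenized and aligned word by word (in practice that alignment is a problem of its own), and that the dictionary is a plain set of lowercase words. The names `reconcile`, `similarity`, and `min_similarity` are made up for illustration and do not come from any OCR library.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def reconcile(candidates, dictionary, min_similarity=0.6):
    """Pick one reading per word position, or flag it for human review.

    candidates: list of token lists, one per OCR engine, assumed to be
                aligned (same length, same word positions).
    dictionary: set of known lowercase words.
    Returns a list of (word, needs_review) tuples.
    """
    result = []
    for readings in zip(*candidates):
        counts = Counter(readings)
        best, votes = counts.most_common(1)[0]
        if votes == len(readings):
            # All engines agree: accept the word as is.
            result.append((best, False))
            continue
        # Engines disagree: look for readings that are dictionary words.
        known = [w for w in counts if w.lower() in dictionary]
        # Exactly one dictionary word wins; otherwise fall back to the majority.
        choice = known[0] if len(known) == 1 else best
        # Flag the word if the dictionary could not settle it, or if some
        # reading differs too much from the chosen one.
        too_different = any(similarity(choice, w) < min_similarity for w in counts)
        result.append((choice, len(known) != 1 or too_different))
    return result


engine_a = ["the", "qnick", "brown", "fox"]
engine_b = ["the", "quick", "brovvn", "fox"]
engine_c = ["the", "quick", "brown", "f0x"]
words = {"the", "quick", "brown", "fox"}

for word, needs_review in reconcile([engine_a, engine_b, engine_c], words):
    print(word, "REVIEW" if needs_review else "ok")
```

The threshold of 0.6 is arbitrary; the right value depends on how much proofreading time is available versus how many errors can be tolerated.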
Itamar.

-----Original Message-----
From: Paul Elschot [mailto:[EMAIL PROTECTED]]
Sent: Friday, January 25, 2008 10:27 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene to index OCR text

On Friday 25 January 2008 03:46:23, Kyle Maxwell wrote:
> > I've been poking around the list archives and didn't really come up
> > against anything interesting. Anyone using Lucene to index OCR text?
> > Any strategies/algorithms/packages you recommend?
> >
> > I have a large collection (10^7 docs) that's mostly the result of
> > OCR. We index/search/etc. with Lucene without any trouble, but OCR
> > errors are a problem, when doing exact phrase matches in particular.
> > I'm looking for ideas on how to deal with this thorny problem.
>
> How about letter-by-letter ngrams coupled with SpanQueries (or more
> likely, a custom query utilizing the TermPositions iterator)?
>

There is no way to do exact phrase matching on OCR data, because no correction of OCR data will be perfect. Otherwise the OCR would have made the correction...

What you'll need is something like a fuzzy query as the leaves of a phrase query. Also, there may be missing word boundaries, and in that case you'll have to use a truncation query instead of a phrase query.

The more fuzziness introduced in the query, the higher the chance of false matches, so there really is no single answer to this. It depends on how many false matches the users will accept and on how many OCR errors there are.

One could start by adding some fuzzy term matching to the phrase query, and see what the users think of that. They will lose some performance, and that is another factor in the fuzziness tradeoff.

SpanQueries could be used too; for these a fuzzy term match would need to be added, as well as a query parser.

If adding fuzzy term matching to a phrase query looks a bit daunting, have a look at the surround query parser in the contrib area.
It has truncation and proximity based on span queries, but no fuzzy term matching, so it could also be a start for investigating.

It all depends on how good the OCR was, but in some cases (think old paper) it's just not possible to do good OCR.

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]