RE: Lucene to index OCR text

Itamar Syn-Hershko Fri, 25 Jan 2008 01:05:47 -0800

In our (very) small project (several thousand pages), we scan what we
can scan (and type what is not scannable), and then have someone
proofread the OCR'd material. Precision matters in our case, and this seemed
to be the only way. One thought I had on your case: maybe there's an OCR
library better than the one you're using, which could yield better results?
Better yet, can you run several libraries on your scanned pages, then
compare the results, using a dictionary to decide between the differing
readings of each word, and mark the word for human review if the differences
are too big?
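
A rough sketch of that comparison step, in Python. It assumes the engines'
outputs have already been aligned word by word, which is a hard problem in
its own right and is not solved here. The names reconcile_word and
reconcile_page and the sample dictionary are made up for illustration and
do not belong to any OCR library.

```python
from collections import Counter


def reconcile_word(readings, dictionary):
    """Pick one reading for a word position; return (word, needs_review)."""
    distinct = sorted(set(readings))
    if len(distinct) == 1:
        # Every engine agrees, so accept the reading as-is.
        return distinct[0], False
    known = [w for w in distinct if w.lower() in dictionary]
    if len(known) == 1:
        # Exactly one differing reading is a dictionary word: take it.
        return known[0], False
    # No reading, or more than one, is a dictionary word: flag for a human.
    # The most common reading is returned only as a placeholder.
    placeholder = Counter(readings).most_common(1)[0][0]
    return placeholder, True


def reconcile_page(engine_outputs, dictionary):
    """engine_outputs: one token list per engine, all of equal length."""
    return [reconcile_word(list(column), dictionary)
            for column in zip(*engine_outputs)]


# Hypothetical outputs from three engines for the same four words.
outputs = [
    ["the", "quick", "brovvn", "f0x"],
    ["the", "qu1ck", "brown", "fox"],
    ["the", "quick", "brown", "fax"],
]
words = {"the", "quick", "brown", "fox", "fax"}
result = reconcile_page(outputs, words)
```

Here the last word is flagged, because both "fox" and "fax" are dictionary
words and the dictionary alone cannot choose between them.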

Itamar.

-----Original Message-----
From: Paul Elschot [mailto:[EMAIL PROTECTED] 
Sent: Friday, January 25, 2008 10:27 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene to index OCR text

On Friday 25 January 2008 03:46:23, Kyle Maxwell wrote:
> > I've been poking around the list archives and didn't really come up 
> > against anything interesting. Anyone using Lucene to index OCR text? 
> > Any strategies/algorithms/packages you recommend?
> >
> > I have a large collection (10^7 docs) that's mostly the result of 
> > OCR. We index/search/etc. with Lucene without any trouble, but OCR 
> > errors are a problem, when doing exact phrase matches in particular. 
> > I'm looking for ideas on how to deal with this thorny problem.
> 
> How about Letter-by-letter ngrams coupled with SpanQueries (or more 
> likely, a custom query utilizing the TermPositions iterator)?
> 
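
To make the letter n-gram suggestion quoted above concrete, here is a small
Python sketch that splits words into letter trigrams and compares them by
set overlap. It is independent of Lucene and says nothing about how the
n-grams would be indexed or combined with positions; the function names and
the "_" padding character are assumptions made for this example.

```python
def letter_ngrams(word, n=3):
    """Return the set of letter n-grams of word, padded with '_' at both ends."""
    padded = "_" + word.lower() + "_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of the two words' n-gram sets, between 0.0 and 1.0."""
    grams_a, grams_b = letter_ngrams(a, n), letter_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)


# An OCR-damaged form still shares some trigrams with the intended word,
# while an unrelated word shares none.
damaged = ngram_similarity("brown", "brovvn")
unrelated = ngram_similarity("brown", "green")
```

A single OCR error only disturbs the n-grams that overlap it, so the rest
of the word can still contribute to a match.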

There is no way to do exact phrase matching on OCR data, because no
correction of OCR data will be perfect. Otherwise the OCR would have made
the correction...

What you'll need is something like fuzzy queries as the leaves of a phrase
query.
Also, there may be missing word boundaries, and in that case you'll have to
use a truncation query instead of a phrase query.
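
As an illustration of fuzzy terms as the leaves of a phrase query, here is a
minimal Python sketch that works on a plain token list instead of a Lucene
index. It accepts a phrase when every query term is within max_edits edits
of the document token at the corresponding position. The names
edit_distance and fuzzy_phrase_positions and the max_edits parameter are
invented for this example, and missing word boundaries are not handled.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]


def fuzzy_phrase_positions(doc_tokens, phrase_terms, max_edits=1):
    """Return start positions where the phrase matches term by term, fuzzily."""
    n = len(phrase_terms)
    hits = []
    for start in range(len(doc_tokens) - n + 1):
        window = doc_tokens[start:start + n]
        if all(edit_distance(tok, term) <= max_edits
               for tok, term in zip(window, phrase_terms)):
            hits.append(start)
    return hits


# Tokens with two single-character OCR errors ("tbe", "brovn").
doc = "tbe quick brovn fox jumps".split()
fuzzy_hits = fuzzy_phrase_positions(doc, ["the", "quick", "brown"], max_edits=1)
exact_hits = fuzzy_phrase_positions(doc, ["the", "quick", "brown"], max_edits=0)
```

With one edit allowed per term the phrase is found at position 0; with no
edits allowed it is not found at all.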

The more fuzziness introduced in the query, the higher the chance of false
matches, so there really is no single answer to this. It depends on how many
false matches the users will accept and on how many OCR errors there are.
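
The tradeoff can be seen on a toy vocabulary. The Python sketch below counts
how many terms fall within a given edit distance of a query term; the
vocabulary and the function names are made up for the example, and a real
index will of course behave according to its own term distribution.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,
                           cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def terms_within(query, vocabulary, max_edits):
    """Return the vocabulary terms at most max_edits edits away from query."""
    return [t for t in vocabulary if edit_distance(query, t) <= max_edits]


# Hypothetical index vocabulary.
vocabulary = ["cat", "cut", "cot", "coat", "cart", "bat", "dog", "catalog"]

exact = terms_within("cat", vocabulary, 0)    # only the term itself
one_edit = terms_within("cat", vocabulary, 1)
three_edits = terms_within("cat", vocabulary, 3)
```

Allowing a single edit already expands "cat" from one term to six, most of
which are different real words and not OCR errors, and that is where the
false matches come from.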

One could start by adding some fuzzy term matching to a phrase query, and see
what the users think of that. They will lose some performance, and that is
another factor in the fuzziness tradeoff.

SpanQueries could be used too, but a fuzzy term match would need to be
added for them, as well as a query parser. If adding fuzzy term matching to a
phrase query looks a bit daunting, have a look at the surround query
parser in the contrib area. It has truncation and proximity based on span
queries, but no fuzzy term matching, so it could also be a starting point
for investigating.

It all depends on how good the OCR was, but in some cases (think old paper)
it's just not possible to do good OCR. 

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

