Lots of luck to you, because I haven't a clue. My company deals with
OCR data and we haven't had a single workable idea. Of course, our
data sets are minuscule compared to what you're talking about, so we
haven't tried to heuristically clean up the data.

But given that Google is scanning the entire U of Mich library, there has to be an answer out there. I just wonder whether it applies to already-OCR'd data or only to the scanning step itself.

There are, as you well know, two issues. First, are the words recognizable as actual English words? That's easy to check against a dictionary, but it doesn't help much, since I've seen OCR output that consists entirely of English words that are total nonsense. And that assumes you're scanning English texts. And that it's modern English...
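
For what it's worth, the check itself is trivial; something like this sketch is all I mean (the word-list path and the sample tokens are invented):

import java.io.*;
import java.util.*;

public class DictionaryCheck {
    public static void main(String[] args) throws IOException {
        // Load a plain word list (one word per line) into a set.
        // "/usr/share/dict/words" is just an example path.
        Set<String> dictionary = new HashSet<String>();
        BufferedReader words = new BufferedReader(new FileReader("/usr/share/dict/words"));
        for (String w; (w = words.readLine()) != null; ) {
            dictionary.add(w.toLowerCase());
        }
        words.close();

        // Flag OCR tokens that don't appear in the word list.
        String[] ocrTokens = { "genealogy", "rnodern", "Waldura" };
        for (String token : ocrTokens) {
            if (!dictionary.contains(token.toLowerCase())) {
                System.out.println("suspect token: " + token);
            }
        }
    }
}

All it tells you is that a token isn't a word; it can't tell you that a page full of real words is gibberish.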

Second, particularly in our case, we have a very significant number of names
to deal with. So a dictionary check is pretty useless.

We've squirmed out of the problem by having the tables of contents keyed in by hand and then providing our users with links to the OCR image of the scanned data. Since this is genealogy research, it at least gives them a way to verify what our searches return. But inevitably there are false hits as well as false misses.

I've considered creating a dictionary of non-English words on the assumption that there will be a finite number of misspellings. But this is OCR data, so the set of misspelled words could very well be bigger than the total number of words in the English language, depending on the condition of your source and how well the OCR is done. But, again, our projects aren't large enough to justify a significant investment in even exploring this.
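
If the variant set really were manageably finite, that "dictionary of non-English words" would amount to nothing more than a lookup table applied before indexing. A sketch of what I mean (the variants listed are invented examples of typical OCR confusions):

import java.util.*;

public class OcrVariantMap {
    // Hypothetical table of known OCR misreadings -> intended word.
    private static final Map<String, String> CORRECTIONS = new HashMap<String, String>();
    static {
        CORRECTIONS.put("rnodern", "modern");   // "rn" misread as "m"
        CORRECTIONS.put("libiary", "library");  // "r" misread as "i"
        CORRECTIONS.put("0hio", "ohio");        // zero misread as "O"
    }

    // Normalize one token; unknown tokens pass through unchanged.
    public static String normalize(String token) {
        String fixed = CORRECTIONS.get(token.toLowerCase());
        return fixed != null ? fixed : token;
    }

    public static void main(String[] args) {
        for (String t : new String[] { "rnodern", "0hio", "Waldura" }) {
            System.out.println(t + " -> " + normalize(t));
        }
    }
}

The rub, as I said, is that the table never stops growing.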

I suppose one could think about asking a dictionary program for suggestions, but I haven't a clue how useful that would be, especially for names or technical data.
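
If you did want to try it, Lucene's contrib spellchecker could play the role of the "dictionary program", and you can build its dictionary from the terms already in your own index, which at least gets your names into it. A rough sketch against the 2.x contrib API (the paths, the "contents" field name, and the sample token are all made up):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SuggestFromIndex {
    public static void main(String[] args) throws Exception {
        // Build a spelling index from the terms in an existing index field,
        // so personal and place names end up in the dictionary too.
        IndexReader reader = IndexReader.open("/data/ocr-index");
        Directory spellDir = FSDirectory.getDirectory("/data/spell-index");
        SpellChecker spell = new SpellChecker(spellDir);
        spell.indexDictionary(new LuceneDictionary(reader, "contents"));

        // Ask for the closest terms to a suspect OCR token.
        String[] suggestions = spell.suggestSimilar("genealcgy", 5);
        for (String s : suggestions) {
            System.out.println(s);
        }
        reader.close();
    }
}

Of course, if the index itself is full of OCR garbage, the garbage goes into the dictionary too, so I'm not sure how far that gets you.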

The LDS church (The Church of Jesus Christ of Latter-day Saints) is doing something interesting that has the flavor of [EMAIL PROTECTED] They're getting volunteers to key in pages. Two different volunteers key in each page, then a comparison is done and the differences are arbitrated.
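
The comparison step in that scheme is about as simple as it sounds; conceptually something like this (file names invented, and it assumes both keyers produced the same number of lines), with every mismatch going to a human arbitrator:

import java.io.*;

public class DoubleKeyCompare {
    public static void main(String[] args) throws IOException {
        BufferedReader a = new BufferedReader(new FileReader("keyer1-page017.txt"));
        BufferedReader b = new BufferedReader(new FileReader("keyer2-page017.txt"));
        String lineA, lineB;
        int lineNo = 0;
        // Walk both transcriptions in parallel and flag any line that differs.
        while ((lineA = a.readLine()) != null && (lineB = b.readLine()) != null) {
            lineNo++;
            if (!lineA.equals(lineB)) {
                System.out.println("line " + lineNo + " needs arbitration:");
                System.out.println("  keyer 1: " + lineA);
                System.out.println("  keyer 2: " + lineB);
            }
        }
        a.close();
        b.close();
    }
}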

As you can tell, I have nothing really useful to suggest on the scale you're talking about. 10^7 is a LOT of documents.

But I'd also be very interested in anything you come across, especially in the way of cleaning existing OCR'd data. Mostly, I'm expressing sympathy for the size and complexity of the task you're undertaking <G>.

Best
Erick


On Jan 24, 2008 8:43 PM, Renaud Waldura <[EMAIL PROTECTED]>
wrote:

> I've been poking around the list archives and didn't really come up
> against
> anything interesting. Anyone using Lucene to index OCR text? Any
> strategies/algorithms/packages you recommend?
>
> I have a large collection (10^7 docs) that's mostly the result of OCR. We
> index/search/etc. with Lucene without any trouble, but OCR errors are a
> problem, when doing exact phrase matches in particular. I'm looking for
> ideas on how to deal with this thorny problem.
>
> --
> Renaud Waldura
> Applications Group Manager
> Library and Center for Knowledge Management
> University of California, San Francisco
> (415) 502-6660
>
>
>
