Lots of luck to you, because I haven't a clue. My company deals with OCR data and we haven't had a single workable idea. Of course, our data sets are minuscule compared to what you're talking about, so we haven't tried to heuristically clean up the data.
But given that Google is scanning the entire U of Mich library, there has to be an answer out there, though I wonder whether it's applicable to already-OCR'd data or only to the scanning itself.

There are, as you well know, two issues. First, are the words recognizable as actual English words? That's easily checkable via a dictionary, which doesn't help much since I've seen OCR that consists of English words that are total nonsense. Assuming you're scanning English texts. Assuming it's modern English... Second, particularly in our case, we have a very significant number of names to deal with, so a dictionary check is pretty useless.

We've squirmed out of the problem by having the tables of contents keyed in by hand and then providing our users with links to the OCR image of the scanned data. Since this is genealogy research, it at least gives them a way to verify what our searches return. But inevitably there are false hits as well as false misses.

I've considered creating a dictionary of non-English words on the assumption that there will be a finite number of misspellings. But this is OCR data, so the set of misspelled words could very well be bigger than the total number of words in the English language, depending on the condition of your source and how well the OCR is done. Again, though, our projects aren't large enough to justify a significant investment in even exploring this.

I suppose one could think about asking a dictionary program for suggestions, but I haven't a clue how useful that would be, especially for names or technical data.

The LDS church (The Church of Jesus Christ of Latter-day Saints) is doing something interesting that has the flavor of [EMAIL PROTECTED] They're getting volunteers to key in pages. Two different volunteers key in each page, then a comparison is done and the differences are arbitrated. (I've tacked a rough sketch of that compare step at the bottom of this mail, below your quoted message.)

As you can tell, I have nothing really useful to suggest on the scale you're talking about. 10^7 is a LOT of documents. But I'd also be very interested in anything you come across, especially in the way of cleaning existing OCR'd data. Mostly, I'm expressing sympathy for the size and complexity of the task you're undertaking <G>.

Best
Erick

On Jan 24, 2008 8:43 PM, Renaud Waldura <[EMAIL PROTECTED]> wrote:
> I've been poking around the list archives and didn't really come up against
> anything interesting. Anyone using Lucene to index OCR text? Any
> strategies/algorithms/packages you recommend?
>
> I have a large collection (10^7 docs) that's mostly the result of OCR. We
> index/search/etc. with Lucene without any trouble, but OCR errors are a
> problem, when doing exact phrase matches in particular. I'm looking for
> ideas on how to deal with this thorny problem.
>
> --
> Renaud Waldura
> Applications Group Manager
> Library and Center for Knowledge Management
> University of California, San Francisco
> (415) 502-6660
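P.S. For what it's worth, here's a minimal sketch (plain Java, nothing Lucene-specific) of that double-key compare step. The sample page strings, the whitespace tokenization, and the "ARBITRATE" output are placeholders of my own, not anything the LDS project actually uses; a real system would also want a proper diff/alignment (LCS-style) rather than a position-by-position compare.

import java.util.Arrays;
import java.util.List;

public class DoubleKeyCompare {

    // Naive whitespace tokenization; a real system would normalize
    // punctuation and case before comparing.
    private static List<String> tokens(String page) {
        return Arrays.asList(page.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // Hypothetical transcriptions of the same page by two volunteers.
        String keyedByVolunteerA = "Smith, John b. 1843 Albany";
        String keyedByVolunteerB = "Smith, John b. 1848 Albany";

        List<String> a = tokens(keyedByVolunteerA);
        List<String> b = tokens(keyedByVolunteerB);

        // Position-by-position compare; any mismatch is flagged for a human arbiter.
        int n = Math.min(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            if (!a.get(i).equals(b.get(i))) {
                System.out.println("ARBITRATE token " + i + ": '"
                        + a.get(i) + "' vs '" + b.get(i) + "'");
            }
        }
        if (a.size() != b.size()) {
            System.out.println("ARBITRATE: transcriptions differ in length ("
                    + a.size() + " vs " + b.size() + " tokens)");
        }
    }
}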