On 1/3/2012 3:49 PM, Laurence Penney wrote:
> Great stuff. It looks useful indeed.
>
> I see in the Thoreau that there are numerous cases where ‘ll’ is
> mistaken for ‘U’. It would be splendid if, after just a few of these
> were fixed manually, something could suggest performing numerous
> other replacements — particularly cases where ‘ll’ was already a
> candidate for the OCR of that word-part. Is this something that Abbyy
> can be induced to do?
Well yes, sort of, but no. Oversimplifying greatly (so no nit-picking please), artificial intelligence can be broken down into two general categories: expert systems and machine learning. Expert systems attempt to engineer a knowledge base of factual assertions, then build an inference engine that draws conclusions from a set of observed facts. Expert systems are highly deterministic and tend to be inflexible. Machine learning (epitomized by "neural networks") starts with little or no factual knowledge, but relies on "observations," guesses, and feedback to develop relationships. IBM's Watson is an example of machine learning.

Abbyy has a "learning" mode: if you correct misinterpretations, it will slowly learn, over time, to make that kind of error less and less often. And while I am a partisan of "strong AI," I do believe that one of the major drawbacks of machine learning is the same as the drawback of human learning: it takes a lot of time and the attention of dedicated teachers. It appears to me that the principals at Internet Archive are far too impatient to dedicate the kind of time it would take to train the Abbyy recognition engine in any significant way. They do not want to produce thousands of high-quality digitizations over the course of a couple of years; they want millions of digitizations TODAY, regardless of quality. There is a clear preference for quantity over quality.

This does not mean I think we are doomed to deal with dreck, merely that we cannot rely on Abbyy (or IA) to solve the problem for us. As you have probably figured out, I believe the first step in the solution is to come up with a way to derive truly useful (although by no means perfect) files from the best Abbyy files that IA has produced (*_abbyy.xml).
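To make the idea concrete, here is a minimal sketch of pulling word-level annotations out of an *_abbyy.xml file and turning them into a crude quality score. The element and attribute names (charParams, wordFromDictionary, suspicious) are my reading of the Abbyy FineReader XML schema, and the whitespace-based word segmentation is a simplification; treat the whole thing as an assumption to be checked against a real file.

```python
# Sketch of a quality scorer for IA's *_abbyy.xml output.
# ASSUMPTION: each recognized character is a <charParams> element
# carrying word-level attributes wordFromDictionary="true"/"false"
# and suspicious="1"; real files are namespaced and nested inside
# page/block/line elements, which iter() plus a tag-suffix match
# tolerates. Word boundaries are approximated by whitespace chars.
import xml.etree.ElementTree as ET

def score_abbyy(xml_text):
    """Return (word_count, fraction_not_in_dictionary, fraction_suspicious)."""
    words = []   # one [in_dictionary, suspicious] pair per word
    cur = None
    for elem in ET.fromstring(xml_text).iter():
        if not elem.tag.endswith('charParams'):   # ignore any namespace
            continue
        ch = elem.text or ''
        if not ch.strip():                        # whitespace ends a word
            if cur:
                words.append(cur)
            cur = None
            continue
        if cur is None:                           # first char of a word
            cur = [elem.get('wordFromDictionary') == 'true', False]
        if elem.get('suspicious') == '1':
            cur[1] = True
    if cur:
        words.append(cur)
    total = len(words)
    if not total:
        return 0, 0.0, 0.0
    not_in_dict = sum(1 for ok, _ in words if not ok)
    suspect = sum(1 for _, s in words if s)
    return total, not_in_dict / total, suspect / total

# Tiny synthetic fragment standing in for a real (much larger) file:
sample = (
    '<page>'
    '<charParams wordFromDictionary="true">t</charParams>'
    '<charParams>h</charParams>'
    '<charParams>e</charParams>'
    '<charParams> </charParams>'
    '<charParams wordFromDictionary="false" suspicious="1">U</charParams>'
    '<charParams>p</charParams>'
    '</page>'
)
```

A score derived this way would only rank texts relative to one another, which is all the triage step needs.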
I think your notion of grading IA's work product by the quality of the recognized text is a good one, as it would allow us to start with the best texts and perhaps identify others for rescan. The tool I'm developing identifies each word as determined by Abbyy, and retains Abbyy's annotations as to whether the word was found in the dictionary, whether it is the result of a suspicious recognition, or both. Perhaps the tool could generate an initial automated score by simply calculating the percentage of words which were not found in the dictionary and the percentage which are uncertain/suspicious. (The existence of en-dashes and em-dashes in the text throws off Abbyy's analysis, so a secondary dictionary check will be required.)

The second step will be to develop automated methods to improve (although by no means perfect) the derived files. This step will no doubt resemble an expert system, where the applicability of rules is determined by an inference engine. The third step will be to develop automated tools that can be operated under the guidance of human beings to further refine the improved files. Lastly, the refined files will be passed to human proofreaders who will ensure that the final product is as close to accurate as is humanly possible. Steps one and two can be fully automated, so they are repeatable even if not necessarily predictable. As you pointed out in your most recent post, once humans have intervened we should return to the automated processes only in the most dire circumstances, as human improvements are vastly more valuable than automated ones.

The problem you noted with ll being mis-recognized as U (as well as ii, il, and li) is a good candidate for our expert system. I would approach this problem (for English-language texts) as follows: Search the text portion of the file for any word that contains an upper-case letter in any position other than the first.
Flag that word as suspicious (this characterization needs to follow the word from this point on, even after correction, until it is removed by a human being). If the capital letter is a 'U', try the different combinations ii, il, li, and ll, capturing those candidates which appear in our dictionary. If more than one candidate matches a dictionary word, select the one that appears most commonly in works of the corresponding era. I'm sure you realize that for this last decision to be made, collections of word frequency in different periods need to be created. I seem to recall a report on NPR last spring/summer about researchers (Harvard???) who have built these very kinds of lists from the Google Books archive.

A common mistake for Abbyy is to recognize 'rn' as 'm' (and while the reverse should also occur, I've never actually seen it in practice). Sometimes the initial word is correct but anachronistic; an example being "modem" (short for modulator/demodulator) and "modern." These kinds of errors could be discovered by running an automated spell check with a period dictionary rather than a modern one. For the public domain works at IA, I would suggest using the 1913 Webster's dictionary that is commonly available on the internet. Words which satisfy Abbyy's dictionary but which cannot be found in Webster's ought to be flagged as suspicious.

Another automated check would be to run every word that fails the Abbyy spell check through a second test that scans failing words for a lower-case 'm', replaces the 'm' with 'rn', and checks /that/ word against the dictionary; if the second word is a dictionary word, replace the first with the second, flagging it as suspicious for later validation. Examples of this kind of error are "bom" produced for "born", or "com" for "corn." Yet more tests could be developed if we had a dictionary that contained not only words but their parts of speech.
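The substitution checks above share one shape: swap a confusable character for its plausible originals and keep whichever results are dictionary words. A sketch, where the tiny WORDS set stands in for a real period dictionary (such as the 1913 Webster's) and the confusion table is illustrative, not exhaustive:

```python
# Sketch of the OCR-confusion substitution check described above.
# WORDS is a placeholder for a real dictionary; CONFUSIONS maps a
# mis-recognized character to the character pairs it may stand for.
WORDS = {'all', 'tall', 'ball', 'born', 'corn', 'modern'}

CONFUSIONS = {
    'U': ['ll', 'ii', 'il', 'li'],   # ll/ii/il/li read as 'U'
    'm': ['rn'],                     # rn read as 'm'
}

def candidates(word):
    """Return dictionary words reachable by one confusion substitution."""
    found = []
    for i, ch in enumerate(word):
        for repl in CONFUSIONS.get(ch, []):
            fixed = word[:i] + repl + word[i + 1:]
            if fixed.lower() in WORDS and fixed not in found:
                found.append(fixed)
    return found
```

When candidates() returns more than one dictionary word, the era-appropriate frequency lists mentioned above (the Google Books-derived counts) would supply the tie-break; any replacement made this way should carry the suspicious flag forward for human validation.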
I would not feel comfortable replacing "die" with "the" no matter how often it seems to occur, but I /would/ feel comfortable replacing "die <noun>" with "the <noun>." (Again, recording for posterity the fact that the automated change was made.) I'm sure there are hundreds, if not thousands, of rules like these that could be devised (dozens could be derived from _Walden_ alone). I would welcome, and encourage, any effort that could lead to an inference engine providing fully automated improvement of IA files once we are able to obtain useful output. Suggestions of software tools to be operated under the supervision of human users would also be useful. Could IA or Open "Library" provide some sort of wiki-like suggestion box where these ideas could be collected for the future?

_______________________________________________
Ol-discuss mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
To unsubscribe from this mailing list, send email to [email protected]
