On 1/3/2012 3:49 PM, Laurence Penney wrote:
> Great stuff. It looks useful indeed.
>
> I see in the Thoreau that there are numerous cases where ‘ll’ is
> mistaken for ‘U’. It would be splendid if, after just a few of these
> were fixed manually, something could suggest performing numerous
> other replacements — particularly cases where ‘ll’ was already a
> candidate for the OCR of that word-part. Is this something that Abbyy
> can be induced to do?
Well yes, sort of, but no. Oversimplifying greatly (so no nit-picking please), artificial intelligence can be broken down into two general categories: expert systems and machine learning. Expert systems attempt to engineer a knowledge base of factual assertions, then build an inference engine that draws conclusions from a set of observed facts. Expert systems are highly deterministic and tend to be inflexible. Machine learning (epitomized by "neural networks") starts with little or no factual knowledge, but relies on "observations," guesses, and feedback to develop relationships. IBM's Watson is an example of machine learning.

Abbyy has a "learning" mode: if you correct misinterpretations, it will slowly learn, over time, to make that kind of error less and less often. And while I am a partisan of "strong AI," I do believe that one of the major drawbacks of machine learning is the same as the drawback of human learning: it takes a lot of time and the attention of dedicated teachers. It appears to me that the principals at Internet Archive are far too impatient to dedicate the kind of time it would take to train the Abbyy recognition engine in any significant way. They do not want to produce thousands of high-quality digitizations over the course of a couple of years; they want millions of digitizations TODAY, regardless of quality. There is a clear preference for quantity over quality.

This does not mean I think we are doomed to deal with dreck, merely that we cannot rely on Abbyy (or IA) to solve the problem for us. As you have probably figured out, I believe the first step in the solution is to come up with a way to derive truly useful (although by no means perfect) files from the best Abbyy files that IA has produced (*_abbyy.xml).
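To make the idea concrete, here is a minimal sketch of pulling word-level annotations out of an *_abbyy.xml file and turning them into a crude quality score. The element and attribute names (charParams, wordFromDictionary, suspicious) are my reading of the Abbyy FineReader XML schema, and the whitespace-based word segmentation is a simplification; treat the whole thing as an assumption to be checked against a real file.

```python
# Sketch of a quality scorer for IA's *_abbyy.xml output.
# ASSUMPTION: each recognized character is a <charParams> element
# carrying word-level attributes wordFromDictionary="true"/"false"
# and suspicious="1"; real files are namespaced and nested inside
# page/block/line elements, which iter() plus a tag-suffix match
# tolerates. Word boundaries are approximated by whitespace chars.
import xml.etree.ElementTree as ET

def score_abbyy(xml_text):
    """Return (word_count, fraction_not_in_dictionary, fraction_suspicious)."""
    words = []   # one [in_dictionary, suspicious] pair per word
    cur = None
    for elem in ET.fromstring(xml_text).iter():
        if not elem.tag.endswith('charParams'):   # ignore any namespace
            continue
        ch = elem.text or ''
        if not ch.strip():                        # whitespace ends a word
            if cur:
                words.append(cur)
            cur = None
            continue
        if cur is None:                           # first char of a word
            cur = [elem.get('wordFromDictionary') == 'true', False]
        if elem.get('suspicious') == '1':
            cur[1] = True
    if cur:
        words.append(cur)
    total = len(words)
    if not total:
        return 0, 0.0, 0.0
    not_in_dict = sum(1 for ok, _ in words if not ok)
    suspect = sum(1 for _, s in words if s)
    return total, not_in_dict / total, suspect / total

# Tiny synthetic fragment standing in for a real (much larger) file:
sample = (
    '<page>'
    '<charParams wordFromDictionary="true">t</charParams>'
    '<charParams>h</charParams>'
    '<charParams>e</charParams>'
    '<charParams> </charParams>'
    '<charParams wordFromDictionary="false" suspicious="1">U</charParams>'
    '<charParams>p</charParams>'
    '</page>'
)
```

A score derived this way would only rank texts relative to one another, which is all the triage step needs.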
I think your notion of grading IA's work product by the quality of the recognized text is a good one, as it would allow us to start with the best texts and perhaps identify others for rescan. The tool I'm developing identifies each word as determined by Abbyy, and retains Abbyy's annotations as to whether the word was found in the dictionary, whether it is the result of a suspicious recognition, or both. Perhaps the tool could generate an initial automated score by simply calculating the percentage of words which were not found in the dictionary and the percentage which are uncertain/suspicious. (The existence of en-dashes and em-dashes in the text throws off Abbyy's analysis, so a secondary dictionary check will be required.)

The second step will be to develop automated methods to improve (although by no means perfect) the derived files. This step will no doubt resemble an expert system, where the applicability of rules is determined by an inference engine. The third step will be to develop automated tools that can be operated under the guidance of human beings to further refine the improved files. Lastly, the refined files will be passed to human proofreaders who will ensure that the final product is as close to accurate as is humanly possible. Steps one and two can be fully automated, so they are repeatable even if not necessarily predictable. As you pointed out in your most recent post, once humans have intervened we should return to the automated processes only in the most dire circumstances, as human improvements are vastly more valuable than automated ones.

The problem you noted with ll being mis-recognized as U (as well as ii, il, and li) is a good candidate for our expert system. I would approach this problem (for English-language texts) as follows: Search the text portion of the file for any word that contains an upper-case letter in any position other than the first.
Flag that word as suspicious (this characterization needs to follow the word from this point on, even after correction, until it is removed by a human being). If the capital letter is a 'U', try the different combinations ii, il, li, and ll, capturing those candidates which appear in our dictionary. If more than one candidate matches a dictionary word, select the one that appears most commonly in works of the corresponding era. I'm sure you realize that for this last decision to be made, collections of word frequency in different periods need to be created. I seem to recall a report on NPR last spring/summer about researchers (Harvard???) who have built these very kinds of lists from the Google Books archive.

A common mistake for Abbyy is to recognize 'rn' as 'm' (and while the reverse should also occur, I've never actually seen it in practice). Sometimes the initial word is correct but anachronistic; an example being "modem" (short for modulator/demodulator) and "modern." These kinds of errors could be discovered by running an automated spell check with a period dictionary rather than a modern one. For the public domain works at IA, I would suggest using the 1913 Webster's dictionary that is commonly available on the internet. Words which satisfy Abbyy's dictionary but which cannot be found in Webster's ought to be flagged as suspicious.

Another automated check would be to run every word that fails the Abbyy spell check through a second test that scans failing words for a lower-case 'm', replaces the 'm' with 'rn', and checks /that/ word against the dictionary; if the second word is a dictionary word, replace the first with the second, flagging it as suspicious for later validation. Examples of this kind of error are "bom" produced for "born", or "com" for "corn." Yet more tests could be developed if we had a dictionary that contained not only words but their parts of speech.
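The substitution checks above share one shape: swap a confusable character for its plausible originals and keep whichever results are dictionary words. A sketch, where the tiny WORDS set stands in for a real period dictionary (such as the 1913 Webster's) and the confusion table is illustrative, not exhaustive:

```python
# Sketch of the OCR-confusion substitution check described above.
# WORDS is a placeholder for a real dictionary; CONFUSIONS maps a
# mis-recognized character to the character pairs it may stand for.
WORDS = {'all', 'tall', 'ball', 'born', 'corn', 'modern'}

CONFUSIONS = {
    'U': ['ll', 'ii', 'il', 'li'],   # ll/ii/il/li read as 'U'
    'm': ['rn'],                     # rn read as 'm'
}

def candidates(word):
    """Return dictionary words reachable by one confusion substitution."""
    found = []
    for i, ch in enumerate(word):
        for repl in CONFUSIONS.get(ch, []):
            fixed = word[:i] + repl + word[i + 1:]
            if fixed.lower() in WORDS and fixed not in found:
                found.append(fixed)
    return found
```

When candidates() returns more than one dictionary word, the era-appropriate frequency lists mentioned above (the Google Books-derived counts) would supply the tie-break; any replacement made this way should carry the suspicious flag forward for human validation.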
I would not feel comfortable replacing "die" with "the" no matter how often it seems to occur, but I /would/ feel comfortable replacing "die <noun>" with "the <noun>." (Again, recording for posterity the fact that the automated change was made.) I'm sure there are hundreds, if not thousands, of rules like these that could be devised (dozens could be derived from _Walden_ alone). I would welcome, and encourage, any effort that could lead to an inference engine providing fully automated improvement of IA files once we are able to obtain useful output. Suggestions of software tools to be operated under the supervision of human users would also be useful. Could IA or Open "Library" provide some sort of wiki-like suggestion box where these ideas could be collected for the future?

_______________________________________________
Ol-discuss mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
To unsubscribe from this mailing list, send email to [email protected]
