Dear all,

Some days ago I replied to a thread on this list asking for ways to improve the OCR performance, saying that we were dealing with some problems in our local installation to make it work, but that I would report my findings to the list. That is what this email is for.
I'm not intending to explain the exact mechanism that the text extraction uses, but to share my findings on this topic. Some of my impressions may therefore be incorrect or inaccurate, and I invite everybody to correct any mistake they spot.

The first, basic problem is the policy that the text extraction follows:

1. tesseract (the OCR engine) is run (*with no parameters*) on a certain picture.

2. The extracted words are analyzed, and only those containing *alphanumeric* characters only are kept; the rest are discarded.
-- The notion of "alphanumeric" is key here. The current implementation uses the official Java notion of alphanumeric, i.e. the set [a-zA-Z0-9_], which corresponds to the English alphabet (upper and lower case), the digits and (surprisingly enough) the underscore.

3. The filtered words are matched against those contained in the DICTIONARY table in the database.
-- The contents of this table are obtained from the .csv files the user drops in the folder etc/dictionaries, which basically consist of a list of words, the language each word belongs to, and a relative weight indicating how frequent each word is with respect to the others.

4. The language of the matching words is analyzed. The most frequent language is assumed to be the language used in the picture, and only the words in that language are considered valid.
-- For instance, if 75% of the matched words are English and the rest are, say, Spanish and German, the text is assumed to be English, and the remaining 25% of non-English words are considered misdetections and are therefore discarded.

I think this procedure is incorrect in several ways:

1. Tesseract relies heavily on the language of the pictures it processes.
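(Before elaborating: the filtering and voting in steps 2-4 above can be reduced to a few lines of Java. This is only a sketch of the policy as I understand it, with a hypothetical in-memory map standing in for the real DICTIONARY table, not the actual Matterhorn code:)

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OcrPolicySketch {

    /** Steps 2-4 of the current policy: keep "alphanumeric" words only,
     *  vote on the language, keep only the majority language's words. */
    static List<String> filterByMajorityLanguage(String ocrText,
                                                 Map<String, String> dictionary) {
        // Step 2: split on \W and keep words made of [a-zA-Z0-9_] only
        List<String> words = new ArrayList<>();
        for (String w : ocrText.split("\\W+")) {
            if (w.matches("\\w+")) {
                words.add(w);
            }
        }
        // Step 3: match against the dictionary; step 4: count per language
        Map<String, Integer> votes = new HashMap<>();
        for (String w : words) {
            String lang = dictionary.get(w);
            if (lang != null) {
                votes.merge(lang, 1, Integer::sum);
            }
        }
        if (votes.isEmpty()) {
            return Collections.emptyList();
        }
        // Keep only the words belonging to the most-voted language
        String majority = Collections.max(votes.entrySet(),
                Map.Entry.comparingByValue()).getKey();
        List<String> kept = new ArrayList<>();
        for (String w : words) {
            if (majority.equals(dictionary.get(w))) {
                kept.add(w);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary contents: word -> language
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("the", "eng");
        dictionary.put("lecture", "eng");
        dictionary.put("slides", "eng");
        dictionary.put("casa", "spa");

        // "casa" is a perfectly valid Spanish word, but English wins the
        // vote, so it is discarded -- exactly the behaviour of step 4
        System.out.println(filterByMajorityLanguage("the lecture slides casa",
                dictionary));
        // -> [the, lecture, slides]
    }
}
```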
In fact, the engine has to be trained with pictures containing words in a certain language, to finally obtain a set of configuration files *exclusive to that language*, which determine how the engine decides that a certain region of the picture contains a certain character. Needless to say, tesseract won't detect characters which are not present in the language being detected. If no language is explicitly specified, tesseract assumes English by default, which means that, for instance, it won't detect the accented characters in Spanish words such as "camión" or in my own name, "Rubén".

Because of this, the whole language detection system is flawed. If we wanted to implement such a thing correctly, we would have to run tesseract once for every language configured in the system, filter each result against the dictionary, and see which language produced the most matches -- it definitely cannot be done in one go, as it is done now.

There is a "language" parameter in the MediaPackage, but it is completely ignored in the process. This is understandable, since no syntax has been defined for specifying the MediaPackage language among the many possibilities (the native name -"español"-, the English name -"Spanish"-, the ISO-639-3 code -"spa"-, a "culture code" -"es-ES"-, etc.). Perhaps this "language" metadata should be taken into account when running the OCR. It won't, however, correctly detect multiple languages in the same presentation, as the current implementation unsuccessfully intended to do.

2. Assuming tesseract is configured to detect a language other than English, some implicit assumptions made while processing the returned text can completely spoil the results. When the text is divided into words, the current implementation considers word boundaries to be "[\\W]", which in Java means "non-alphanumeric characters", which in turn means "everything not in the set [a-zA-Z0-9_]". This may hold true for English, but it is completely wrong for other languages.
For instance, Spanish words like "caña", "automático" or "cigüeña" will be incorrectly split into "ca a", "autom tico" and "cig e a". The German "ß" and the vowels with umlaut (ä ü ö) will suffer the same problem, and I wonder whether or how the German-speaking installations have solved it. Other languages have an even wider set of characters. We have solved this problem in our installation by replacing the occurrences of [\\W] in the code with [^a-zA-Z0-9áéíóúÁÉÍÓÚüÜñÑ] in some cases and with [\\s] (whitespace) in others. However, it is evident that a more general solution needs some configuration values indicating which characters are valid in a certain language, or which ones can be considered word boundaries. In any case, it cannot be hard-coded.

3. We created our own Spanish dictionary using Wikipedia, as described in the wiki, because the Spanish dictionary in the trunk is rather short. However, we didn't properly "clean" the contents, so non-Spanish words and meaningless garbage appeared among the dictionary entries, too. Again, we had to filter out all the entries containing non-Spanish characters, and we also applied other restrictions (the longest word in Spanish is said to be 33 letters long, so all words longer than that were filtered out; words with two accented vowels were discarded; etc.).

I guess the bigger problem here is obtaining a good, comprehensive dictionary, so that correct words don't get filtered out just because they don't appear in it. This is especially critical in lectures, where very specific terms appear which are difficult to find in ordinary language and which are most often the key words of the lecture. Finding dictionaries or listings of highly technical words is, I guess, not difficult for most languages, but the real problem is assigning them a "weight" relative to the rest of the words.

I'm sorry for the length of the mail.
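(The word-splitting problem from point 2 can be reproduced in isolation -- this is a standalone demonstration, not the actual Matterhorn code -- by comparing Java's default \W boundary with a Unicode-aware one:)

```java
import java.util.Arrays;

public class WordBoundaryDemo {
    public static void main(String[] args) {
        String text = "caña automático cigüeña";

        // Current behaviour: \W defaults to [^a-zA-Z0-9_], so every
        // accented letter is treated as a word boundary
        System.out.println(Arrays.toString(text.split("\\W+")));
        // -> [ca, a, autom, tico, cig, e, a]

        // Unicode-aware alternative: anything that is not a letter
        // or a decimal digit is a boundary
        System.out.println(Arrays.toString(text.split("[^\\p{L}\\p{Nd}]+")));
        // -> [caña, automático, cigüeña]
    }
}
```

Java also accepts the embedded flag (?U) (UNICODE_CHARACTER_CLASS, since Java 7), so text.split("(?U)\\W+") gives the same Unicode-aware result without listing characters by hand -- though, as argued above, which characters count as word constituents should ultimately be configurable per language rather than hard-coded.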
For those brave souls who have arrived here alive: I would like to hear your comments and opinions about the way to *really* internationalize the text extraction and fix these problems, especially 1 and 2.

Best regards,
Rubén
_______________________________________________
Matterhorn mailing list
[email protected]
http://lists.opencastproject.org/mailman/listinfo/matterhorn
To unsubscribe please email
[email protected]
_______________________________________________
