Dear all,

I'm resurfacing this email to point out that there is a newly created ticket in JIRA to keep track of whatever tasks or discussions are undertaken to improve the i18n in the dictionary and OCR services.
The ticket can be found here: http://opencast.jira.com/browse/MH-8918

Best regards
Rubén

2012/4/27 Rubén Pérez <[email protected]>

> Dear all,
>
> Some days ago I replied to a thread on this list that asked for ways to improve the OCR performance, saying that we were dealing with some problems getting it to work in our local installation, but that I would report my findings to the list. This is what this email is for.
>
> I do not intend to explain the exact mechanism that the text extraction uses, but to share my findings on this topic. Some of my impressions may therefore be incorrect or inaccurate, and I invite everybody to correct any mistake they spot.
>
> The first, basic problem is the policy that the text extraction follows:
>
> 1. tesseract (the OCR engine) is run (*with no parameters*) on a certain picture.
> 2. The extracted words are analyzed and only those containing *alphanumeric* characters only are kept. The rest are discarded.
>    -- The notion of "alphanumeric" is key here. The current implementation uses the official Java notion of "alphanumeric", i.e. the set [a-zA-Z0-9_], which corresponds to the English alphabet (upper and lower case), the digits and (surprisingly enough) the underscore.
> 3. The filtered words are matched against those contained in the DICTIONARY table in the database.
>    -- The contents of this table are obtained from the .csv files the user drops in the folder etc/dictionaries, which basically consist of a list of words, the language each word belongs to, and a relative weight indicating how frequent each word is with respect to the others.
> 4. The language of the matching words is analysed. The most frequent language is assumed to be the language used in the picture, and only the words corresponding to that language are considered valid.
>    -- For instance, if 75% of the matched words are English and the rest are, say, Spanish and German, the text is assumed to be English, and the remaining 25% of non-English words are considered misdetected and, therefore, erroneous.
>
> I think this procedure is incorrect in several ways:
>
> 1. Tesseract relies heavily on the language of the pictures it processes. In fact, the engine has to be trained with pictures containing words of a certain language, to finally obtain a set of configuration files *exclusive to that language*, which determine how the engine decides that a certain region of the picture contains a certain character. Needless to say, tesseract won't detect characters which are not present in the language being detected. If no language is explicitly specified, Tesseract assumes English by default, which means that, for instance, it won't detect words containing Spanish characters, such as "camión" or my own name, "Rubén".
>    Because of this, the whole language detection system is flawed. If we wanted to implement such a thing correctly, we would have to run tesseract once for every language configured in the system, filter each result against the dictionary, and see which language produced the most matches; it definitely cannot be done in one go, as it is done now.
>    There is a "language" parameter in the MediaPackage, but it is completely ignored in the process. This is understandable, since no syntax has been defined for specifying the MediaPackage language among the many possibilities (the native name -"español"-, the English name -"Spanish"-, the ISO 639-3 code -"spa"-, a "culture code" -"es-ES"-, etc.). Perhaps this "language" metadata should be taken into account when running the OCR. It would not, however, correctly detect multiple languages in the same presentation, as the current implementation unsuccessfully attempts to do.
>
> 2.
Assuming tesseract is configured to detect a language other than English, some implicit assumptions made while processing the text it returns can completely spoil the results. When the text is divided into words, the current implementation considers word boundaries to be "[\\W]", which in Java means "non-word characters", which in turn means "everything not in the set [a-zA-Z0-9_]". This may hold true for English, but it is completely erroneous for other languages. For instance, Spanish words like "caña", "automático" or "cigüeña" will be incorrectly split into "ca a", "autom tico" and "cig e a". The German "ß" symbol or vowels with umlauts (ä ü ö) will have the same problems, and I wonder how, or whether, German users have solved them. Other languages have an even wider set of characters.
>    We have solved this problem in our installation by replacing the occurrences of [\\W] in the code with [^a-zA-Z0-9áéíóúÁÉÍÓÚüÜñÑ] in some cases, and with [\\s] (whitespace) in others. However, it is evident that a more general solution needs configuration values indicating which characters are valid in a certain language, or which ones can be considered word boundaries. In any case, it cannot be hard-coded.
>
> 3. We created our own Spanish dictionary using Wikipedia, as described in the wiki, because the Spanish dictionary in the trunk is rather short. However, we didn't properly "clean" the contents, so non-Spanish words and meaningless garbage appeared among the dictionary entries, too. Again, we had to filter out all the entries containing non-Spanish characters, and we also applied other restrictions (the longest word in Spanish is said to be 33 letters long, so all words longer than that were filtered out; words with two accented vowels were discarded; etc.).
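A minimal standalone Java sketch of the splitting problem described in point 2 (not the actual Matterhorn code): splitting on \W+ mangles accented words, while the Unicode-aware (?U) flag (Pattern.UNICODE_CHARACTER_CLASS, available since Java 7) keeps them whole:

```java
import java.util.Arrays;

public class WordSplitDemo {
    public static void main(String[] args) {
        String text = "caña automático cigüeña";

        // Default Java \W treats everything outside [a-zA-Z0-9_] as a
        // word boundary, so accented letters break the words apart.
        System.out.println(Arrays.toString(text.split("\\W+")));
        // [ca, a, autom, tico, cig, e, a]

        // With the (?U) flag, \w/\W follow Unicode character properties,
        // so accented Spanish (or German) words survive intact.
        System.out.println(Arrays.toString(text.split("(?U)\\W+")));
        // [caña, automático, cigüeña]
    }
}
```

This is one language-neutral alternative to hard-coding a character set like [^a-zA-Z0-9áéíóúÁÉÍÓÚüÜñÑ] per language.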
>    I guess the bigger problem here is obtaining a good, comprehensive dictionary, so that correct words don't get filtered out just because they don't appear in it. This is especially critical in lectures, where very specific terms appear which are hard to find in ordinary language, and which are most often the key words of the lecture. Finding dictionaries or lists of highly technical words is probably not difficult for most languages, but the real problem is assigning them a "weight" relative to the rest of the words.
>
> I'm sorry for the length of this mail. From those brave souls who have made it this far, I would like to hear comments and opinions about how to *really* internationalize the text extraction and fix these problems, especially 1 and 2.
>
> Best regards
> Rubén
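On the weighting question raised at the end of the quoted mail, one plausible approach (a hypothetical sketch, not Matterhorn's actual code) is to derive each word's relative weight from its raw frequency count in a corpus such as a Wikipedia dump, normalizing by the total count:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DictionaryWeights {

    // Turn raw corpus frequency counts into relative weights summing to 1.
    // A list of technical terms with no corpus counts could then be merged
    // in with a small fixed floor weight, so those key lecture words are
    // never filtered out merely for being rare.
    static Map<String, Double> relativeWeights(Map<String, Long> counts) {
        long total = counts.values().stream().mapToLong(Long::longValue).sum();
        Map<String, Double> weights = new LinkedHashMap<>();
        counts.forEach((word, count) -> weights.put(word, (double) count / total));
        return weights;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("camión", 30L);
        counts.put("cigüeña", 10L);
        counts.put("automático", 60L);
        System.out.println(relativeWeights(counts));
        // {camión=0.3, cigüeña=0.1, automático=0.6}
    }
}
```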
_______________________________________________
Matterhorn mailing list
[email protected]
http://lists.opencastproject.org/mailman/listinfo/matterhorn
To unsubscribe please email [email protected]
_______________________________________________
