Dear all,

Some days ago I replied to a thread on this list asking for ways to improve the OCR performance, saying that we were dealing with some problems in our local installation to make it work, but that I would report my findings to the list. That is what this email is for.
I'm not intending to explain the exact mechanism that the text extraction uses, but to share my findings on this topic. Some of my impressions may therefore be incorrect or inaccurate, and I invite everybody to correct any mistake they spot.

The first, basic problem is the policy that the text extraction follows:

1. tesseract (the OCR engine) is run (*with no parameters*) on a certain picture.

2. The extracted words are analyzed, and only those containing *alphanumeric* characters only are kept; the rest are discarded.
-- The notion of "alphanumeric" is key here. The current implementation uses the official Java notion of alphanumeric, i.e. the set [a-zA-Z0-9_], which corresponds to the English alphabet (upper and lower case), the digits and (surprisingly enough) the underscore.

3. The filtered words are matched against those contained in the DICTIONARY table in the database.
-- The contents of this table are obtained from the .csv files the user drops in the folder etc/dictionaries, which basically consist of a list of words, the language each word belongs to, and a relative weight indicating how frequent each word is with respect to the others.

4. The language of the matching words is analyzed. The most frequent language is assumed to be the language used in the picture, and only the words in that language are considered valid.
-- For instance, if 75% of the matched words are English and the rest are, say, Spanish and German, the text is assumed to be English, and the remaining 25% of non-English words are considered misdetections and are therefore discarded.

I think this procedure is incorrect in several ways:

1. Tesseract relies heavily on the language of the pictures it processes.
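(Before elaborating: the filtering and voting in steps 2-4 above can be reduced to a few lines of Java. This is only a sketch of the policy as I understand it, with a hypothetical in-memory map standing in for the real DICTIONARY table, not the actual Matterhorn code:)

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OcrPolicySketch {

    /** Steps 2-4 of the current policy: keep "alphanumeric" words only,
     *  vote on the language, keep only the majority language's words. */
    static List<String> filterByMajorityLanguage(String ocrText,
                                                 Map<String, String> dictionary) {
        // Step 2: split on \W and keep words made of [a-zA-Z0-9_] only
        List<String> words = new ArrayList<>();
        for (String w : ocrText.split("\\W+")) {
            if (w.matches("\\w+")) {
                words.add(w);
            }
        }
        // Step 3: match against the dictionary; step 4: count per language
        Map<String, Integer> votes = new HashMap<>();
        for (String w : words) {
            String lang = dictionary.get(w);
            if (lang != null) {
                votes.merge(lang, 1, Integer::sum);
            }
        }
        if (votes.isEmpty()) {
            return Collections.emptyList();
        }
        // Keep only the words belonging to the most-voted language
        String majority = Collections.max(votes.entrySet(),
                Map.Entry.comparingByValue()).getKey();
        List<String> kept = new ArrayList<>();
        for (String w : words) {
            if (majority.equals(dictionary.get(w))) {
                kept.add(w);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary contents: word -> language
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("the", "eng");
        dictionary.put("lecture", "eng");
        dictionary.put("slides", "eng");
        dictionary.put("casa", "spa");

        // "casa" is a perfectly valid Spanish word, but English wins the
        // vote, so it is discarded -- exactly the behaviour of step 4
        System.out.println(filterByMajorityLanguage("the lecture slides casa",
                dictionary));
        // -> [the, lecture, slides]
    }
}
```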
In fact, the engine has to be trained with pictures containing words in a certain language, to finally obtain a set of configuration files *exclusive to that language*, which determine how the engine decides that a certain region of the picture contains a certain character. Needless to say, tesseract won't detect characters which are not present in the language being detected. If no language is explicitly specified, tesseract assumes English by default, which means that, for instance, it won't detect the accented characters in Spanish words such as "camión" or in my own name, "Rubén".

Because of this, the whole language detection system is flawed. If we wanted to implement such a thing correctly, we would have to run tesseract once for every language configured in the system, filter each result against the dictionary, and see which language produced the most matches -- it definitely cannot be done in one go, as it is done now.

There is a "language" parameter in the MediaPackage, but it is completely ignored in the process. This is understandable, since no syntax has been defined for specifying the MediaPackage language among the many possibilities (the native name -"español"-, the English name -"Spanish"-, the ISO-639-3 code -"spa"-, a "culture code" -"es-ES"-, etc.). Perhaps this "language" metadata should be taken into account when running the OCR. It won't, however, correctly detect multiple languages in the same presentation, as the current implementation unsuccessfully intended to do.

2. Assuming tesseract is configured to detect a language other than English, some implicit assumptions made while processing the returned text can completely spoil the results. When the text is divided into words, the current implementation considers word boundaries to be "[\\W]", which in Java means "non-alphanumeric characters", which in turn means "everything not in the set [a-zA-Z0-9_]". This may hold true for English, but it is completely wrong for other languages.
For instance, Spanish words like "caña", "automático" or "cigüeña" will be incorrectly split into "ca a", "autom tico" and "cig e a". The German "ß" and the vowels with umlaut (ä ü ö) will suffer the same problem, and I wonder whether or how the German-speaking installations have solved it. Other languages have an even wider set of characters. We have solved this problem in our installation by replacing the occurrences of [\\W] in the code with [^a-zA-Z0-9áéíóúÁÉÍÓÚüÜñÑ] in some cases and with [\\s] (whitespace) in others. However, it is evident that a more general solution needs some configuration values indicating which characters are valid in a certain language, or which ones can be considered word boundaries. In any case, it cannot be hard-coded.

3. We created our own Spanish dictionary using Wikipedia, as described in the wiki, because the Spanish dictionary in the trunk is rather short. However, we didn't properly "clean" the contents, so non-Spanish words and meaningless garbage appeared among the dictionary entries, too. Again, we had to filter out all the entries containing non-Spanish characters, and we also applied other restrictions (the longest word in Spanish is said to be 33 letters long, so all words longer than that were filtered out; words with two accented vowels were discarded; etc.).

I guess the bigger problem here is obtaining a good, comprehensive dictionary, so that correct words don't get filtered out just because they don't appear in it. This is especially critical in lectures, where very specific terms appear which are difficult to find in ordinary language and which are most often the key words of the lecture. Finding dictionaries or listings of highly technical words is, I guess, not difficult for most languages, but the real problem is assigning them a "weight" relative to the rest of the words.

I'm sorry for the length of the mail.
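(The word-splitting problem from point 2 can be reproduced in isolation -- this is a standalone demonstration, not the actual Matterhorn code -- by comparing Java's default \W boundary with a Unicode-aware one:)

```java
import java.util.Arrays;

public class WordBoundaryDemo {
    public static void main(String[] args) {
        String text = "caña automático cigüeña";

        // Current behaviour: \W defaults to [^a-zA-Z0-9_], so every
        // accented letter is treated as a word boundary
        System.out.println(Arrays.toString(text.split("\\W+")));
        // -> [ca, a, autom, tico, cig, e, a]

        // Unicode-aware alternative: anything that is not a letter
        // or a decimal digit is a boundary
        System.out.println(Arrays.toString(text.split("[^\\p{L}\\p{Nd}]+")));
        // -> [caña, automático, cigüeña]
    }
}
```

Java also accepts the embedded flag (?U) (UNICODE_CHARACTER_CLASS, since Java 7), so text.split("(?U)\\W+") gives the same Unicode-aware result without listing characters by hand -- though, as argued above, which characters count as word constituents should ultimately be configurable per language rather than hard-coded.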
For those brave souls who have arrived here alive: I would like to hear your comments and opinions about the way to *really* internationalize the text extraction and fix these problems, especially 1 and 2.

Best regards,
Rubén
_______________________________________________
Matterhorn mailing list
[email protected]
http://lists.opencastproject.org/mailman/listinfo/matterhorn
To unsubscribe please email
[email protected]
_______________________________________________
