Dear all,

I'm resurfacing this email to point out that there is a newly created ticket in JIRA to keep track of whatever tasks or discussions are undertaken to improve the i18n in the dictionary and OCR services.
The ticket can be found here: http://opencast.jira.com/browse/MH-8918

Best regards
Rubén

2012/4/27 Rubén Pérez <[email protected]>

> Dear all,
>
> Some days ago I replied to a thread on this list that asked for ways to improve the OCR performance, saying that we were dealing with some problems getting it to work in our local installation, but that I would report my findings to the list. This is what this email is for.
>
> I do not intend to explain the exact mechanism that the text extraction uses, but to share my findings on this topic. Some of my impressions may therefore be incorrect or inaccurate, and I invite everybody to correct any mistake they spot.
>
> The first, basic problem is the policy that the text extraction follows:
>
> 1. tesseract (the OCR engine) is run (*with no parameters*) on a certain picture.
> 2. The extracted words are analyzed and only those containing *alphanumeric* characters only are kept. The rest are discarded.
>    -- The notion of "alphanumeric" is key here. The current implementation uses the official Java notion of "alphanumeric", i.e. the set [a-zA-Z0-9_], which corresponds to the English alphabet (upper and lower case), the digits and (surprisingly enough) the underscore.
> 3. The filtered words are matched against those contained in the DICTIONARY table in the database.
>    -- The contents of this table are obtained from the .csv files the user drops in the folder etc/dictionaries, which basically consist of a list of words, the language each word belongs to, and a relative weight indicating how frequent each word is with respect to the others.
> 4. The language of the matching words is analysed. The most frequent language is assumed to be the language used in the picture, and only the words corresponding to that language are considered valid.
>    -- For instance, if 75% of the matched words are English and the rest are, say, Spanish and German, the text is assumed to be English, and the remaining 25% of non-English words are considered misdetected and, therefore, erroneous.
>
> I think this procedure is incorrect in several ways:
>
> 1. Tesseract relies heavily on the language of the pictures it processes. In fact, the engine has to be trained with pictures containing words of a certain language, to finally obtain a set of configuration files *exclusive to that language*, which determine how the engine decides that a certain region of the picture contains a certain character. Needless to say, tesseract won't detect characters which are not present in the language being detected. If no language is explicitly specified, Tesseract assumes English by default, which means that, for instance, it won't detect words containing Spanish characters, such as "camión" or my own name, "Rubén".
>    Because of this, the whole language detection system is flawed. If we wanted to implement such a thing correctly, we would have to run tesseract once for every language configured in the system, filter each result against the dictionary, and see which language produced the most matches; it definitely cannot be done in one go, as it is done now.
>    There is a "language" parameter in the MediaPackage, but it is completely ignored in the process. This is understandable, since no syntax has been defined for specifying the MediaPackage language among the many possibilities (the native name -"español"-, the English name -"Spanish"-, the ISO 639-3 code -"spa"-, a "culture code" -"es-ES"-, etc.). Perhaps this "language" metadata should be taken into account when running the OCR. It would not, however, correctly detect multiple languages in the same presentation, as the current implementation unsuccessfully attempts to do.
>
> 2.
Assuming tesseract is configured to detect a language other than English, some implicit assumptions made while processing the text it returns can completely spoil the results. When the text is divided into words, the current implementation considers word boundaries to be "[\\W]", which in Java means "non-word characters", which in turn means "everything not in the set [a-zA-Z0-9_]". This may hold true for English, but it is completely erroneous for other languages. For instance, Spanish words like "caña", "automático" or "cigüeña" will be incorrectly split into "ca a", "autom tico" and "cig e a". The German "ß" symbol or vowels with umlauts (ä ü ö) will have the same problems, and I wonder how, or whether, German users have solved them. Other languages have an even wider set of characters.
>    We have solved this problem in our installation by replacing the occurrences of [\\W] in the code with [^a-zA-Z0-9áéíóúÁÉÍÓÚüÜñÑ] in some cases, and with [\\s] (whitespace) in others. However, it is evident that a more general solution needs configuration values indicating which characters are valid in a certain language, or which ones can be considered word boundaries. In any case, it cannot be hard-coded.
>
> 3. We created our own Spanish dictionary using Wikipedia, as described in the wiki, because the Spanish dictionary in the trunk is rather short. However, we didn't properly "clean" the contents, so non-Spanish words and meaningless garbage appeared among the dictionary entries, too. Again, we had to filter out all the entries containing non-Spanish characters, and we also applied other restrictions (the longest word in Spanish is said to be 33 letters long, so all words longer than that were filtered out; words with two accented vowels were discarded; etc.).
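A minimal standalone Java sketch of the splitting problem described in point 2 (not the actual Matterhorn code): splitting on \W+ mangles accented words, while the Unicode-aware (?U) flag (Pattern.UNICODE_CHARACTER_CLASS, available since Java 7) keeps them whole:

```java
import java.util.Arrays;

public class WordSplitDemo {
    public static void main(String[] args) {
        String text = "caña automático cigüeña";

        // Default Java \W treats everything outside [a-zA-Z0-9_] as a
        // word boundary, so accented letters break the words apart.
        System.out.println(Arrays.toString(text.split("\\W+")));
        // [ca, a, autom, tico, cig, e, a]

        // With the (?U) flag, \w/\W follow Unicode character properties,
        // so accented Spanish (or German) words survive intact.
        System.out.println(Arrays.toString(text.split("(?U)\\W+")));
        // [caña, automático, cigüeña]
    }
}
```

This is one language-neutral alternative to hard-coding a character set like [^a-zA-Z0-9áéíóúÁÉÍÓÚüÜñÑ] per language.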
>    I guess the bigger problem here is obtaining a good, comprehensive dictionary, so that correct words don't get filtered out just because they don't appear in it. This is especially critical in lectures, where very specific terms appear which are hard to find in ordinary language, and which are most often the key words of the lecture. Finding dictionaries or lists of highly technical words is probably not difficult for most languages, but the real problem is assigning them a "weight" relative to the rest of the words.
>
> I'm sorry for the length of this mail. From those brave souls who have made it this far, I would like to hear comments and opinions about how to *really* internationalize the text extraction and fix these problems, especially 1 and 2.
>
> Best regards
> Rubén
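On the weighting question raised at the end of the quoted mail, one plausible approach (a hypothetical sketch, not Matterhorn's actual code) is to derive each word's relative weight from its raw frequency count in a corpus such as a Wikipedia dump, normalizing by the total count:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DictionaryWeights {

    // Turn raw corpus frequency counts into relative weights summing to 1.
    // A list of technical terms with no corpus counts could then be merged
    // in with a small fixed floor weight, so those key lecture words are
    // never filtered out merely for being rare.
    static Map<String, Double> relativeWeights(Map<String, Long> counts) {
        long total = counts.values().stream().mapToLong(Long::longValue).sum();
        Map<String, Double> weights = new LinkedHashMap<>();
        counts.forEach((word, count) -> weights.put(word, (double) count / total));
        return weights;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("camión", 30L);
        counts.put("cigüeña", 10L);
        counts.put("automático", 60L);
        System.out.println(relativeWeights(counts));
        // {camión=0.3, cigüeña=0.1, automático=0.6}
    }
}
```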
_______________________________________________
Matterhorn mailing list
[email protected]
http://lists.opencastproject.org/mailman/listinfo/matterhorn
To unsubscribe please email [email protected]
_______________________________________________
