Re: [Matterhorn-users] [Matterhorn-users] OCR - Tesseract engine with traditional Chinese

Rubén Pérez Fri, 22 Jun 2012 10:00:57 -0700

Tobias,

I sent a mail to the committers list a while ago. It was a looong mail (I
know I can be too wordy most times), but it described in detail the problems
I found when trying to internationalize (i18n) the OCR and dictionary services.

The reason I didn't create tickets or follow up on that topic is that we are
in the middle of a release, which itself follows another release that was
delayed *sine die*. Modifying the Dictionary Service and the OCR implies some
decision making and a lot of refactoring which, I think, is out of scope for
the time being.

I've got some ideas on the matter, but I would like to discuss such a
refactoring with the other developers/adopters who may be interested in
the topic, so that we can reach a shared understanding of what's needed and
how it should be done. There are *lots* of things that can be done if there
are interested people and available resources. This could even be an
independent project, as OpenCaps used to be.

I can, indeed, create that JIRA ticket and post (an excerpt of) my mail
there, though I prefer email for discussions and tickets for
specific tasks and organization. I think I'll create the ticket and
resurface that email at the same time. I hope this is OK.

Re. the language parameter: as far as I know, the reason for ignoring it was
that the dictionary service could then detect presentations with several
languages in them because, I think, it decides which language is used on a
per-frame basis. The problem is that it makes that decision using the text
obtained from tesseract, which needs to know the language beforehand. If we
want to limit the text detection to the language specified in the
mediapackage, and at the same time keep the ability to handle
multilingual presentations (this is a manner of speaking: we cannot "keep"
what is not already there, since the text extraction never worked well for us
and I don't know of any success stories in this area), we need to find a way
to specify *several* languages for a presentation, have tesseract analyse the
text once per specified language, and somehow keep the best results
for each language. That's extremely complicated from my point
of view. Perhaps we can consider making a first version that allows only
one language and, once that is *really* working, improve it to support
multiple languages.
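
As a rough illustration of the idea above (not Matterhorn's actual code), the
following Python sketch builds a tesseract command line with an explicit
language and then picks, among several per-language OCR outputs, the one that
best matches its own dictionary. It relies on tesseract's "-l" command-line
option; the helper names (build_tesseract_command, dictionary_hit_ratio,
pick_best_language) and the dictionary-based scoring are assumptions made for
the example, just one possible way to "filter the best results".

```python
from typing import Dict, List, Set


def build_tesseract_command(image_path: str, output_base: str, language: str) -> List[str]:
    """Build the argument list for one tesseract run with an explicit language."""
    # "-l" selects the <language>.traineddata file from the tessdata directory.
    return ["tesseract", image_path, output_base, "-l", language]


def dictionary_hit_ratio(text: str, dictionary: Set[str]) -> float:
    """Fraction of whitespace-separated words in `text` found in `dictionary`.

    Hypothetical scoring: splitting on whitespace will not work for
    languages written without spaces, such as Chinese.
    """
    words = [w.lower() for w in text.split()]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in dictionary)
    return hits / len(words)


def pick_best_language(texts: Dict[str, str], dictionaries: Dict[str, Set[str]]) -> str:
    """Return the language whose OCR output best matches its own dictionary."""
    return max(
        texts,
        key=lambda lang: dictionary_hit_ratio(texts[lang], dictionaries.get(lang, set())),
    )
```

A single-language first version would only need build_tesseract_command, e.g.
build_tesseract_command("frame.tif", "frame", "chi_tra") for Traditional
Chinese; the scoring part only matters once several languages are allowed.
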

There is still the question of *how* the language is specified. Is it in
natural language (Spanish, German)? If so, which language should we use:
the language name in English or in the language itself (Español, Deutsch,
Suomi)? Or should we use a language code
<http://en.wikipedia.org/wiki/Language_code>?
And, if so, which one (es_ES, es, spa, etc.)?
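
Whatever format is chosen, it eventually has to be turned into the name of a
tesseract .traineddata file (eng, spa, deu, chi_tra, ...). The Python sketch
below shows one way this normalization might look, accepting es_ES, es or spa
alike. The mapping tables are a small illustrative subset, and the function
name to_tesseract_language is hypothetical; none of this exists in Matterhorn.

```python
from typing import Optional

# Illustrative subset only: ISO 639-1 two-letter codes mapped to the
# three-letter names tesseract uses for its .traineddata files.
ISO_639_1_TO_TESSERACT = {"en": "eng", "es": "spa", "de": "deu", "fi": "fin"}

# Chinese needs the region to choose between traditional and simplified data.
REGIONAL_OVERRIDES = {"zh_tw": "chi_tra", "zh_hk": "chi_tra", "zh_cn": "chi_sim"}

KNOWN_TESSERACT_NAMES = set(ISO_639_1_TO_TESSERACT.values()) | set(REGIONAL_OVERRIDES.values())


def to_tesseract_language(code: str) -> Optional[str]:
    """Map codes such as "es_ES", "es" or "spa" to a tesseract language name.

    Returns None when the code is not in the (deliberately small) tables.
    """
    normalized = code.strip().replace("-", "_").lower()
    if normalized in REGIONAL_OVERRIDES:
        return REGIONAL_OVERRIDES[normalized]
    if normalized in KNOWN_TESSERACT_NAMES:
        return normalized  # already a tesseract name
    primary = normalized.split("_")[0]  # "es_es" -> "es"
    return ISO_639_1_TO_TESSERACT.get(primary)
```

Returning None for unknown codes leaves it to the caller to decide whether to
fall back to tesseract's default (English) or to skip text extraction.
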

(Hey, who says I'm wordy? :P)

Anyway, I'll create the ticket and see if it draws some interest from the
adopters/developers.

Regards
Rubén

2012/6/22 Tobias Wunden <[email protected]>

> Hi Ruben,
>
> Is there a ticket in Jira that describes your findings? If not, would you
> mind creating one?
>
> Since we have an (optional) language field in Dublin Core as well as
> dictionaries which may help us detect the correct language, it should be
> possible for Matterhorn to pass the correct language parameter to
> Tesseract.
>
> Tobias
>
>
> On 22.06.2012, at 04:00, Rubén Pérez <[email protected]> wrote:
>
> Fernando,
>
> Currently the dictionary and OCR services don't support i18n very well. In
> fact, they don't give very good results with English either. See
> http://opencast.3480289.n2.nabble.com/OCR-Data-tp7580069p7580241.html .
>
> The main problem with tesseract is that the service doesn't really run
> tesseract with the correct parameters. tesseract accepts a "language"
> parameter which indicates which range of characters it should be expecting.
> This parameter is not mandatory (it defaults to English), but obviously you
> need to specify it for any other language to get satisfactory results.
> Well, Matterhorn does NOT pass this parameter, and that's why it fails to
> detect Chinese ideograms.
>
> In Vigo we have, however, used a trick so that tesseract uses the Spanish
> trained data instead of the English one: in the tessdata directory
> (normally /usr/local/share/tessdata) there is at least one file named
> "eng.traineddata", which tesseract uses by default. If you rename your
> Chinese .traineddata file to "eng.traineddata", then tesseract will use the
> Chinese characters to detect the words.
>
> Re. the Chinese dictionary, I can only guess that the database has
> problems with the encoding of the Chinese characters, which take several
> bytes each in UTF-8 (typically three), while unaccented Western characters
> take a single byte. I wouldn't be surprised if the
> code implicitly assumed that all characters are 1 byte long, which
> obviously will break with multi-byte UTF-8 characters, but I'm just
> guessing here and perhaps that's not the case. Feel free to file a bug,
> providing as much information as you can, if you cannot figure out why the
> Chinese dictionary isn't working (I believe there's no official Chinese
> dictionary in Matterhorn, so I'm assuming you created it yourself; you may
> as well include it in the ticket).
>
> Regards
> Rubén
>
> 2012/6/22 費納德費納德 <[email protected]>
>
>> Hello,
>>
>> we are trying to make the Matterhorn core server work with traditional
>> Chinese characters, but I cannot find a way to achieve this. Is
>> there a way to install Chinese traineddata with the Tesseract engine? Does
>> anybody know how to do this? Or has anybody succeeded in installing another
>> language like Japanese or Simplified Chinese?
>>
>> If I am not wrong, the installed Tesseract engine version is 3.0, so it
>> should support this feature. Another problem I found: how can I install
>> the Chinese dictionary? When I try to install it I get errors in the log
>> all the time. I suppose it is because of some issue with the character
>> encoding.
>>
>> Regards,
>>
>> Fernando Hernandez
>>
>> _______________________________________________
>> Matterhorn-users mailing list
>> [email protected]
>> http://lists.opencastproject.org/mailman/listinfo/matterhorn-users
>>
>>

