Re: [Matterhorn-users] [Matterhorn-users] OCR - Tesseract engine with traditional Chinese

Rubén Pérez Fri, 22 Jun 2012 10:06:10 -0700

Sorry for the double mail.

The mail I was referring to was sent to the Matterhorn list (not the
Committers list) and can be found here:
http://opencast.3480289.n2.nabble.com/Technical-reflections-about-the-OCR-and-its-improvement-tp7506108.html


Regards

2012/6/22 Rubén Pérez <[email protected]>

> Tobias,
>
> I sent a mail to the committers list a while ago. It was a looong mail (I
> know I can be too wordy most times), but I described in detail the problems
> I found when trying to i18n-ize the OCR and dictionary service.
>
> The reason I didn't create tickets or continue with that topic is that we
> are in the middle of a release, which follows another release delayed *sine
> die*. Modifying the Dictionary Service and the OCR implies some decision
> making and a lot of refactoring, which, I think, is out of scope for the
> time being.
>
> I've got some ideas on the matter, but I would like to discuss such
> refactoring with the other developers/adopters who may be interested in
> the topic, so that we can reach an understanding of what's needed and
> how it should be done. There are *lots* of things that can be done if there
> are people and resources interested. This could even be an independent
> project, such as OpenCaps used to be.
>
> I can, indeed, create that JIRA ticket and post (an excerpt of) my mail
> there, though I prefer email for discussions and tickets for specific
> tasks and organization. I think I'll create the ticket and resurface that
> email at the same time. I hope this is OK.
>
> Re. the language parameter: as far as I know, the reason for ignoring it
> was so that the dictionary service could detect presentations with several
> languages in them, because, I think, it decides which language is in use on
> a per-frame basis. The problem is that it makes that decision using the
> text obtained from tesseract, which needs to know the language beforehand.
> If we want to limit the text detection to the language specified in the
> mediapackage, and at the same time keep the ability to have multilingual
> presentations (this is a manner of speaking: we cannot "keep" what is not
> already there --the text extraction never worked well for us, nor do I
> know of any success stories in this area), we need to find a way to specify
> *several* languages in a presentation, have tesseract analyse the text
> once per specified language, and somehow keep the best results for each
> language. That's extremely complicated from my point of view. Perhaps we
> can consider making a first version that allows only one language and,
> when that is *really* working, improving it to support multiple languages.
> There is still the question of *how* the language is specified. Is it in
> natural language (Spanish, German)? If so, which language should we use:
> the language name in English or in the language itself (Español, Deutsch,
> Suomi)? Or should we use a language code
> <http://en.wikipedia.org/wiki/Language_code>? And, if so, which one
> (es_ES, es, spa, etc.)?
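
Whichever form the mediapackage ends up carrying, something has to turn it into the name of a tesseract traineddata file. A minimal Python sketch of that step, assuming a small hand-written table: `to_tesseract_lang` and `_TESSERACT_LANG` are hypothetical names for illustration, not Matterhorn code. The values (eng, spa, deu, fin, chi_tra, chi_sim) are names tesseract uses for its language data files.

```python
# Hypothetical lookup table: keys are lower-cased forms a mediapackage might
# carry (codes or language names); values are tesseract traineddata names.
_TESSERACT_LANG = {
    "en": "eng", "eng": "eng", "english": "eng",
    "es": "spa", "spa": "spa", "spanish": "spa", "español": "spa",
    "de": "deu", "deu": "deu", "ger": "deu", "german": "deu", "deutsch": "deu",
    "fi": "fin", "fin": "fin", "finnish": "fin", "suomi": "fin",
    "zh-tw": "chi_tra", "zh-hant": "chi_tra",
    "zh-cn": "chi_sim", "zh-hans": "chi_sim",
}


def to_tesseract_lang(value, default="eng"):
    """Map a language code or name to a tesseract language name."""
    # Normalise case and separator so "es_ES" and "es-es" look the same.
    key = value.strip().lower().replace("_", "-")
    if key in _TESSERACT_LANG:
        return _TESSERACT_LANG[key]
    # Otherwise try only the primary subtag, e.g. "es" from "es-es".
    primary = key.split("-")[0]
    return _TESSERACT_LANG.get(primary, default)
```

With a table like this, es_ES, es, spa and Español would all resolve to the same traineddata name, so the choice of notation matters less than agreeing on one normalisation step.
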
>
> (Hey, who says I'm wordy? :P)
>
> Anyway, I'll create the ticket and see if it draws some interest from the
> adopters/developers.
>
> Regards
> Rubén
>
> 2012/6/22 Tobias Wunden <[email protected]>
>
>> Hi Ruben,
>>
>> is there a ticket in Jira that describes your findings, and if not, do
>> you mind creating one?
>>
>> Since we have an (optional) language field in Dublin Core as well as
>> dictionaries which may help us detect the correct language, it should be
>> possible for Matterhorn to specify the correct language parameter to
>> Tesseract.
>>
>> Tobias
>>
>>
>> On 22.06.2012, at 04:00, Rubén Pérez <[email protected]> wrote:
>>
>> Fernando,
>>
>> Currently the dictionary and OCR services don't support i18n very well.
>> In fact, they don't give very good results with English either. See
>> http://opencast.3480289.n2.nabble.com/OCR-Data-tp7580069p7580241.html .
>>
>> The main problem with tesseract is that the service doesn't run it with
>> the correct parameters. tesseract accepts a "language" parameter which
>> indicates which set of characters it should expect. This parameter is not
>> mandatory (it defaults to English), but obviously you need to specify it
>> for any other language to get satisfactory results. Matterhorn does NOT
>> pass this parameter, and that's why it fails to detect Chinese ideograms.
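
For reference, the language parameter discussed here is the `-l` option of the tesseract command line (`tesseract imagename outputbase -l lang`). A minimal Python sketch of a call that passes it, assuming the tesseract binary is on the PATH and that the matching traineddata file (for example chi_tra.traineddata for traditional Chinese) is installed. `build_tesseract_command` and `run_ocr` are hypothetical helpers for illustration, not the Matterhorn service code.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argv list for: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run tesseract on one image and return the recognised text."""
    # Assumption: tesseract is on PATH and <lang>.traineddata is installed.
    subprocess.run(
        build_tesseract_command(image_path, output_base, lang), check=True
    )
    # tesseract writes its text result to <output_base>.txt
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

A call such as `run_ocr("frame.tif", "frame", lang="chi_tra")` would then ask tesseract for traditional Chinese instead of falling back to English.
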
>>
>> In Vigo we have, however, used a trick so that tesseract uses the
>> Spanish trained data instead of the English one: in the tesseract data
>> directory (normally /usr/local/share/tessdata) there is at least one file
>> named "eng.traineddata", which tesseract uses by default. If you rename
>> your Chinese .traineddata file to "eng.traineddata", then tesseract will
>> use the Chinese characters to detect the words.
>>
>> Re. the Chinese dictionary, I can only guess that the database has
>> problems with the encoding of the Chinese characters, which occupy high
>> code points (typically three bytes each in UTF-8), while Western
>> characters normally occupy the lowest ones (one byte for plain ASCII). I
>> wouldn't be surprised if the code implicitly assumed that all characters
>> are one byte long, which obviously breaks with multi-byte UTF-8
>> characters, but I'm just guessing here and perhaps that's not the case.
>> Feel free to file a bug, providing as much information as you can, if you
>> cannot figure out why the Chinese dictionary isn't working (I believe
>> there's no official Chinese dictionary in Matterhorn, so I'm assuming you
>> created it yourself --you may as well include it in the ticket).
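
To make the byte-length guess concrete: in UTF-8 an ASCII letter takes one byte, an accented Latin letter such as ñ takes two, and most Chinese characters take three. The Python sketch below only illustrates that property and how a one-byte-per-character assumption would break. It says nothing about what the Matterhorn dictionary code actually does, and the helper names are hypothetical.

```python
def utf8_lengths(chars):
    """Map each character to the number of bytes it occupies in UTF-8."""
    return {ch: len(ch.encode("utf-8")) for ch in chars}


def first_byte_decodes(ch):
    """Return True if the first UTF-8 byte of ch is valid text on its own."""
    try:
        ch.encode("utf-8")[:1].decode("utf-8")
        return True
    except UnicodeDecodeError:
        # Cutting a multi-byte character after one byte leaves invalid UTF-8.
        return False
```

Here `utf8_lengths("añ費")` gives 1, 2 and 3 bytes respectively, and `first_byte_decodes("費")` is False, which is the kind of failure code would hit if it treated every byte as a whole character.
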
>>
>> Regards
>> Rubén
>>
>> 2012/6/22 費納德費納德 <[email protected]>
>>
>>> Hello,
>>>
>>> we are trying to make the Matterhorn core server work with traditional
>>> Chinese characters, but I cannot find a way to achieve this. Is there a
>>> way to install Chinese traineddata with the Tesseract engine? Does
>>> anybody know how to do this? Or has anybody succeeded in installing
>>> another language, like Japanese or Simplified Chinese?
>>>
>>> If I am not wrong, the installed Tesseract engine version is 3.0, so it
>>> should support this feature. Another problem I found: how can I install
>>> the Chinese dictionary? When I try to install it I get errors in the log
>>> all the time; I suppose it is because of some issue with the character
>>> encoding.
>>>
>>> Regards,
>>>
>>> Fernando Hernandez
>>>
>>> _______________________________________________
>>> Matterhorn-users mailing list
>>> [email protected]
>>> http://lists.opencastproject.org/mailman/listinfo/matterhorn-users
>>>
>>>
>>
>
