Re: [Matterhorn-users] [Matterhorn-users] OCR - Tesseract engine with traditional Chinese

費納德費納德 Thu, 21 Jun 2012 19:27:33 -0700

Thanks a lot, Rubén. I will try the trick you described. I have already read
some of your comments on other issues and you are always really helpful.

About the dictionary, you are right: I created it from Wikipedia, as
explained in the MH docs. I have checked it with Notepad++ and the
characters are stored correctly, so I suppose it is an encoding issue. I
will try to solve it and open a ticket.


Thanks and regards.

Fernando Hernández.


2012/6/22 Rubén Pérez <[email protected]>

> Fernando,
>
> Currently the dictionary and OCR services don't support i18n very well. In
> fact, they don't give very good results with English either. See
> http://opencast.3480289.n2.nabble.com/OCR-Data-tp7580069p7580241.html .
>
> The main problem with tesseract is that the service doesn't really run
> tesseract with the correct parameters. tesseract accepts a "language"
> parameter which indicates which range of characters it should expect.
> That parameter is not mandatory (it defaults to English), but obviously you
> need to specify it for any other language to get satisfactory results.
> Well, Matterhorn does NOT pass that parameter and that's why it fails to
> detect Chinese ideograms.
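
For reference, a minimal sketch of what passing the language looks like when tesseract is called by hand. This is not how Matterhorn invokes it; the function names and file names are only for illustration, and `chi_tra` is assumed to be the name of the installed traditional Chinese trained data (i.e. `chi_tra.traineddata` exists in the tessdata directory).

```python
import subprocess


def build_tesseract_command(image_path, output_base, language="chi_tra"):
    """Return the argument list for a tesseract run with an explicit language.

    The command-line form is: tesseract <image> <outputbase> -l <lang>
    """
    return ["tesseract", image_path, output_base, "-l", language]


def run_ocr(image_path, output_base, language="chi_tra"):
    """Run tesseract and return the path of the text file it writes."""
    command = build_tesseract_command(image_path, output_base, language)
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(command, check=True)
    # tesseract appends ".txt" to the output base name.
    return output_base + ".txt"
```

Calling `run_ocr("frame.tif", "frame")` should leave the recognized text in `frame.txt`, provided tesseract and the trained data are installed.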
>
> In Vigo we have, however, used a trick so that tesseract uses the Spanish
> trained data instead of the English one: in the tesseract data directory
> (normally /usr/local/share/tessdata) there is at least one file named
> "eng.traineddata", which tesseract uses by default. If you rename your
> Chinese .traineddata file to "eng.traineddata", then tesseract will use the
> Chinese characters to detect the words.
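
The renaming trick can be sketched as below. This is only an illustration under some assumptions: the function name is made up, the Chinese file is assumed to be called `chi_tra.traineddata`, and the script copies instead of renaming so that the original English data is kept as a backup.

```python
import shutil
from pathlib import Path


def use_as_default_language(tessdata_dir, replacement_name):
    """Make tesseract's default (eng) trained data point at another language.

    Copies <replacement_name> over eng.traineddata, keeping the original
    English file as eng.traineddata.orig the first time it runs.
    """
    tessdata = Path(tessdata_dir)
    default = tessdata / "eng.traineddata"
    backup = tessdata / "eng.traineddata.orig"
    replacement = tessdata / replacement_name

    if not replacement.is_file():
        raise FileNotFoundError(replacement)

    # Keep the real English data around so the change can be undone.
    if default.exists() and not backup.exists():
        shutil.copy2(default, backup)

    shutil.copy2(replacement, default)
    return default
```

For example, `use_as_default_language("/usr/local/share/tessdata", "chi_tra.traineddata")`, run with enough permissions to write in that directory. Copying `eng.traineddata.orig` back over `eng.traineddata` undoes the change.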
>
> Re. the Chinese dictionary, I can only guess that the database has
> problems with the encoding of the Chinese characters, which sit at high
> code points and take several bytes each in UTF-8 (three for most of
> them), while Western characters normally take a single byte. I wouldn't
> be surprised if the code implicitly assumed that all characters are
> 1 byte long, which obviously breaks with multi-byte UTF-8 characters,
> but I'm just guessing here and perhaps it's not the case. Feel free to
> file a bug, providing as much information as you can, if you cannot
> figure out why the Chinese dictionary isn't working (I believe there's
> no official Chinese dictionary in Matterhorn, so I'm assuming you
> created it yourself; you may as well include it in the ticket).
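
For what it is worth, the byte lengths are easy to check: in UTF-8 an ASCII letter takes 1 byte, an accented Latin letter 2 bytes, and most Chinese characters 3 bytes. The sketch below shows this, and also how a cut made by byte count can land in the middle of a character. The function names are hypothetical, and I do not know whether the dictionary code actually does anything like this.

```python
def utf8_lengths(text):
    """Return (character, number of UTF-8 bytes) for each character."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


def truncate_utf8(text, max_bytes):
    """Cut text to at most max_bytes UTF-8 bytes without splitting a character."""
    encoded = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops the incomplete sequence left at the end of the cut.
    return encoded.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    for ch, size in utf8_lengths("aé費"):
        print(ch, size)
    # A 4-byte limit holds one whole Chinese character plus one stray byte.
    print(truncate_utf8("費納德", 4))
```

Code that counts characters but allocates or compares in bytes (or the reverse) would work for English text and fail for Chinese, which would match the symptoms.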
>
> Regards
> Rubén
>
> 2012/6/22 費納德費納德 <[email protected]>
>
>> Hello,
>>
>> we are trying to make the Matterhorn core server work with traditional
>> Chinese characters, but I cannot find a way to achieve this. Is there a
>> way to install Chinese traineddata with the Tesseract engine? Does
>> anybody know how to do this? Or has anybody succeeded in installing
>> another language, like Japanese or Simplified Chinese?
>>
>> If I am not wrong, the installed Tesseract engine version is 3.0, so it
>> should support this feature. Another problem I found: how can I install
>> the Chinese dictionary? When I try to install it I get errors in the log
>> all the time; I suppose it is because of some issue with the character
>> encoding.
>>
>> Regards,
>>
>> Fernando Hernandez
>>
>> _______________________________________________
>> Matterhorn-users mailing list
>> [email protected]
>> http://lists.opencastproject.org/mailman/listinfo/matterhorn-users
