Re: [Matterhorn-users] [Matterhorn-users] OCR - Tesseract engine with traditional Chinese

Rubén Pérez Thu, 21 Jun 2012 19:00:29 -0700

Fernando,

Currently the dictionary and OCR services don't support i18n very well. In
fact, they don't give very good results with English either. See
http://opencast.3480289.n2.nabble.com/OCR-Data-tp7580069p7580241.html .


The main problem with tesseract is that the service doesn't run
tesseract with the correct parameters. tesseract accepts a "language"
parameter (-l) which tells it which trained data, and therefore which set
of characters, to expect. This parameter is not mandatory (it defaults to
English), but you need to specify it for any other language to get
satisfactory results. Matterhorn does NOT pass this parameter, and that's
why it fails to detect Chinese ideograms.
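
As a rough illustration of what passing the language explicitly looks like
when calling tesseract from a script (this is only a sketch, not
Matterhorn's code; the helper names are made up for the example, and it
assumes chi_tra.traineddata, tesseract's Traditional Chinese data, is
installed):

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="chi_tra"):
    """Return the argument list for a tesseract run with an explicit language.

    Hypothetical helper for illustration only.
    """
    # Command-line form: tesseract <image> <output base> -l <language code>
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="chi_tra"):
    """Run tesseract; it writes the recognized text to <output_base>.txt."""
    subprocess.run(build_tesseract_command(image_path, output_base, lang),
                   check=True)
    return output_base + ".txt"
```

Calling run_ocr("slide.png", "slide") would then leave the text in
slide.txt, provided the tesseract binary is on the PATH.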

In Vigo we have, however, used a trick, so that tesseract uses the Spanish
trained data instead of the English one: in the tesseract directory
(normally /usr/local/share/tessdata) there is, at least, one file named
"eng.traineddata", which tesseract uses by default. If you rename your
Chinese .traineddata file to "eng.traineddata", then tesseract will use the
Chinese characters to detect the words.
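
If you prefer to script the trick, something along these lines should do
it. It is a sketch under a few assumptions: the function name and the
".orig" backup name are my own invention, chi_tra is the Traditional
Chinese language code, and you need write permission on the tessdata
directory. It copies instead of renaming, so the original English data is
kept.

```python
import shutil
from pathlib import Path


def swap_default_traineddata(tessdata_dir, lang="chi_tra"):
    """Make <lang>.traineddata the file tesseract loads by default.

    Hypothetical helper: backs up eng.traineddata once, then overwrites
    it with a copy of the requested language's trained data.
    """
    tessdata = Path(tessdata_dir)
    default = tessdata / "eng.traineddata"
    replacement = tessdata / f"{lang}.traineddata"
    backup = tessdata / "eng.traineddata.orig"

    if not replacement.is_file():
        raise FileNotFoundError(replacement)

    # Keep the real English data around so the change can be undone.
    if default.exists() and not backup.exists():
        shutil.copy2(default, backup)

    shutil.copy2(replacement, default)
    return backup
```

For example: swap_default_traineddata("/usr/local/share/tessdata"). To undo
it, copy eng.traineddata.orig back over eng.traineddata.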

Re. the Chinese dictionary, I can only guess that the database has problems
with the encoding of the Chinese characters, which take more bytes in UTF-8
(typically three each) than Western characters (one byte for plain ASCII,
two for most accented letters). I wouldn't be surprised if the code
implicitly assumed that all characters are 1 byte long, which obviously
will break with multi-byte UTF-8 characters, but I'm just guessing here
and perhaps it's not the case. Feel free to file a bug, providing as much
information as you can, if you cannot figure out why the Chinese dictionary
isn't working (I believe there's no official Chinese dictionary in
Matterhorn, so I'm assuming you created it yourself; you may as well
include it in the ticket).
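
For reference, a quick way to check how many bytes UTF-8 needs per
character, and to see why code that treats "one character" and "one byte"
as the same thing goes wrong with Chinese text. This is only a small
demonstration (the function name is made up), not a diagnosis of what the
dictionary service actually does:

```python
def utf8_length(ch):
    """Number of bytes the given string occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# ASCII letter, accented Latin letter, common Chinese character
for ch in ("a", "ñ", "中"):
    print(repr(ch), utf8_length(ch))   # 1, 2 and 3 bytes respectively

word = "中文"
print(len(word))            # 2 characters
print(utf8_length(word))    # 6 bytes
```

A column or buffer sized by character count but filled by byte count would
be too small for such a word, which is one possible source of the errors.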

Un saludo
Rubén

2012/6/22 費納德費納德 <[email protected]>

> Hello,
>
> we are trying to make the Matterhorn core server work with traditional
> Chinese characters, but I cannot find a way to achieve this. Is there a way
> to install Chinese traineddata with the Tesseract engine? Does anybody know
> how to do this? Or has anybody succeeded in installing another language like
> Japanese or Simplified Chinese?
>
> If I am not wrong, the installed Tesseract engine version is 3.0, so it
> should support this feature. Another problem I found: how can I install
> the Chinese dictionary? When I try to install it I get errors in the log
> all the time. I suppose it is because of some issue with the character
> encoding.
>
> Regards,
>
> Fernando Hernandez
>
> _______________________________________________
> Matterhorn-users mailing list
> [email protected]
> http://lists.opencastproject.org/mailman/listinfo/matterhorn-users
>
>

