Re: [Matterhorn-users] [Matterhorn-users] OCR - Tesseract engine with traditional Chinese

Tobias Wunden Fri, 22 Jun 2012 05:36:47 -0700

Hi Ruben,

Is there a ticket in Jira that describes your findings? If not, would you mind 
creating one?


Since we have an (optional) language field in Dublin Core as well as 
dictionaries which may help us detect the correct language, it should be 
possible for Matterhorn to specify the correct language parameter to Tesseract.
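
As a rough sketch of that idea (not Matterhorn's actual code), the Dublin Core
language value could be translated into Tesseract's language name before the
command line is built. The mapping table, the function names and the fallback
to English are hypothetical; only the "tesseract <image> <outputbase> -l <lang>"
invocation and the language names (eng, spa, jpn, chi_tra, chi_sim) are
Tesseract's own.

```python
# Hypothetical mapping from Dublin Core language values to the names of
# Tesseract's trained data files (<name>.traineddata). Extend as needed.
DC_TO_TESSERACT = {
    "en": "eng",
    "es": "spa",
    "ja": "jpn",
    "zh-tw": "chi_tra",  # Traditional Chinese
    "zh-cn": "chi_sim",  # Simplified Chinese
}


def tesseract_language(dc_language, default="eng"):
    """Return the Tesseract language name for a Dublin Core language value.

    Falls back to `default` when the field is missing or unknown, which
    mirrors Tesseract's own default of English.
    """
    if not dc_language:
        return default
    key = dc_language.strip().lower().replace("_", "-")
    return DC_TO_TESSERACT.get(key, default)


def build_tesseract_command(image_path, output_base, dc_language=None):
    """Build the argument list: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", image_path, output_base,
            "-l", tesseract_language(dc_language)]
```

The resulting list could be handed to subprocess.run() as is; the matching
.traineddata file still has to be present in the tessdata directory.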

Tobias

On 22.06.2012, at 04:00, Rubén Pérez <[email protected]> wrote:

> Fernando,
> 
> Currently the dictionary and OCR services don't support i18n very well. In 
> fact, they don't give very good results with English either. See 
> http://opencast.3480289.n2.nabble.com/OCR-Data-tp7580069p7580241.html .
> 
> The main problem with tesseract is that the service doesn't run it with the 
> correct parameters. tesseract accepts a "language" parameter (-l) that 
> selects the trained data, and therefore the set of characters, it should 
> expect. The parameter is not mandatory (it defaults to English), but you 
> need to specify it for any other language to get satisfactory results. 
> Matterhorn does NOT pass this parameter, and that's why it fails to detect 
> Chinese ideograms. 
> 
> In Vigo we have, however, used a trick so that tesseract uses the Spanish 
> trained data instead of the English one: in the tesseract data directory 
> (normally /usr/local/share/tessdata) there is at least one file named 
> "eng.traineddata", which tesseract uses by default. If you rename your 
> Chinese .traineddata file to "eng.traineddata", then tesseract will use the 
> Chinese characters to detect the words.
> 
> Re. the Chinese dictionary, I can only guess that the database has problems 
> with the encoding of the Chinese characters, which occupy higher code points 
> in the standard (three bytes each in UTF-8), while Western characters 
> normally take the lowest (one byte for ASCII). I wouldn't be surprised if 
> the code implicitly assumed that all characters are 1 byte long, which 
> would obviously break with multi-byte UTF-8 characters, but I'm just 
> guessing here and perhaps that's not the case. Feel free to file a bug, 
> providing as much information as you can, if you cannot figure out why the 
> Chinese dictionary isn't working (I believe there's no official Chinese 
> dictionary in Matterhorn, so I'm assuming you created it yourself; you may 
> as well include it in the ticket).
> 
> Un saludo
> Rubén
> 
> 2012/6/22 費納德費納德 <[email protected]>
> Hello,
>  
> we are trying to make the Matterhorn core server work with traditional 
> Chinese characters, but I cannot find a way to achieve this. Is there a way 
> to install Chinese traineddata with the Tesseract engine? Does anybody know 
> how to do this? Or has anybody succeeded in installing another language 
> like Japanese or Simplified Chinese?
>  
> If I am not wrong, the installed Tesseract engine version is 3.0, so it 
> should support this feature. Another problem I found: how can I install the 
> Chinese dictionary? When I try to install it I get errors in the log all the 
> time. I suppose it is because of some issue with the character encoding.
>  
> Regards,
>  
> Fernando Hernandez
> 
> _______________________________________________
> Matterhorn-users mailing list
> [email protected]
> http://lists.opencastproject.org/mailman/listinfo/matterhorn-users
> 
