Fernando,

Currently the dictionary and OCR services don't support i18n very well. In fact, they don't give very good results with English either. See http://opencast.3480289.n2.nabble.com/OCR-Data-tp7580069p7580241.html.
The main problem with tesseract is that the service doesn't run it with the correct parameters. tesseract accepts a "language" parameter that tells it which trained data, and therefore which set of characters, to expect. The parameter is not mandatory (it defaults to English), but you need to specify it for any other language to get satisfactory results. Matterhorn does NOT pass this parameter, and that's why it fails to detect Chinese ideograms.

In Vigo we have, however, used a trick so that tesseract uses the Spanish trained data instead of the English one. In the tessdata directory (normally /usr/local/share/tessdata) there is at least one file named "eng.traineddata", which tesseract uses by default. If you rename your Chinese .traineddata file to "eng.traineddata", tesseract will use the Chinese characters to detect the words.

Re. the Chinese dictionary, I can only guess that the database has problems with the encoding of the Chinese characters, which take higher code points in the standard (three bytes each in UTF-8), while Western characters normally take the lowest ones (one byte, or two for accented letters). I wouldn't be surprised if the code implicitly assumed that all characters are 1 byte long, which will obviously break with multi-byte UTF-8 characters, but I'm just guessing here and perhaps that's not the case.

Feel free to file a bug, providing as much information as you can, if you cannot figure out why the Chinese dictionary isn't working (I believe there's no official Chinese dictionary in Matterhorn, so I'm assuming you created it yourself; you may as well include it in the ticket).

Regards,
Rubén

2012/6/22 費納德費納德 <[email protected]>

> Hello,
>
> we are trying to make the Matterhorn core server work with traditional Chinese
> characters, but I cannot find a way to achieve this. Is there a way
> to install Chinese traineddata with the Tesseract engine? Does anybody know how
> to do this? Or has anybody succeeded in installing another language like
> Japanese or Simplified Chinese?
>
> If I am not wrong, the Tesseract engine version installed is 3.0, so it
> should support this feature. Another problem I found: how can I install
> the Chinese dictionary? When I try to install it I get errors in the log
> all the time. I suppose it is because of some issue with the character
> encoding.
>
> Regards,
>
> Fernando Hernandez
>
> _______________________________________________
> Matterhorn-users mailing list
> [email protected]
> http://lists.opencastproject.org/mailman/listinfo/matterhorn-users
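For reference, here is a minimal sketch of calling tesseract directly with an explicit language, which is a quick way to check the trained data outside Matterhorn. It is not Matterhorn's own code: the helper names and the file names "slide.png" and "slide" are made up for illustration. It assumes that the tesseract binary is on the PATH and that a "chi_tra.traineddata" file (the usual name for traditional Chinese) is present in the tessdata directory.

```python
import subprocess


def build_tesseract_command(image_path, output_base, language="eng"):
    """Return the tesseract command line with an explicit language.

    Uses the standard CLI form: tesseract <image> <outputbase> -l <lang>.
    The language code must match a <lang>.traineddata file in tessdata.
    """
    return ["tesseract", image_path, output_base, "-l", language]


def run_ocr(image_path, output_base, language="eng"):
    """Run tesseract and return the path of the text file it writes."""
    command = build_tesseract_command(image_path, output_base, language)
    # check=True raises CalledProcessError if tesseract exits with an error,
    # for example when the requested trained data cannot be loaded.
    subprocess.run(command, check=True)
    # tesseract appends ".txt" to the output base name.
    return output_base + ".txt"


if __name__ == "__main__":
    # Hypothetical input image; replace with a real frame or slide.
    print(run_ocr("slide.png", "slide", language="chi_tra"))
```

If this gives sensible text for a Chinese slide, the trained data is fine and the remaining problem is only that the service does not pass the language.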
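On the encoding guess, a small sketch that checks how many bytes a given character takes in UTF-8 may help when writing up a ticket. The helper name is made up for illustration, and this says nothing about what the dictionary code in Matterhorn actually does; it only shows why code that assumes one byte per character would misbehave on Chinese text.

```python
def utf8_length(character):
    """Return the number of bytes the character occupies in UTF-8."""
    return len(character.encode("utf-8"))


# An ASCII letter, an accented Western letter, and a Chinese ideogram.
for character in ("a", "ñ", "費"):
    print(character, utf8_length(character))
```

This prints 1 for "a", 2 for "ñ" and 3 for "費".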
