Thanks a lot, Rubén. I will try the trick you told me about. I have already read some of your comments on other issues and you are always really helpful.
And about the dictionary, you are right: I created it from Wikipedia, as explained in the MH docs. I have checked it with Notepad++ and the characters are stored correctly. So I suppose it is an encoding issue; I will try to solve it and open a ticket.

Thanks and best regards,
Fernando Hernández

2012/6/22 Rubén Pérez <[email protected]>

> Fernando,
>
> Currently the dictionary and OCR services don't support i18n very well. In
> fact, they don't give very good results with English either. See
> http://opencast.3480289.n2.nabble.com/OCR-Data-tp7580069p7580241.html .
>
> The main problem with tesseract is that the service doesn't really run
> tesseract with the correct parameters. tesseract accepts a "language"
> parameter which indicates which range of characters it should be expecting.
> This parameter is not mandatory (it defaults to English), but obviously you
> need to specify it for any other language to get satisfactory results.
> Well, Matterhorn does NOT include this parameter, and that's why it fails
> to detect Chinese ideograms.
>
> In Vigo we have, however, used a trick so that tesseract uses the Spanish
> trained data instead of the English one: in the tesseract directory
> (normally /usr/local/share/tessdata) there is at least one file named
> "eng.traineddata", which tesseract uses by default. If you rename your
> Chinese .traineddata file to "eng.traineddata", then tesseract will use the
> Chinese characters to detect the words.
>
> Re. the Chinese dictionary, I can only guess that the database has
> problems with the encoding of the Chinese characters, which use higher
> code points in the standard (three bytes each in UTF-8 for most of them),
> while Western characters normally take the lowest (a single byte for
> ASCII). I wouldn't be surprised if the code implicitly assumed that all
> characters are 1 byte long, which obviously will break with multi-byte
> UTF-8 characters, but I'm just guessing here and perhaps it's not the
> case. Feel free to file a bug, providing as much information as you can,
> if you cannot figure out why the Chinese dictionary isn't working (I
> believe there's no official Chinese dictionary in Matterhorn, so I'm
> assuming you created it yourself; you may as well include it in the
> ticket).
>
> Regards,
> Rubén
>
> 2012/6/22 費納德費納德 <[email protected]>
>
>> Hello,
>>
>> We are trying to make the Matterhorn core server work with traditional
>> Chinese characters, but I cannot find a way to achieve this. Is there a
>> way to install Chinese traineddata with the Tesseract engine? Does
>> anybody know how to do this? Or has anybody succeeded in installing
>> another language, like Japanese or Simplified Chinese?
>>
>> If I am not wrong, the Tesseract engine version installed is 3.0, so it
>> should support this feature. I found another problem: how can I install
>> the Chinese dictionary? When I try to install it I get errors in the log
>> all the time. I suppose it is because of some issue with the character
>> encoding.
>>
>> Regards,
>>
>> Fernando Hernandez
>>
>> _______________________________________________
>> Matterhorn-users mailing list
>> [email protected]
>> http://lists.opencastproject.org/mailman/listinfo/matterhorn-users
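
For reference, the "language" parameter discussed in the quoted message is the `-l` option of the tesseract command line. Below is a minimal Python sketch of how a call with an explicit language might look. It assumes a traditional Chinese trained-data file named `chi_tra.traineddata` is present in the tessdata directory; the function names and the file names `slide.png` / `slide` are hypothetical and are not taken from Matterhorn, which (per the message above) does not pass this parameter.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang="chi_tra"):
    """Build the argument list for a tesseract run with an explicit language.

    `lang` must match an installed <lang>.traineddata file; "chi_tra" is an
    assumption for traditional Chinese.
    """
    return ["tesseract", str(image), str(output_base), "-l", lang]


def run_ocr(image, output_base, lang="chi_tra"):
    """Run tesseract and return the path of the text file it writes."""
    cmd = build_tesseract_command(image, output_base, lang)
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(cmd, check=True)
    # tesseract appends ".txt" to the output base name.
    return Path(str(output_base) + ".txt")


# Only builds and prints the command; nothing is executed here.
print(" ".join(build_tesseract_command("slide.png", "slide")))
```

Compared with renaming the Chinese file to "eng.traineddata", passing the language explicitly leaves the English data untouched, but it requires changing whatever code launches tesseract.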
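
As a quick illustration of the encoding guess in the thread, the following Python sketch shows how many bytes a few sample characters occupy in UTF-8, and what happens when multi-byte data is read as if every byte were one character. It says nothing about how the Matterhorn dictionary service actually stores its data; the helper name `utf8_length` and the sample characters are chosen only for this example.

```python
def utf8_length(ch):
    """Return the number of bytes the given text occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# An ASCII letter, an accented Latin letter, and a Chinese character.
for ch in ("a", "ñ", "中"):
    print(repr(ch), utf8_length(ch))

# Decoding UTF-8 bytes with a one-byte-per-character encoding does not
# fail, but it silently turns one Chinese character into several wrong ones.
misread = "中".encode("utf-8").decode("latin-1")
print(len(misread))
```

A file can therefore look correct in an editor that detects UTF-8 and still be corrupted by a later step that assumes one byte per character.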
