Tobias,

I sent a mail to the committers list a while ago. It was a looong mail (I know I can be too wordy most of the time), but in it I described in detail the problems I found when trying to internationalize the OCR and dictionary services.

The reason I didn't create tickets or follow up on that topic is that we are in the middle of a release, which itself follows another release that was delayed *sine die*. Modifying the Dictionary Service and the OCR implies some decision making and a lot of refactoring, which, I think, is out of scope for the time being. I've got some ideas on the matter, but I would like to discuss such refactoring with the other developers/adopters who may be interested in the topic, so that we can reach a common understanding of what's needed and how it should be done. There are *lots* of things that can be done if there are people and resources interested. This could even be an independent project, such as OpenCaps used to be.

I can, indeed, create that JIRA ticket and post (an excerpt of) my mail there, though I prefer email for discussions and tickets for specific tasks and organization. I think I'll create the ticket and resurface that email at the same time. I hope this is OK.

Re. the language parameter: as far as I know, the reason for ignoring it was so that the dictionary service could detect presentations with several languages in them because, I think, it decides which language is in use on a per-frame basis. The problem is that it makes that decision using the text obtained from tesseract, which needs to know the language beforehand.

If we want to limit the text detection to the language specified in the mediapackage, and at the same time keep the ability to have multilingual presentations (this is a manner of speaking: we cannot "keep" what is not already there, since the text extraction never worked well for us, nor do I know of any success stories in this area), we need to find a way to specify *several* languages in a presentation, have tesseract analyse the text as many times as there are languages specified, and somehow pick the best result among the languages. That's extremely complicated from my point of view.
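
To make the multi-language idea a bit more concrete, here is a rough Python sketch. It is not Matterhorn code: the function names, the word-list dictionaries and the "hit ratio" score are all assumptions made up for illustration. The only real interface it relies on is the tesseract command line (`tesseract <image> <outputbase> -l <lang>`, which writes `<outputbase>.txt`). It runs the OCR once per candidate language and keeps the result whose words are best covered by that language's dictionary.

```python
import subprocess
from pathlib import Path
from typing import Callable, Dict, List, Set, Tuple


def tesseract_command(image: str, outbase: str, lang: str) -> List[str]:
    """Build the command line: tesseract <image> <outbase> -l <lang>."""
    return ["tesseract", image, outbase, "-l", lang]


def run_tesseract(image: str, lang: str) -> str:
    """Run tesseract for one language and return the recognised text.

    Needs tesseract and the matching <lang>.traineddata to be installed.
    """
    outbase = f"{image}.{lang}"  # hypothetical naming for the output file
    subprocess.run(tesseract_command(image, outbase, lang), check=True)
    return Path(outbase + ".txt").read_text(encoding="utf-8")


def hit_ratio(text: str, words: Set[str]) -> float:
    """Fraction of OCR tokens found in a language's word list.

    Splitting on whitespace is an assumption that does NOT hold for
    languages such as Chinese, which would need a real segmenter.
    """
    tokens = [t.strip(".,;:!?()\"'").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t in words for t in tokens) / len(tokens)


def best_language(
    image: str,
    dictionaries: Dict[str, Set[str]],
    ocr: Callable[[str, str], str] = run_tesseract,
) -> Tuple[str, str, float]:
    """OCR the frame once per language and keep the best-scoring result."""
    if not dictionaries:
        raise ValueError("at least one candidate language is required")
    results = []
    for lang, words in dictionaries.items():
        text = ocr(image, lang)
        results.append((hit_ratio(text, words), lang, text))
    score, lang, text = max(results, key=lambda r: r[0])
    return lang, text, score
```

Note that this costs one full tesseract run per frame and per language, which is part of why I consider the multi-language case complicated. The `ocr` argument is only there so the selection logic can be tried without tesseract installed.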

Perhaps we can consider making a first version that allows only one language and, once that is *really* working, improve it to support multiple languages.

There is still the question of *how* the language is specified. Is it in natural language (Spanish, German)? If so, which form should we use: the language name in English, or in the language itself (Español, Deutsch, Suomi)? Or should we use a language code <http://en.wikipedia.org/wiki/Language_code>? And, if so, which one (es_ES, es, spa, etc.)?

(Hey, who says I'm wordy? :P)

Anyway, I'll create the ticket and see if it draws some interest from the adopters/developers.

Regards
Rubén

2012/6/22 Tobias Wunden <[email protected]>

> Hi Ruben,
>
> is there a ticket in Jira that describes your findings, and if not, do you
> mind creating one?
>
> Since we have an (optional) language field in dublin core as well as
> dictionaries which may help us detect the correct language, it should be
> possible for Matterhorn to specify the correct language parameter to
> Tesseract.
>
> Tobias
>
>
> On 22.06.2012, at 04:00, Rubén Pérez <[email protected]> wrote:
>
> Fernando,
>
> Currently the dictionary and OCR services don't support i18n very well. In
> fact, they don't give very good results with English either. See
> http://opencast.3480289.n2.nabble.com/OCR-Data-tp7580069p7580241.html .
>
> The main problem with tesseract is that the service doesn't really run
> tesseract with the correct parameters. tesseract accepts a "language"
> parameter which indicates which range of characters it should be expecting.
> Such parameter is not mandatory (it defaults to English), but obviously you
> need to specify it for any other language to get satisfactory results.
> Well, Matterhorn does NOT include such parameter and that's why it fails to
> detect Chinese ideograms.
>
> In Vigo we have, however, used a trick, so that tesseract uses the Spanish
> trained data instead of the English one: in the tesseract directory
> (normally /usr/local/share/tessdata) there is, at least, one file named
> "eng.traineddata", which tesseract uses by default. If you rename your
> Chinese .traineddata file to "eng.traineddata", then tesseract will use the
> Chinese characters to detect the words.
>
> Re. the Chinese dictionary, I can just guess that the database has
> problems with the Chinese characters encoding, which use the highest codes
> in the standard (two bytes in length), while the occidental characters
> normally take the lowest (up to one byte). I wouldn't be surprised if the
> code assumed implicitly that all characters are 1-byte long, which
> obviously will break with UTF-8 2-byte-long characters, but I'm just
> guessing here and perhaps it's not the case. Feel free to file a bug,
> providing as much information as you can, if you cannot figure out why the
> Chinese dictionary isn't working (I believe there's no official Chinese
> dictionary in Matterhorn, so I'm assuming you created it yourself --you may
> as well include it in the ticket).
>
> Un saludo
> Rubén
>
> 2012/6/22 費納德費納德 <[email protected]>
>
>> Hello,
>>
>> we are trying to make Matterhorn core server work with traditional
>> Chinese characters. But I can not find a way to achieve this target. Is
>> there a way to install Chinese traineddata with Tesseract engine? does
>> anybody know how to do this? Or does anybody succeeded installing another
>> language like Japanese or Simplified Chinese?
>>
>> If I am not wrong the Tesseract engine version installed is 3.0 so it
>> should support this feature. And another problem I found, how can I install
>> the Chinese dictionary? when I try to install it I get errors in the log
>> all the time, I suppose it is because some issue with the character
>> codification.
>>
>> Regards,
>>
>> Fernando Hernandez
>>
>> _______________________________________________
>> Matterhorn-users mailing list
>> [email protected]
>> http://lists.opencastproject.org/mailman/listinfo/matterhorn-users
