Sorry for the double mail. The mail I was referring to was sent to the Matterhorn list (not the Committers list) and can be found here: http://opencast.3480289.n2.nabble.com/Technical-reflections-about-the-OCR-and-its-improvement-tp7506108.html
Regards

2012/6/22 Rubén Pérez <[email protected]>

> Tobias,
>
> I sent a mail to the committers list a while ago. It was a looong mail (I
> know I can be too wordy most times), but I described in detail the
> problems I found when trying to i18n-ize the OCR and dictionary service.
>
> If I didn't create tickets or continue with that topic, it was because we
> are in the middle of a release, which comes after another release delayed
> *sine die*. Modifying the Dictionary Service, and the OCR, implies some
> decision making and a lot of refactoring, which, I think, is out of scope
> for the time being.
>
> I've got some ideas on the matter, but I would like to discuss such
> refactoring with the other developers/adopters who may be interested in
> the topic, so that we can have a shared understanding of what's needed and
> how it should be done. There are *lots* of things that can be done if
> there are people and resources interested. This could even be an
> independent project, such as OpenCaps used to be.
>
> I can, indeed, create that JIRA ticket and post (an excerpt of) my mail
> there, though I prefer email for discussions, and then tickets for
> specific tasks and organization. I think I'll create the ticket and
> resurface that email at the same time. I hope this is OK.
>
> Re. the language parameter: as far as I know, the reason for ignoring it
> was that the dictionary service could detect presentations with several
> languages in them, because, I think, it decides which is the language on a
> per-frame basis. The problem is that it makes that decision using the text
> obtained from tesseract, which needs to know the language beforehand. If
> we want to limit the text detection to the language specified in the
> mediapackage, and at the same time we want to keep the ability of having
> multilingual presentations (this is a manner of speaking, we cannot "keep"
> what is not already there --the text extraction never worked well for us,
> nor do I know of cases of success in this area), we need to find a way to
> specify *several* languages in a presentation, have tesseract analyse the
> text as many times as there are languages specified, and somehow filter
> the best results for one language and the other. That's extremely
> complicated from my point of view. Perhaps we can consider making a first
> version that allows only one language, and, when that is *really* working,
> improving it to support multiple languages.
>
> There is still the question of *how* the language is specified. Is it in
> natural language (Spanish, German)? If so, which language should we use,
> the language name in English or in the language itself (Español, Deutsch,
> Suomi)? Or should we use a language code
> <http://en.wikipedia.org/wiki/Language_code>? And, if so, which one?
> (es_ES, es, spa, etc.)
>
> (Hey, who says I'm wordy? :P)
>
> Anyway, I'll create the ticket and see if it draws some interest from the
> adopters/developers.
>
> Regards
> Rubén
>
> 2012/6/22 Tobias Wunden <[email protected]>
>
>> Hi Ruben,
>>
>> is there a ticket in Jira that describes your findings, and if not, do
>> you mind creating one?
>>
>> Since we have an (optional) language field in Dublin Core as well as
>> dictionaries which may help us detect the correct language, it should be
>> possible for Matterhorn to pass the correct language parameter to
>> Tesseract.
>>
>> Tobias
>>
>>
>> On 22.06.2012, at 04:00, Rubén Pérez <[email protected]> wrote:
>>
>> Fernando,
>>
>> Currently the dictionary and OCR services don't support i18n very well.
>> In fact, they don't give very good results with English either. See
>> http://opencast.3480289.n2.nabble.com/OCR-Data-tp7580069p7580241.html .
>>
>> The main problem with tesseract is that the service doesn't really run
>> tesseract with the correct parameters. tesseract accepts a "language"
>> parameter which indicates which range of characters it should be
>> expecting. This parameter is not mandatory (it defaults to English), but
>> obviously you need to specify it for any other language to get
>> satisfactory results. Well, Matterhorn does NOT pass this parameter, and
>> that's why it fails to detect Chinese ideograms.
>>
>> In Vigo we have, however, used a trick so that tesseract uses the
>> Spanish trained data instead of the English one: in the tesseract data
>> directory (normally /usr/local/share/tessdata) there is, at least, one
>> file named "eng.traineddata", which tesseract uses by default. If you
>> rename your Chinese .traineddata file to "eng.traineddata", then
>> tesseract will use the Chinese characters to detect the words.
>>
>> Re. the Chinese dictionary, I can only guess that the database has
>> problems with the encoding of the Chinese characters, which take several
>> bytes each in UTF-8 (typically three), while Western characters normally
>> take one or two. I wouldn't be surprised if the code implicitly assumed
>> that all characters are one byte long, which obviously will break with
>> multi-byte UTF-8 characters, but I'm just guessing here and perhaps it's
>> not the case. Feel free to file a bug, providing as much information as
>> you can, if you cannot figure out why the Chinese dictionary isn't
>> working (I believe there's no official Chinese dictionary in Matterhorn,
>> so I'm assuming you created it yourself --you may as well include it in
>> the ticket).
>>
>> Un saludo
>> Rubén
>>
>> 2012/6/22 費納德費納德 <[email protected]>
>>
>>> Hello,
>>>
>>> we are trying to make the Matterhorn core server work with traditional
>>> Chinese characters, but I cannot find a way to achieve this. Is there a
>>> way to install Chinese traineddata with the Tesseract engine? Does
>>> anybody know how to do this? Or has anybody succeeded in installing
>>> another language, like Japanese or Simplified Chinese?
>>>
>>> If I am not wrong, the Tesseract engine version installed is 3.0, so it
>>> should support this feature. Another problem I found: how can I install
>>> the Chinese dictionary? When I try to install it I get errors in the
>>> log all the time. I suppose it is because of some issue with the
>>> character encoding.
>>>
>>> Regards,
>>>
>>> Fernando Hernandez
>>>
>>> _______________________________________________
>>> Matterhorn-users mailing list
>>> [email protected]
>>> http://lists.opencastproject.org/mailman/listinfo/matterhorn-users
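
Regarding the language parameter and the open question of which language code to use (es_ES, es, spa, etc.): the following is only a rough sketch, not what Matterhorn does. It assumes the value stored in the mediapackage could be any of those forms and normalizes it to the three-letter names tesseract uses for its .traineddata files (eng, spa, chi_tra, ...) before passing it with "-l". The mapping table and both function names are made up for illustration, and the table is deliberately incomplete.

```python
# Hypothetical mapping from short language codes to the names tesseract
# uses for its traineddata files. Incomplete on purpose; illustration only.
TESSERACT_LANGS = {
    "en": "eng",
    "es": "spa",
    "de": "deu",
    "fi": "fin",
    "ja": "jpn",
    "zh_tw": "chi_tra",  # Traditional Chinese
    "zh_cn": "chi_sim",  # Simplified Chinese
}
KNOWN_TESSERACT_NAMES = set(TESSERACT_LANGS.values())


def tesseract_language(code, default="eng"):
    """Normalize 'es', 'es_ES', 'es-ES' or 'spa' to a tesseract name."""
    if not code:
        return default  # mirrors tesseract's own default (English)
    key = code.strip().lower().replace("-", "_")
    if key in KNOWN_TESSERACT_NAMES:
        return key  # already a tesseract name such as 'spa'
    if key in TESSERACT_LANGS:
        return TESSERACT_LANGS[key]  # exact match such as 'zh_tw'
    # Fall back to the bare language part: 'es_es' -> 'es'
    return TESSERACT_LANGS.get(key.split("_")[0], default)


def build_tesseract_command(image_path, output_base, language_code=None):
    """Build the argument list: tesseract <image> <outputbase> -l <lang>."""
    lang = tesseract_language(language_code)
    return ["tesseract", image_path, output_base, "-l", lang]


if __name__ == "__main__":
    print(build_tesseract_command("frame.tif", "frame", "zh_TW"))
```

The resulting list can be handed to subprocess.run() on a machine where tesseract and the matching .traineddata file are installed. That would avoid the renaming trick with "eng.traineddata" described in the thread.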
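
On the guess about the Chinese dictionary and character encoding: the small check below shows how many bytes a few characters take in UTF-8, and what happens if code cuts a string at a byte position as if every character were one byte long. It is only an illustration of the general failure mode; whether the dictionary service actually does something like this is an assumption that has not been verified. The function name is made up for the example.

```python
def utf8_length(ch):
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    # ASCII, an accented Western letter, and a Traditional Chinese character
    for ch in ("a", "ñ", "費"):
        print(ch, utf8_length(ch))

    # Cutting the byte string at an arbitrary position splits a character
    data = "費納德".encode("utf-8")
    try:
        data[:4].decode("utf-8")
    except UnicodeDecodeError as err:
        print("broken by byte-level truncation:", err.reason)
```

Any code or database column that counts bytes instead of characters would need to be checked with input like this.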
