Hi Ruben,

> I sent a mail to the committers list a while ago. It was a looong mail (I
> know I can be too wordy most times), but I described in detail the problems
> I found to i18n-ize the OCR and dictionary service.

Yes, I remember reading (all!) of it :-)

> If I didn't create tickets or continue with that topic, it was because we
> are in the middle of a release, which comes from another release delayed
> sine die. Modifying the Dictionary Service, and the OCR, implies some
> decision making and a lot of refactoring, which, I think, is out of scope
> for the time being.

No problem. However, my point would be that if we had a ticket, we could refer people to that ticket instead of discussing repeatedly on the list and (more importantly) it would be possible to track status and see whether work on this has begun, or at least when work is scheduled to happen, even if it's later in the game (watching too much soccer right now).

> I've got some ideas on the matter but I would like to discuss such
> refactorization with the other developers/adopters who may be interested in
> the topic so that we can have an understanding of what's needed and how
> it should be done. There are *lots* of things that can be done if there are
> people and resources interested. This could even be an independent project,
> such as OpenCaps used to be.

I understand, and I like that we can finally start digging into the details of internationalization. However, first, I think that there are some easy steps that we could take in order to improve the overall quality, and second, I think it's important to indicate that the community is aware of the fact that there is a problem to solve. So opening a ticket and describing the misery should still be possible and should be the first thing.

> I can, indeed, create that JIRA ticket and post (an excerpt of) my mail
> there. Though I prefer the email for discussions, and then tickets for
> specific tasks and organization. I think I'll create the ticket and
> resurface that email at the same time. I hope this is OK.

Exactly what I think should happen, please do!

> Re. the language parameter: as far as I know, the reason for ignoring it
> was that the dictionary service could detect presentations with several
> languages in them, because, I think, it decides which is the language on a
> per-frame basis. The problem is that it makes that decision using the text
> obtained from tesseract, which needs to know the language beforehand.

Good point.

> If we want to limit the text detection to the language specified in the
> mediapackage, and at the same time we want to keep the ability of having
> multilingual presentations (this is a manner of speaking, we cannot "keep"
> what is not already there --the text extraction never worked well for us,
> nor do I know of cases of success in this area), we need to find a way to
> specify *several* languages in a presentation, and have tesseract analyse
> the text as many times as languages specified, and somehow filter the best
> results for one language and the other. That's extremely complicated from
> my point of view. Perhaps we can consider making a first version that
> allows only one language, and, when that is *really* working, improving it
> to support multiple languages.
>
> There is still the question of *how* the language is specified. Is it in
> natural language (Spanish, German)? If so, which language should we use, the
> language name in English or in the language itself (Español, Deutsch, Suomi)?
> Or should we use a language code? And, if so, which one? (es_ES, es, spa,
> etc.)

That should have been addressed by the metadata working group (Olaf), but I have to admit that I have no clue what the status is.

> (Hey, who says I'm wordy? :P)

...

> Anyway, I'll create the ticket and see if it draws some interest from the
> adopters/developers.

Thanks!

Tobias

_______________________________________________
Matterhorn-users mailing list
[email protected]
http://lists.opencastproject.org/mailman/listinfo/matterhorn-users
