Hi Chris,

I sent one of my too-long-to-read (:P) emails to the list a while ago about this very topic. At the University of Vigo we have been working hard to get the OCR to work in Spanish, and even though we have made some progress, I can say with confidence that it is very hard, if not impossible, to make the OCR work reliably under the current conditions. In short, there are several problems that prevent it from working as it should:
1. *Dictionaries full of garbage*: One of the first things we did was create a new Spanish dictionary, since the one at http://downloads.opencastproject.org/artifacts/ is full of "bad" words, characters that are illegal in Spanish and, in general, a lot of garbage that cannot be considered words. We used the tool in the dictionary service folder in the source code and then removed the words containing non-alphanumeric characters. Of course, "alphanumeric" may mean different things depending on the language (for instance, á is a perfectly legal character in Spanish, while it is not in English).
--> The English dictionary is no exception. Some criterion should be used to filter out illegal words. That would result in a better dictionary: smaller (which means less resource consumption) and more accurate.

2. *Word weights are (not) biased*: Every word in the dictionary has a weight indicating its relative frequency. But it turns out that, quite often, the most relevant words in a presentation are the less common ones, so they are likely to be mistaken for other, more frequent words. Some mechanism should be implemented so that 'keywords' can be specified for a recording and given a higher probability during detection.
--> This probably requires modifying the current dictionary service, or creating a new one, to add such functionality.

3. *I18n doesn't work*: This doesn't affect English, but it does affect every other language. Because tesseract is always run without parameters, which means "assume you are extracting English text", words containing non-English characters are not correctly detected. Additionally, any character outside the range [a-zA-Z_0-9] is treated as a word boundary (roughly speaking), which leads to bad word segmentation, and it is those "fragments" that are tested against the dictionary.
--> To make the dictionary service really i18n-able, a deep refactoring is needed.
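To make points 1 and 3 concrete, here is a rough Python sketch of the kind of locale-aware filtering we applied, plus an explicit-language tesseract invocation. This is purely illustrative and is not the actual dictionary service code: the tab-separated "word<TAB>weight" format, the function names, and the character set are assumptions (a real fix would derive the legal character set from the configured locale rather than hard-coding it).

```python
import re

# Letters legal in Spanish words, including accented vowels, ü and ñ.
# ASSUMPTION for illustration only; each language needs its own set.
SPANISH_WORD = re.compile(r"^[a-zA-ZáéíóúüñÁÉÍÓÚÜÑ]+$")

def clean_dictionary(lines):
    """Keep only entries whose word part consists of legal Spanish letters.

    Each line is assumed (hypothetically) to look like 'word<TAB>weight',
    one entry per line. Entries containing digits, punctuation or other
    garbage characters are dropped.
    """
    cleaned = []
    for line in lines:
        parts = line.strip().split("\t")
        if not parts or not parts[0]:
            continue
        if SPANISH_WORD.match(parts[0]):
            cleaned.append(line.strip())
    return cleaned

def tesseract_command(image, out_base, language="spa"):
    """Build a tesseract invocation with an explicit language pack,
    instead of the parameterless call (which defaults to English)."""
    return ["tesseract", image, out_base, "-l", language]
```

So `clean_dictionary(["año\t12", "caf3!\t4", "vigo\t7"])` would keep "año" and "vigo" but drop the garbage entry, and `tesseract_command("slide.png", "slide")` yields a command list ending in `-l spa`, which could be handed to the process runner that launches tesseract.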
--> Also, tesseract can be trained to improve character detection. This is part of tesseract's own configuration, but it has been completely overlooked in the documentation.

I'm willing to propose and implement changes to the service, but I'm saving that until we can get the current release(s) out. In the meantime, I'm sorry to say, I wouldn't consider working OCR a real option unless a good amount of work is put into it locally. As an example, you can take a look at the "Recorded Lectures" section at tv.campusdomar.es. We made changes to the dictionary service halfway through the semester, so you can see the difference between the first recordings and the latest. There is still a lot of garbage, but at least we get some meaningful words that can be used in searches.

I hope this helps, and I'm sorry if I'm too pessimistic about it. Perhaps others have got better results, and I would be glad to see them.

Best regards,
Rubén

2012/5/30 Christopher Brooks <[email protected]>

> Jon,
>
> How about just public content that we could run through the OCR process
> for some test data?
>
> Chris
>
> > We have done very little testing of OCR at UC Berkeley. We have been
> > focusing most of our efforts on having stable capture agents and on
> > content distribution.
> >
> > --
> > Jon
> >
> > On 5/30/12 1:08 PM, Christopher Brooks wrote:
> > > Maybe UCB folks have some as well?
> > >
> > > Or ETH (though, in english)?
> > >
> > > Chris
> > >
> > > On Wed, 30 May 2012 14:01:14 -0600
> > > Christopher Brooks <[email protected]> wrote:
> > >
> > >> Hi,
> > >>
> > >> Is there anyone out there that has OCR working and some production
> > >> data from the system? Alexandru this summer is working on using
> > >> OCR data to build a concept detection system. The idea is that a
> > >> bunch of lectures could be aggregated into the high level
> > >> semantics that they deal with and a sort of concept map for the
> > >> course could be created.
> > >> But to make headway he's looking for
> > >> interesting case studies - does anyone have any production data
> > >> they can share?
> > >>
> > >> Presumably english would be the best. Ruediger, I know you guys
> > >> have some deployment, is any in English? Micah, maybe from UNL?
> > >>
> > >> I'll check around here too,
> > >>
> > >> Chris
>
> --
> Christopher Brooks, BSc, MSc
> ARIES Laboratory, University of Saskatchewan
>
> Web: http://www.cs.usask.ca/~cab938
> Phone: 1.306.966.1442
> Mail: Advanced Research in Intelligent Educational Systems Laboratory
>       Department of Computer Science
>       University of Saskatchewan
>       176 Thorvaldson Building
>       110 Science Place
>       Saskatoon, SK
>       S7N 5C9
_______________________________________________
Matterhorn mailing list
[email protected]
http://lists.opencastproject.org/mailman/listinfo/matterhorn

To unsubscribe please email
[email protected]
_______________________________________________
