Hi Chris,

I sent one of my too-long-to-read (:P) emails to the list a while ago about this very topic. At the University of Vigo we have been working hard to get the OCR to work in Spanish, and even though we have made some progress, I can say with confidence that it is very hard, if not impossible, to make the OCR work reliably under the current conditions. In short, there are several problems that prevent it from working as it should:
1. *Dictionaries full of garbage*: One of the first things we did was create a new Spanish dictionary, since the one at http://downloads.opencastproject.org/artifacts/ is full of "bad" words, characters that are illegal in Spanish and, in general, a lot of garbage that cannot be considered words. We used the tool in the dictionary service folder in the source code and then removed the words containing non-alphanumeric characters. Of course, "alphanumeric" may mean different things depending on the language (for instance, á is a perfectly legal character in Spanish, while it is not in English).
--> The English dictionary is no exception. Some criterion should be used to filter out illegal words. That would result in a better dictionary: smaller (which means less resource consumption) and more accurate.

2. *Word weights are (not) biased*: Every word in the dictionary has a weight indicating its relative frequency. But it turns out that, quite often, the most relevant words in a presentation are the less common ones, so they are likely to be mistaken for other, more frequent words. Some mechanism should be implemented so that 'keywords' can be specified for a recording and given a higher probability during detection.
--> This probably requires modifying the current dictionary service, or creating a new one, to add such functionality.

3. *I18n doesn't work*: This doesn't affect English, but it does affect every other language. Because tesseract is always run without parameters, which means "assume you are extracting English text", words containing non-English characters are not correctly detected. Additionally, any character outside the range [a-zA-Z_0-9] is treated as a word boundary (roughly speaking), which leads to bad word segmentation, and it is those "fragments" that are tested against the dictionary.
--> To make the dictionary service really i18n-able, a deep refactoring is needed.
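To make points 1 and 3 concrete, here is a rough Python sketch of the kind of locale-aware filtering we applied, plus an explicit-language tesseract invocation. This is purely illustrative and is not the actual dictionary service code: the tab-separated "word<TAB>weight" format, the function names, and the character set are assumptions (a real fix would derive the legal character set from the configured locale rather than hard-coding it).

```python
import re

# Letters legal in Spanish words, including accented vowels, ü and ñ.
# ASSUMPTION for illustration only; each language needs its own set.
SPANISH_WORD = re.compile(r"^[a-zA-ZáéíóúüñÁÉÍÓÚÜÑ]+$")

def clean_dictionary(lines):
    """Keep only entries whose word part consists of legal Spanish letters.

    Each line is assumed (hypothetically) to look like 'word<TAB>weight',
    one entry per line. Entries containing digits, punctuation or other
    garbage characters are dropped.
    """
    cleaned = []
    for line in lines:
        parts = line.strip().split("\t")
        if not parts or not parts[0]:
            continue
        if SPANISH_WORD.match(parts[0]):
            cleaned.append(line.strip())
    return cleaned

def tesseract_command(image, out_base, language="spa"):
    """Build a tesseract invocation with an explicit language pack,
    instead of the parameterless call (which defaults to English)."""
    return ["tesseract", image, out_base, "-l", language]
```

So `clean_dictionary(["año\t12", "caf3!\t4", "vigo\t7"])` would keep "año" and "vigo" but drop the garbage entry, and `tesseract_command("slide.png", "slide")` yields a command list ending in `-l spa`, which could be handed to the process runner that launches tesseract.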
--> Also, tesseract can be trained to improve character detection. This is part of tesseract's own configuration, but it has been completely overlooked in the documentation.

I'm willing to propose and implement changes to the service, but I'm saving that until we can get the current release(s) out. In the meantime, I'm sorry to say, I wouldn't consider working OCR a real option unless a good amount of work is put into it locally. As an example, you can take a look at the "Recorded Lectures" section at tv.campusdomar.es. We made changes to the dictionary service halfway through the semester, so you can see the difference between the first recordings and the latest. There is still a lot of garbage, but at least we get some meaningful words that can be used in searches.

I hope this helps, and I'm sorry if I'm too pessimistic about it. Perhaps others have got better results, and I would be glad to see them.

Best regards,
Rubén

2012/5/30 Christopher Brooks <[email protected]>

> Jon,
>
> How about just public content that we could run through the OCR process
> for some test data?
>
> Chris
>
> > We have done very little testing of OCR at UC Berkeley. We have been
> > focusing most of our efforts on having stable capture agents and on
> > content distribution.
> >
> > --
> > Jon
> >
> > On 5/30/12 1:08 PM, Christopher Brooks wrote:
> > > Maybe UCB folks have some as well?
> > >
> > > Or ETH (though, in english)?
> > >
> > > Chris
> > >
> > > On Wed, 30 May 2012 14:01:14 -0600
> > > Christopher Brooks <[email protected]> wrote:
> > >
> > >> Hi,
> > >>
> > >> Is there anyone out there that has OCR working and some production
> > >> data from the system? Alexandru this summer is working on using
> > >> OCR data to build a concept detection system. The idea is that a
> > >> bunch of lectures could be aggregated into the high level
> > >> semantics that they deal with and a sort of concept map for the
> > >> course could be created.
> > >> But to make headway he's looking for
> > >> interesting case studies - does anyone have any production data
> > >> they can share?
> > >>
> > >> Presumably english would be the best. Ruediger, I know you guys
> > >> have some deployment, is any in English? Micah, maybe from UNL?
> > >>
> > >> I'll check around here too,
> > >>
> > >> Chris
>
> --
> Christopher Brooks, BSc, MSc
> ARIES Laboratory, University of Saskatchewan
>
> Web: http://www.cs.usask.ca/~cab938
> Phone: 1.306.966.1442
> Mail: Advanced Research in Intelligent Educational Systems Laboratory
>       Department of Computer Science
>       University of Saskatchewan
>       176 Thorvaldson Building
>       110 Science Place
>       Saskatoon, SK
>       S7N 5C9
_______________________________________________
Matterhorn mailing list
[email protected]
http://lists.opencastproject.org/mailman/listinfo/matterhorn

To unsubscribe please email
[email protected]
_______________________________________________
