Tobias,

I'd rather wait to see if we *really* get an improvement and then open a
ticket and share the changes. However, I wouldn't say they're bugs *per se*,
but design decisions which are perfectly valid for English but not for
other languages. I'll give you an example: the set of what we could call
"alphanumeric characters" in the English language could be represented as
[a-zA-Z_0-9] (and that's how it's defined in Java, for instance). However,
the set of Spanish "alphanumeric characters" is bigger:
[a-zA-Z_0-9áéíóúÁÉÍÓÚüÜñÑ]. Take French or Portuguese, which use two other
accent marks (` and ^), and there are even more symbols. That
must be taken into account when processing the words. One may think that it
doesn't matter if it's an n or an ñ, but if you think of the Spanish word
"cono" (cone) and the corresponding word with "ñ", it becomes obvious that
those differences matter a great deal.
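
To make the difference concrete, here is a minimal Java sketch. It is not
Matterhorn's actual code: the class and method names (WordChars,
isAsciiWord, isUnicodeWord) are made up for illustration. It compares
Java's default \w, which is ASCII-only, with one possible Unicode-aware
alternative built from the \p{L} (letter) and \p{Nd} (decimal digit)
categories.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of Matterhorn).
final class WordChars {
    // By default, \w in java.util.regex means [a-zA-Z_0-9].
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");

    // Assumed alternative: any Unicode letter, any decimal digit, or '_'.
    static final Pattern UNICODE_WORD = Pattern.compile("[\\p{L}\\p{Nd}_]+");

    // True if the whole string consists of ASCII word characters.
    public static boolean isAsciiWord(String s) {
        return ASCII_WORD.matcher(s).matches();
    }

    // True if the whole string consists of Unicode letters, digits or '_'.
    public static boolean isUnicodeWord(String s) {
        return UNICODE_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file independent of its encoding:
        // \u00f1 is ñ, \u00fc is ü, \u00e7 is ç.
        String[] samples = {"cono", "ni\u00f1o", "ping\u00fcino", "fran\u00e7ais"};
        for (String s : samples) {
            System.out.println(s + "  ascii=" + isAsciiWord(s)
                    + "  unicode=" + isUnicodeWord(s));
        }
    }
}
```

With these patterns, "cono" is accepted by both checks, while a word
containing ñ or ü is rejected by the ASCII pattern and accepted by the
Unicode one. This assumes precomposed characters; text that stores accents
as separate combining marks would need normalizing first (for example with
java.text.Normalizer).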

I don't intend to keep our "investigations" in the dark, but I'd prefer to
have something to show rather than just say "hey, we think this may improve
the OCR", and then be wrong.

Regards
Rubén

2012/4/12 Tobias Wunden <[email protected]>

> Hi Ruben,
>
> would you mind sharing some details around the bugs you found and the
> improvements you are about to suggest? Maybe attach a patch to an open
> ticket?
>
> Thanks,
> Tobias
>
> On 12.04.2012, at 12:40, Rubén Pérez <[email protected]> wrote:
>
> > Dear all,
> >
> > We are currently struggling with the text extraction, too, and we are
> > seeing that Matterhorn is a little anglo-centric and does not like words
> > with characters outside the [a-zA-Z_0-9] range. We are making some
> > developments (partially thanks to Karen's advice --thanks!) but some of
> > these involve changing some Java code and some design decisions which can
> > be regarded as bugs. We want to test this thoroughly and perhaps we'll
> > submit them for the 1.4 version, since this wouldn't be a new feature, but
> > correcting something that is already in.
> >
> > Best regards
> >
> > 2012/4/12 Miguel Del Agua <[email protected]>
> > Thank you very much, but in my case the captures seem to be OK. Anyway,
> > the problem was due to the versions of some third-party tools, and also
> > to an incorrect dictionary loading. More info:
> >
> > http://opencast.3480289.n2.nabble.com/How-to-improve-OCR-performance-tp7433198p7458735.html
> >
> > Regards,
> >
> > Miguel
> >
> >
> > 2012/4/5 費納德費納德 <[email protected]>:
> > > Hello Miguel,
> > >
> > > Take a look at the captures the workflow gets from the video. In my
> > > case I got captures showing only a grey pattern in 90% of the cases, so
> > > the OCR was able to recognize almost nothing. I solved it by
> > > reinstalling ffmpeg and all the dependent packages. Now the OCR works
> > > almost perfectly. But I have some issues with the ffmpeg version,
> > > because with recordings longer than 5 min I get errors during the video
> > > and audio mux. (With version 1.2 I didn't get these errors, only with
> > > 1.3. Maybe I installed something in a different way.)
> > >
> > > So I am not sure if you have this problem with the OCR, but it is
> > > possible.
> > >
> > >
> > > Regards,
> > >
> > > Fernando Hernández Esguevillas.
> > >
> > > PS: If you have any questions about how to install the latest version
> > > of ffmpeg, let me know and I'll send you a link, although the
> > > information is easy to find on Google. Regards.
> > >
> > > On 4 April 2012 00:15, Miguel Del Agua <[email protected]>
> > > wrote:
> > >>
> > >> Hi,
> > >>
> > >> I just installed version 1.3 and it seems to work correctly, but the
> > >> OCR performance is quite poor. I've tried to install a new dictionary
> > >> as described in the wiki, but the performance is still bad. So I would
> > >> like to know if it's possible to improve text recognition either by
> > >> changing some parameters of OCRopus or by improving the dictionary in
> > >> some way.
> > >>
> > >> Thanks in advance.
> > >> _______________________________________________
> > >> Matterhorn-users mailing list
> > >> [email protected]
> > >> http://lists.opencastproject.org/mailman/listinfo/matterhorn-users