Thanks for writing that up. The same training procedure works with
the new commands as well (and the model you trained can also be loaded
and used with the new commands). The C++ recognizer you're using for
that has very good character-classification performance when trained
the way you did (this can also be done "on the fly" for per-book
adaptation).
This is perhaps a good place to explain what we have been doing on the
recognizer since then. First, the existing C++ recognizer didn't handle
Unicode or large character sets. Furthermore, the remaining major
sources of error in that recognizer were not individual character
misclassifications, but ligatures ("fi"), unintentional ligatures
("as"), noise vs diacritics, x-height determination for outputting the
correct case ("oxz" vs "OXZ"), and space modeling ("l l" vs "ll").
(Language models also need improvements.)
These issues aren't conceptually hard, but it's a lot of work to
address all of them. We've made a lot of progress on them and are now
integrating all of that into the new recognizer. All the C++ code can
handle 32-bit codepoints now, and we're updating the Python code to
handle Unicode strings as well (this was actually the driver for using
a Python toplevel, since Unicode support in both C++ and Lua is poor).
Both deliberate and accidental ligatures can now be trained and
recognized as units (so the classifier can output multi-character
results like "fi" and "as" directly), and we have solutions for
the other issues. We also have much better means of visualizing and
analyzing recognition results and errors (e.g., ocropus-showlrecs).
Separately, there will be several other recognizers. You will
be able to plug in Tesseract as a line recognizer again. And there
are a couple of segmentation-free HMM-based recognizers that will
hopefully make it into a future release (post-beta).
Tom
On May 14, 4:22 pm, Mike Bryant <[email protected]> wrote:
> Hi folks,
>
> I've belatedly got around to writing up a little experiment I did a
> while ago for training OCRopus to read a really weird font. It's in
> no way supposed to be taken as an example of real-world usage since
> everything was highly contrived. Like many people on here I've been
> attempting to train it to yield better results on Latin texts by using
> the default character model, altogether without much success. Still,
> it might be helpful.
>
> Caveats:
>
> - doesn't cover the newer training techniques (clustering,
> labelling?) that are apparently available in the newest release
> - the version of OCRopus used was tip with parent 349:ef1e07e86895
> from Feb 23rd
>
> Here's the link:
>
> http://ocropodium.cerch.kcl.ac.uk/?p=82
>
> Also, people attempting training on "difficult" sources might be
> interested in a couple of tools I made for playing with line
> transcripts and character segmentation. They're pretty rough and
> you'll have to build them from source, but again, they might come in
> handy:
>
> http://code.google.com/p/ocropodium/
>
> Cheers,
> Mike
>
--
You received this message because you are subscribed to the Google Groups
"ocropus" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to
[email protected].
For more options, visit this group at
http://groups.google.com/group/ocropus?hl=en.