Re: [ocropus] OCRopus 0.5

Tom Morris Tue, 05 Jun 2012 07:28:27 -0700

On Sat, Jun 2, 2012 at 5:24 PM, Tom <[email protected]> wrote:
> OCRopus 0.5 was released a few weeks ago on Google Code.  There are a lot of
> changes relative to older versions:
>
>
> - OCRopus has been completely refactored and now consists of a set of Python
> modules, with some native code modules.
>
> - Unicode and ligature support should be fully working now.
>
> - Language modeling still uses finite state transducers, but all finite
> state transducer code has been refactored into ocrofst.
>
> - There is a completely new recognizer that performs much better than the
> old recognizer and scales to millions of training samples.
>
> - Databases for training/testing have been changed from SQLite format to
> HDF5 (using PyTables).
>
> - You can pull over everything you need for an install using a single
> command ("hg clone https://code.google.com/p/ocropus")
>
>
> There are some videos on Google showing installation and training:
>
>     http://www.youtube.com/playlist?list=PL8B1A3C55DD915896&feature=mh_lolz
>
> There is also some additional documentation here:
>
>
>   https://docs.google.com/a/iupr.com/document/d/1RxXeuuYJRhrOkK8zcVpYtTo_zMG9u6Y0t4s-VoSVoPs/edit
>
> Image preprocessing and layout analysis are still basically the old versions
> from OCRopus.  They are still fairly sensitive to noise and will be replaced
> in future releases.


Congratulations on the release!  It's great to see progress being made.

For anyone who wants to install on earlier versions of Ubuntu, you can
find the necessary edits to the package names in my repository:
http://code.google.com/r/tfmorris-ocropus-ubuntu-11-10-install-fixes/source/checkout

Where does this version of the code stand relative to
production-quality code?  Is it getting close, or is there still a long
way to go?  I know that much of the recent effort has been put into
refactoring/reimplementation, but I can't see from the web site what
the overall plan is or how much work is left.

One reason I'm asking is that a crude comparison with Tesseract
seems to indicate that OCRopus, in its current state, requires an
order of magnitude more resources without any improvement in
recognition accuracy.  The FST stage of the processing, in particular,
seems very resource-heavy without doing much to improve the raw
text generated by the earlier stages.

Tom

