[ocropus] Re: Tesseract vs OCRopus

support Fri, 24 Oct 2014 21:22:07 -0700

Tom, are you still working on OCR projects? If so, please contact us; we 
have a potential consulting opportunity.


On Friday, May 14, 2010 4:08:38 PM UTC-4, Tom wrote:
>
> > > My original understanding was that OCRopus was using the Tesseract 
> > > recognition engine and was focusing on higher order issues like page 
> > > segmentation/layout analysis, system integration, etc, but more 
> > > recently I believe the Tesseract recognition engine has been replaced 
> > > with either one built from scratch or one derived from a different 
> > > source.  Is this an accurate summary? 
>
> Yes, roughly.  We couldn't use straight Tesseract because it didn't 
> work well on isolated lines, so we added some wrappers around it that 
> allowed it to do so.  These wrappers are broken now because the 
> Tesseract APIs have changed.  In Tesseract 3.0, there are new APIs 
> that are supposed to be stable, so we will be building new interfaces 
> to Tesseract 3.0 once it is available.  At that point, you can choose 
> again between Tesseract and the built-in OCRopus recognizers. 
>
> > > Does anyone have a block diagram of the processing pipeline with the 
> > > alternatives available for each stage in the pipe?  Even better, one 
> > > which includes an analysis of the strengths and weaknesses of the 
> > > components relative to each other? (languages supported, error rates, 
> > > etc) 
>
> You can get a list of all OCRopus components with the "ocropus 
> components" command.  The major components are: 
>
> ICleanupGray -- image preprocessing (default: StandardPreprocessing) 
> ISegmentPage -- page layout analysis (default: SegmentPageByRAST) 
> IRecognizeLine -- text line recognition (linerec; decided during 
> training) 
> IGenericFst -- language modeling (functionally, all implementations 
> are the same; you build language models with pyopenfst) 
>
> For each component, you can modify parameters.  For example, 
>
> ocropus-pages -P SegmentPageByRAST:gap_factor=10 ... 
>
> will run ocropus-pages with an instance of SegmentPageByRAST and the 
> gap_factor set to 10.  You can see all the available parameters with 
> "ocropus params SegmentPageByRAST". 
>
> (You can get usage information for the Python commands with the "-h" 
> argument: "ocropus-pages -h") 
>
> For ICleanupGray and ISegmentPage, there are a few useful alternatives 
> and useful changes to parameter settings, since preprocessing and 
> segmentation are the most common sources of recognition problems.  To 
> see what is happening during those stages, you can run ocropus-binarize, 
> ocropus-pseg, and ocropus-pages with the "-d" argument, 
> which will show you the output of binarization and/or segmentation. 
>
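To make the "-P Component:param=value" and "-d" conventions above concrete, 
here is a minimal Python sketch that assembles such a command line.  The 
helper build_command and the input file name are hypothetical, not part of 
OCRopus.  It also assumes that "-P" may be repeated and combined with "-d", 
which the message does not confirm; check "ocropus-pages -h" before relying 
on it.

```python
import shlex


def build_command(tool, overrides=None, debug=False, inputs=()):
    """Assemble an argv list for an OCRopus command-line tool.

    overrides maps a component name to a dict of parameter values and is
    rendered as "-P Component:param=value" arguments (assumed repeatable).
    """
    argv = [tool]
    if debug:
        # "-d" shows the output of binarization and/or segmentation.
        argv.append("-d")
    for component, params in (overrides or {}).items():
        for name, value in params.items():
            argv += ["-P", f"{component}:{name}={value}"]
    argv += list(inputs)
    return argv


cmd = build_command(
    "ocropus-pages",
    overrides={"SegmentPageByRAST": {"gap_factor": 10}},
    debug=True,
    inputs=["page.png"],  # hypothetical input file
)
print(shlex.join(cmd))  # shell-quoted form, suitable for copy and paste
```

The list in cmd can be passed directly to subprocess.run; building a list 
rather than a string avoids shell quoting problems in file names.
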
> IRecognizeLine is not settable in the recognizer because it is simply 
> a property of the model that you load.  Once we have an interface to 
> Tesseract 3.0, you will be able to just load Tesseract for that 
> component. 
>
> The language models are generated in PyOpenFST; have a look at 
> ocropus-linefst and ocropy.fstutils.load_text_file_as_fst for a simple example 
> of how to construct those. 
>
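As a rough illustration of what turning a text file into a finite-state 
language model involves, here is a plain-Python sketch that builds an 
unweighted acceptor from a list of lines.  It does not use pyopenfst and is 
not necessarily what load_text_file_as_fst does; the function names and the 
dict-based representation are assumptions made for this example, and a real 
model would normally carry weights (costs) on its transitions.

```python
def text_lines_to_acceptor(lines):
    """Build a trie-shaped finite-state acceptor from a list of strings.

    Returns (transitions, finals): transitions maps (state, char) to the
    next state, finals is the set of accepting states.  State 0 is the
    start state.
    """
    transitions = {}
    finals = set()
    next_state = 1
    for line in lines:
        state = 0
        for ch in line:
            key = (state, ch)
            if key not in transitions:
                # Shared prefixes reuse existing states; new suffixes
                # allocate fresh ones.
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        finals.add(state)
    return transitions, finals


def accepts(acceptor, text):
    """Return True if the acceptor recognizes text exactly."""
    transitions, finals = acceptor
    state = 0
    for ch in text:
        state = transitions.get((state, ch))
        if state is None:
            return False
    return state in finals


dictionary = text_lines_to_acceptor(["cat", "car", "dog"])
print(accepts(dictionary, "car"))
print(accepts(dictionary, "ca"))
```

With the three-word list above, "car" is accepted and the prefix "ca" is 
not, because only the states reached at the end of a full line are final.
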
> If you want to see how all the components play together, have a look 
> at ocropus-pages; it is fairly well commented now. 
>
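The order in which the four component roles feed into each other can be 
sketched as below.  This is a toy, not OCRopus code: the Pipeline class and 
all four stand-in functions are hypothetical, and the recognizer and 
language model are reduced to "propose candidates" and "pick one", which is 
a simplification of the FST-based approach described in the message.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[int]]  # toy grayscale page: rows of 0-255 pixel values


@dataclass
class Pipeline:
    cleanup: Callable[[Image], Image]           # ICleanupGray role
    segment: Callable[[Image], List[Image]]     # ISegmentPage role
    recognize: Callable[[Image], List[str]]     # IRecognizeLine role
    language_model: Callable[[List[str]], str]  # IGenericFst role

    def run(self, page: Image) -> List[str]:
        clean = self.cleanup(page)
        lines = self.segment(clean)
        return [self.language_model(self.recognize(line)) for line in lines]


def threshold(page):
    # Binarize: 1 = ink (dark pixel), 0 = background.
    return [[1 if px < 128 else 0 for px in row] for row in page]


def split_rows(page):
    # Group consecutive rows that contain ink into "text lines".
    lines, current = [], []
    for row in page:
        if any(row):
            current.append(row)
        elif current:
            lines.append(current)
            current = []
    if current:
        lines.append(current)
    return lines


def toy_recognizer(line):
    # Stand-in: propose ambiguous candidates based on the ink count.
    ink = sum(sum(row) for row in line)
    return {1: ["l", "1"], 2: ["O", "0"]}.get(ink, ["?"])


def prefer_digits(candidates):
    # Stand-in language model: pick the first all-digit candidate.
    return next((c for c in candidates if c.isdigit()), candidates[0])


pipeline = Pipeline(threshold, split_rows, toy_recognizer, prefer_digits)
page = [[255, 255], [0, 255], [255, 255], [0, 0]]
result = pipeline.run(page)
print(result)
```

Keeping each stage behind a plain callable mirrors the idea that a 
component can be swapped for an alternative without touching the others.
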
> If you want to see how the line recognizer itself works, have a look 
> at ocropy/simplerec.py; again, it is fairly well commented.  However, 
> that's the Python version of the line recognizer, which still lacks 
> some important functionality (statistical space models, size models) 
> that is in the C++ recognizer. 
>
> Tom 

-- 
You received this message because you are subscribed to the Google Groups 
"ocropus" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/ocropus/95c9ca79-5429-4246-a7d3-caef7df8f035%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
