Tom, are you still working on OCR projects? If so, please contact us about a potential consulting opportunity.

On Friday, May 14, 2010 4:08:38 PM UTC-4, Tom wrote:
>
> > > My original understanding was the OCRopus was using the Tesseract
> > > recognition engine and was focusing on higher order issues like page
> > > segmentation/layout analysis, system integration, etc, but more
> > > recently I believe the Tesseract recognition engine has been replaced
> > > with either one built from scratch or one derived from a different
> > > source. Is this an accurate summary?
>
> Yes, roughly. We couldn't use straight Tesseract because it didn't
> work well on isolated lines, so we added some wrappers around it that
> allowed it to do so. These wrappers are broken now because the
> Tesseract APIs have changed. In Tesseract 3.0, there are new APIs
> that are supposed to be stable, so we will be building new interfaces
> to Tesseract 3.0 when we have that. At that point, you can choose
> again between Tesseract and the built-in OCRopus recognizers.
>
> > > Does anyone have a block diagram of the processing pipeline with the
> > > alternatives available for each stage in the pipe? Even better, one
> > > which includes an analysis of the strengths and weakness of the
> > > components relative to each other? (languages supported, error rates,
> > > etc)
>
> You can get a list of all OCRopus components with the "ocropus
> components" command. The major components are:
>
> ICleanupGray -- image preprocessing (default: StandardPreprocessing)
> ISegmentPage -- page layout analysis (default: SegmentPageByRAST)
> IRecognizeLine -- text line recognition (linerec; decided during
> training)
> IGenericFst -- language modeling (functionally, all implementations
> are the same; you build language models with pyopenfst)
>
> For each component, you can modify parameters. For example,
>
>     ocropus-pages -P SegmentPageByRAST:gap_factor=10 ...
>
> (You can get usage information for the Python commands with the "-h"
> argument: "ocropus-pages -h")
>
> will run ocropus-pages with an instance SegmentPageByRAST and the
> gap_factor set to 10. You can see all the available parameters with
> "ocropus params SegmentPageByRAST".
>
> For ICleanupGray and ISegmentPage, there are a few useful alternatives
> and useful changes to parameter settings, since preprocessing and
> segmentation are the most common sources of recognition problems. To
> see what is happening during those stages, you can run
> ocropus-binarize, ocropus-pseg, and ocropus-pages with the "-d" argument,
> which will show you the output of binarization and/or segmentation.
>
> IRecognizeLine is not settable in the recognizer because it is simply
> a property of the model that you load. Once we have an interface to
> Tesseract 3.0, you will be able to just load Tesseract for that
> component.
>
> The language models are generated in PyOpenFST; have a look at
> ocropus-linefst and ocropy.fstutils.load_text_file_as_fst for a simple
> example of how to construct those.
>
> If you want to see how all the components play together, have a look
> at ocropus-pages; it is fairly well commented now.
>
> If you want to see how the line recognizer itself works, have a look
> at ocropy/simplerec.py; again, it is fairly well commented. However,
> that's the Python version of the line recognizer, which still lacks
> some important functionality (statistical space models, size models)
> that are in the C++ recognizer.
>
> Tom
>
> --
> You received this message because you are subscribed to the Google Groups
> "ocropus" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/ocropus?hl=en.

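For anyone reading this thread later, here is a rough Python sketch of the pipeline and the "-P Component:param=value" form described in the quoted message. It is not OCRopus code and says nothing about how OCRopus parses -P internally: PIPELINE, parse_override, collect_overrides, and describe are hypothetical names made up for illustration. Only the four interface names, the two defaults, and the gap_factor example come from Tom's message.

```python
# Illustrative only: NOT OCRopus code. All function names are hypothetical.

# (interface, default implementation) in processing order, as listed in the
# quoted message. None means the message names no default implementation.
PIPELINE = [
    ("ICleanupGray", "StandardPreprocessing"),  # image preprocessing
    ("ISegmentPage", "SegmentPageByRAST"),      # page layout analysis
    ("IRecognizeLine", None),                   # property of the loaded model
    ("IGenericFst", None),                      # language model
]


def parse_override(spec):
    """Split 'Component:param=value' into (component, param, value).

    The value stays a string; the message does not say how values are
    converted, so no conversion is attempted here.
    """
    component, colon, assignment = spec.partition(":")
    param, equals, value = assignment.partition("=")
    if not (colon and equals and component and param):
        raise ValueError("expected Component:param=value, got %r" % spec)
    return component, param, value


def collect_overrides(specs):
    """Group overrides by component: {component: {param: value}}."""
    overrides = {}
    for spec in specs:
        component, param, value = parse_override(spec)
        overrides.setdefault(component, {})[param] = value
    return overrides


def describe(pipeline, overrides):
    """Return one text line per stage, with overrides for its default."""
    lines = []
    for interface, default in pipeline:
        params = overrides.get(default, {}) if default else {}
        settings = ", ".join("%s=%s" % item for item in sorted(params.items()))
        name = default or "(no default named)"
        suffix = " [" + settings + "]" if settings else ""
        lines.append("%s -> %s%s" % (interface, name, suffix))
    return lines


if __name__ == "__main__":
    specs = ["SegmentPageByRAST:gap_factor=10"]
    for line in describe(PIPELINE, collect_overrides(specs)):
        print(line)
```

Keeping the stages as an ordered list mirrors the order in which Tom lists them; the real parameter names and defaults for each component are whatever "ocropus params <component>" reports.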
