There are two main line recognizers: the C++-based one and the Python-based one. They work similarly but differ significantly in their details.
The C++-based recognizer is implemented in ocr-line/linerec.cc. It uses classes implementing IFeatureMap to extract features; such classes extract features for the whole text line at once. If you use SimpleFeatureMap, you can configure which features to extract through the command line. We've done extensive benchmarks comparing different feature extraction and normalization strategies, but there is no clear-cut best answer at the line recognition level. Generally, gradients, skeletal features, and holes are good features to use.

In addition, you can set a per-character feature extractor (IExtractor) in any instance of IModel. This is how the Python-based line recognizers do feature extraction: extraction happens separately for each character. The same kinds of features are available; the StandardExtractor also extracts gradients, skeletal features, and holes. The IExtractor also takes care of size and per-character slant normalization (if any) and some noise removal.

The IFeatureMap approach is faster but less flexible, which is why we're moving to the IExtractor model.

Tom

On Jul 2, 3:33 pm, afsina wrote:
> Hello
>
> We already have binarized, line and word segmented images available
> for training. What are the default feature extraction mechanisms in
> OCRopus and how can i utilize them? What is the default way of
> training the model using those data? Any information is much
> appreciated because documentation is really scarce.
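
To illustrate the difference between whole-line and per-character feature extraction described in the reply, here is a minimal Python sketch. It is not OCRopus code: every function name, the toy image, and the character spans are assumptions made up for illustration, and a simple horizontal gradient stands in for the real gradient, skeleton, and hole features. The whole-line function processes the full text line in one pass, while the per-character function crops each character, normalizes its size, and only then extracts features.

```python
# Illustrative only: none of these names are OCRopus APIs.

def gradient(img):
    """Absolute horizontal difference between neighboring pixels."""
    return [[abs(row[x + 1] - row[x]) for x in range(len(row) - 1)]
            for row in img]

def line_feature_map(line):
    """Whole-line style: one pass over the full text-line image."""
    return gradient(line)

def rescale(img, size):
    """Nearest-neighbor resize to size x size (stand-in for size normalization)."""
    h, w = len(img), len(img[0])
    return [[img[y * h // size][x * w // size] for x in range(size)]
            for y in range(size)]

def per_char_features(line, spans, size=4):
    """Per-character style: crop, normalize, then extract for each character."""
    feats = []
    for x0, x1 in spans:
        crop = [row[x0:x1] for row in line]
        feats.append(gradient(rescale(crop, size)))
    return feats

# Toy binarized text line (3 rows, 8 columns) holding two "characters".
line = [
    [0, 1, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 1, 0, 0],
]
spans = [(0, 4), (4, 8)]  # hypothetical character segmentation (column ranges)

whole = line_feature_map(line)          # one feature map for the whole line
chars = per_char_features(line, spans)  # one fixed-size feature block per character

print(len(whole), len(whole[0]))
print(len(chars), len(chars[0]), len(chars[0][0]))
```

The whole-line version does less work per character, since it needs no cropping or rescaling. The per-character version can apply a different normalization to each character, which is the flexibility trade-off the reply describes.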
