There are two main line recognizers: the C++-based one and the Python-
based one.  They work similarly, but differ significantly in their
details.

The C++-based recognizer is implemented in ocr-line/linerec.cc.  It
uses classes implementing IFeatureMap for extracting features; such
classes extract features for the whole text line at once.  If you use
SimpleFeatureMap, you can configure which features to extract from
the command line.  We've run extensive benchmarks comparing different
feature extraction and normalization strategies, but there is no
clear-cut best answer at the line recognition level.  Generally,
gradients, skeletal features, and holes are good features to use.
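
To make the "whole line at once" idea concrete, here is a minimal
sketch in plain Python.  It is not the OCRopus code and does not use
the IFeatureMap interface; the names gradient_maps and
column_features are hypothetical, and it assumes the line image is a
list of rows of 0/1 pixels (1 = ink).  It only illustrates computing
gradient features over the entire line image in one pass.

```python
# Illustrative sketch only: these names are hypothetical, not the OCRopus API.

def gradient_maps(line):
    """Compute horizontal and vertical gradient maps for a whole line image.

    `line` is a list of rows, each a list of 0/1 pixel values (1 = ink).
    Returns (gx, gy), each the same shape as `line`, using central
    differences with pixels outside the image treated as background (0).
    """
    h, w = len(line), len(line[0])

    def px(y, x):
        # Out-of-bounds pixels count as background.
        return line[y][x] if 0 <= y < h and 0 <= x < w else 0

    gx = [[px(y, x + 1) - px(y, x - 1) for x in range(w)] for y in range(h)]
    gy = [[px(y + 1, x) - px(y - 1, x) for x in range(w)] for y in range(h)]
    return gx, gy


def column_features(line):
    """Summarize the gradient maps as one (|gx|, |gy|) pair per pixel column."""
    gx, gy = gradient_maps(line)
    h, w = len(line), len(line[0])
    return [
        (sum(abs(gx[y][x]) for y in range(h)),
         sum(abs(gy[y][x]) for y in range(h)))
        for x in range(w)
    ]
```

Summing per column is just one assumed way to turn the maps into a
sequence of feature vectors along the line; skeletal features and
holes would be further maps computed over the same image.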

You can also set a per-character feature extractor (IExtractor) in
any instance of IModel.  This is how the Python-based line
recognizers do feature extraction: features are extracted separately
for each character.  The same kinds of features are available; the
StandardExtractor also extracts gradients, skeletal features, and
holes.  The IExtractor also takes care of size and per-character
slant normalization (if any) and some noise removal.
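
As a rough illustration of the per-character approach, the sketch
below normalizes each character bitmap to a fixed size before turning
it into a feature vector.  Again, this is plain Python with
hypothetical names (crop_to_ink, normalize_size, extract), not the
IExtractor or StandardExtractor code.  It assumes 0/1 bitmaps and
covers only size normalization; slant normalization and noise removal
are left out.

```python
# Illustrative sketch only: hypothetical names, not the OCRopus IExtractor API.

def crop_to_ink(char):
    """Crop a character bitmap (list of 0/1 rows) to its ink bounding box."""
    rows = [y for y, row in enumerate(char) if any(row)]
    cols = [x for x in range(len(char[0])) if any(row[x] for row in char)]
    if not rows:
        return [[0]]  # blank image: nothing to crop
    return [row[cols[0]:cols[-1] + 1] for row in char[rows[0]:rows[-1] + 1]]


def normalize_size(char, size=8):
    """Rescale the cropped bitmap to size x size by nearest-neighbor sampling."""
    box = crop_to_ink(char)
    h, w = len(box), len(box[0])
    return [[box[y * h // size][x * w // size] for x in range(size)]
            for y in range(size)]


def extract(char, size=8):
    """Return a flat, fixed-length feature vector for one character."""
    return [p for row in normalize_size(char, size) for p in row]
```

Because every character ends up on the same size x size grid, the
classifier always sees vectors of the same length, whatever the
original size of the character was.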

The IFeatureMap approach is faster, but it's less flexible, which is
why we're moving to the IExtractor model.

Tom

On Jul 2, 3:33 pm, afsina <[email protected]> wrote:
> Hello
>
> We already have binarized, line and word segmented images available
> for training. What are the default feature extraction mechanisms in
> OCRopus and how can I utilize them? What is the default way of
> training the model using those data? Any information is much
> appreciated because documentation is really scarce.
