Re: training "i" using make_ConnectComponentSegmenter() method

Thomas Breuel Fri, 06 Feb 2009 06:30:29 -0800

>
> Dear Mezhirov,
> I need to know about the procedure to train *"i"* using the
> make_ConnectComponentSegmenter() method. For this when I provide
> *i* as the transcription of the image, then the output of the lua script
> (train-bpnet-lines.lua) is an error message which is :
>
> *narray: index out of range in function addTrainingLine (at
> train-bpnet-lines.lua file)*
>
> Well, I understand that as the connected component segmenter provide two
> colors for character image "i" so we should provide two transcription for
> this.



There are two cases you need to distinguish.

For training, you need a correct segmentation, plus the transcription.  The
only way to get a correct segmentation is by creating it by hand, or by
using an alignment procedure.  None of the segmentation methods in OCRopus
will give you a correct segmentation in general.

For recognition, you use one of the built-in segmenters.  They generate an
oversegmentation.  CurvedCutSegmenter was designed for handwriting and works
passably well for printed Western languages.  It probably won't work well
for Bangla.  The connected component segmenter doesn't work well for
anything other than very clean printed Western fonts.  It is mostly there
for control experiments.


> However, at this moment I just want to know exactly what strategy you
> follow to training "i". In our script (Bangla) there are so many characters
> which have a disjoint shape and we need to fix a common strategy to train
> them that you are following to train *"i"*.


There are several different strategies you can use, and nobody knows what
the best one is.  You can divide characters into small parts and then train
each small part, giving you a fairly small number of characters, or you can
train larger pieces and have a larger character set.

However, none of the built-in segmenters is likely to work well for Bangla
as is.  The CurvedCutSegmenter might work if you modify it to do right cuts
instead of left cuts (since you want to cut to the right of the vertical
lines).

The next version of OCRopus (soon) will have support for large character set
training.  I think a simple segmentation plus large character set training
will be important.

Please see the discussion here:

http://sites.google.com/site/ocropus/languages/devanagari-hindi-sanskrit

So, the specific answer to your question is: if you want to train the letter
"i" as the letter "i", then you need to ensure that all its pixels have the
same color, and you need to transcribe it with exactly one character.
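
To make that concrete, here is a minimal sketch in plain Python (not OCRopus
code).  It assumes the segmentation is a label image in which each integer
stands for one segment color and 0 is background; merge_labels and
count_segments are hypothetical helpers written for this illustration.  It
merges the dot and the stem of an "i" into one segment so that the number of
segments matches the one-character transcription.

```python
def merge_labels(seg, keep, absorb):
    """Return a copy of `seg` with every `absorb` label replaced by `keep`."""
    return [[keep if v == absorb else v for v in row] for row in seg]


def count_segments(seg, background=0):
    """Count the distinct non-background labels in a label image."""
    return len({v for row in seg for v in row if v != background})


# Toy label image of an "i": label 1 is the dot and label 2 is the stem,
# which is the kind of result a connected-component labelling gives.
seg_i = [
    [0, 1, 0],
    [0, 0, 0],
    [0, 2, 0],
    [0, 2, 0],
]
transcription = "i"

# Two segments but one character: training input like this is inconsistent.
print(count_segments(seg_i), len(transcription))

# Give the dot and the stem the same label so there is exactly one segment.
merged = merge_labels(seg_i, keep=1, absorb=2)
print(count_segments(merged), len(transcription))
```

Checking that the segment count equals the transcription length before
training is a cheap way to catch this kind of mismatch early.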

One more thing: there are many ways in which training can fail and throw an
exception.  Our code is exception-safe, so if you get an exception in some
main loop, you can simply catch it and continue training or processing with
the next image.  There won't be any storage leak or data structure left in
an undefined state (if there is, it's a bug and please report it).
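
As a rough sketch of such a main loop (plain Python, not the actual Lua
training script): train_all and fake_add_training_line are hypothetical
names, and the fake trainer only imitates a failure when the number of
segments and the transcription length disagree.

```python
def train_all(lines, add_training_line):
    """Feed each (segments, transcription) pair to the trainer.

    A line that raises is recorded and skipped; training continues
    with the next line.
    """
    trained = 0
    failed = []
    for index, (segments, transcription) in enumerate(lines):
        try:
            add_training_line(segments, transcription)
            trained += 1
        except Exception as exc:
            # Remember which line failed and why, then move on.
            failed.append((index, str(exc)))
    return trained, failed


def fake_add_training_line(segments, transcription):
    # Hypothetical stand-in for a real trainer: it fails when the number
    # of segments does not match the number of transcribed characters.
    if len(segments) != len(transcription):
        raise IndexError("index out of range")


lines = [
    (["i"], "i"),              # one segment, one character
    (["dot", "stem"], "i"),    # two segments, one character: fails
    (["a", "b"], "ab"),        # two segments, two characters
]
trained, failed = train_all(lines, fake_add_training_line)
print(trained, failed)
```

Keeping the list of failed lines makes it easy to go back and fix their
segmentation or transcription later.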

Tom
