Hi All,

I am trying to build a ZIP code recognizer for a set of postal
images using OCRopus. I have been using dict2linefst to create a custom
language model for the images.
In doing so I wanted to clarify the following:
1. I have observed that in the dict-costs, each word is associated with a
corresponding weight, which is used internally to build a WFST using
extended OpenFst. Are there any guidelines for assigning these weights
in the dictionary?
2. During OCRopus decoding, the default character model does not yield
any result when used with my custom LM. On the other hand, if I use
2m2-reject.cmodel (found under ocropy), I am able to decode the image
by force-aligning the output with the LM. Any ideas as to why this
should be the case?

Below is the logged output:

$ ocropus-pages --langmod=Custom.fst 000005us1343.tif
[note] line recognizer: <ocrolib.common.RecognizeLine instance at 0x9b5bd0c>
[note] *** 1 000005us1343.tif[0] ***
[info] got 40 bboxes
[info] all = 0
[note] 000005us1343.tif[0] lines: 4
[warn] beam search didn't find a solution (line not in language model)
[warn] beam search didn't find a solution (line not in language model)
[warn] beam search didn't find a solution (line not in language model)
$ ocropus-pages --langmod=Custom.fst --linerec=2m2-reject.cmodel 000005us1343.tif
[note] line recognizer: <ocrolib.common.CmodelLineRecognizer instance at 0x95e0dac>
[note] *** 1 000005us1343.tif[0] ***
[info] got 40 bboxes
[info] all = 0
[note] 000005us1343.tif[0] lines: 4
146.50   12
126.00    7
 82.62   19     Newington ct 06111-

3. Though I am able to obtain results with the above approach, the
accuracy is poor (due to the constrained LM, incorrect City, Zipcode
combinations are output as a result). Do I need to train a cmodel on
this set of images using ocropus-calign, or is there a way of
optimizing it without training?

4. Also, is there a way of assigning weights to the character
model (recognizer) and the LM in OCRopus, for example as is done in
automatic speech recognition with different weights for acoustic
models and language models?
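
To make questions 1 and 4 concrete, here is a minimal sketch of what I have in mind. It assumes the dictionary weights are costs in the usual OpenFst tropical-semiring sense (negative log probabilities, lower is better); I have not confirmed that dict2linefst expects exactly this. The function names, the counts and the recognizer costs are all made up for illustration and are not part of OCRopus.

```python
import math

def counts_to_costs(counts):
    """Turn raw word counts into negative log-probability costs.

    Assumption: the LM uses tropical-semiring weights, so a more
    frequent word should get a lower cost.
    """
    total = sum(counts.values())
    return {word: -math.log(n / total) for word, n in counts.items()}

def rescore(recognizer_costs, lm_costs, lm_weight=1.0):
    """Pick the hypothesis with the lowest combined cost.

    combined = recognizer_cost + lm_weight * lm_cost, analogous to the
    language-model scale factor used in speech recognition.
    Hypotheses missing from the LM are skipped (infinite cost).
    """
    best, best_cost = None, float("inf")
    for hyp, rec_cost in recognizer_costs.items():
        if hyp not in lm_costs:
            continue
        cost = rec_cost + lm_weight * lm_costs[hyp]
        if cost < best_cost:
            best, best_cost = hyp, cost
    return best

# Hypothetical counts from an address list and hypothetical recognizer costs.
lm_costs = counts_to_costs({"Newington": 3, "Hartford": 1})
recognizer_costs = {"Newington": 2.0, "Hartford": 1.2}

print(rescore(recognizer_costs, lm_costs, lm_weight=0.0))  # recognizer only
print(rescore(recognizer_costs, lm_costs, lm_weight=1.0))  # recognizer + LM
```

With lm_weight=0.0 the recognizer alone decides; raising it lets the dictionary prior override a slightly better-scoring but less frequent word. What I am asking is whether OCRopus exposes an equivalent knob.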

Any help on the above will be greatly appreciated.

Regards,
Amrit.

On Jan 2, 8:54 am, Tom <[email protected]> wrote:
> You need to build a language model.  Download the PyOpenFST project
> from Google, then look in the "scripts" subdirectory.  There are a
> bunch of scripts for building language models, including dict2linefst.
>
> We're currently benchmarking a whole bunch of standard language models
> (n-grams, n-graphs, various smoothing and back-off strategies); I hope
> we'll have a report on that in a few months.
>
> (Note that the default recognizer has been trained only on UNLV and
> does not perform all that well on other datasets.)
>
> Tom
>
> On Jan 2, 5:54 am, Benjamin Lambert <[email protected]> wrote:
>
> > Hi all,
>
> > Let's see, I'm running the latest version controlled OCRopus  (at least 
> > within the last couple weeks), on Ubuntu.  It seems to be working.  My 
> > question is:
> > is there some way to specify a dictionary to the recognizer?
>
> > For recognition, I'm getting output that looks like this:
> > "|nd tl1e results of hi$ inVeStigations were pulfli6lled in l8815 # # y]"
>
> > I'd like to be able to specify the set of words that can be recognized, and 
> > have that not include strings like "hi$" and "pulfli6lled".  Is that 
> > possible in OCRopus?
>
> > Best,
> > Ben
>
> > --
> > Benjamin Lambert
> > Ph.D. Student of Computer Science
> > Carnegie Mellon University
> > www.cs.cmu.edu/~belamber
> > Mobile: 617-869-1844
