> I am trying to build a zipcode recognizer for a set of postal images
> using OCRopus. I have been using dict2linefst to create a custom
> language model for the images. In doing so, I wanted to clarify the
> following:
>
> 1. I have observed that in dict-costs, each word is associated with a
> corresponding weight, which is used internally to build a WFST using
> the extended OpenFST. Are there any guidelines for assigning these
> weights in the dictionary? At present I am constructing my LM from a
> dictionary which contains entries such as
>
>     0.0,
>     0.0,City State Zipcode
>     0.0,City State Zipcode
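As an aside, here is a sketch of how a dict-costs style file like the sample above could be expanded into OpenFST's text format, which fstcompile accepts. The helper name dict_to_fst_text and the file names are made up for illustration; costs are tropical-semiring weights, so a reasonable guideline is to use negative log probabilities (an entry with probability p gets cost -log(p)).

```python
# Sketch only: expand (cost, entry) pairs into an OpenFST text-format acceptor.
def dict_to_fst_text(entries):
    """entries: list of (cost, string) pairs.
    Returns (fst_lines, symbol_lines) in OpenFST text format."""
    arcs = []
    symbols = {"<eps>": 0}          # fstcompile reserves label 0 for epsilon
    next_state = 1
    for cost, word in entries:
        state = 0                   # all entries start from state 0
        for i, ch in enumerate(word):
            # symbol-table names cannot contain spaces
            name = "<space>" if ch == " " else ch
            label = symbols.setdefault(name, len(symbols))
            dest = next_state
            next_state += 1
            # put the entry's whole cost on its first arc
            w = cost if i == 0 else 0.0
            arcs.append("%d %d %d %d %f" % (state, dest, label, label, w))
            state = dest
        arcs.append("%d 0" % state)  # mark the last state final, cost 0
    sym_lines = ["%s %d" % (s, i)
                 for s, i in sorted(symbols.items(), key=lambda x: x[1])]
    return arcs, sym_lines

fst_lines, sym_lines = dict_to_fst_text([(0.0, "City State Zipcode")])
# write the two lists to zip.fst.txt and zip.syms, then compile with:
#   fstcompile --acceptor --isymbols=zip.syms zip.fst.txt zip.fst
```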
The dict-costs file is just a sample input to the wdict2wordfst script. However, none of those scripts are really meant for production use. You probably need to write your own script, using the PyOpenFST library or the OpenFST tools, in order to build a working language model.

> 2. The default character model does not yield any result when used
> with my custom LM. On the other hand, if I use 2m2-reject.cmodel
> (found under ocropy/), I am able to decode the image by force-aligning
> the output with the LM. Any inputs as to why this should be the case?

The default is probably a line recognizer model; it may simply not recognize some of the characters at all, and the way it constructs the FST may leave the segmentation graph unconnected. If you use a character model like 2m2-reject.cmodel, it uses the new character recognizer (CmodelLineRecognizer), which tries harder to keep the segmentation graph connected. Note that 2m2-reject.cmodel, while fairly good, doesn't take character geometry into account; a new set of character models that does is currently being trained.

You can see better what's going on with ocropus-showlrecs. You can also look at the raw recognizer output with something like "fstdraw output.fst > output.dot; dotty output.dot" (have a look at the OpenFST documentation).

Note that the preferred way of running OCRopus is *not* ocropus-pages (that mainly serves to illustrate the different software components). Instead, run the sequence:

    ocropus-binarize
    ocropus-pseg
    ocropus-lattices
    ocropus-align

(ocropus-lattices and ocropus-align replace ocropus-calign.)

> 3. Though I am able to obtain results by the above approach, the
> accuracy is poor (due to the constrained LM, wrong City/Zipcode
> combinations are dumped as results). Do I need to train a cmodel on
> this set of images using ocropus-calign, or is there a way of
> optimizing it without training?

You probably need to train a new cmodel.
> 4. Also, is there a way of assigning weights to the character model
> (recognizer) and the LM in OCRopus, for example as is done in
> automatic speech recognition with different weights for acoustic
> models and language models?

You can compile different weights into the language model, which amounts to the same thing. There is some code to do that at recognition time, but currently no command line option (there will be one at some point, to make recognition easier).

> 5. Training with ocropus-calign fails. If I try to use ocropus-calign
> as per the instructions at
> http://groups.google.com/group/ocropus/browse_thread/thread/4f3a2ee1a..
> it always fails to find the ground truth file (.gt.txt); any
> suggestions? (I am trying to build on 2m2-reject.cmodel as the
> character model.)

Without more info, I can't tell what the problem is; but if it fails to find the ground truth file, the file is probably in the wrong place. You can check with "strace -eopen ..." what files it is actually trying to open. Note that ocropus-calign has been replaced by ocropus-lattices + ocropus-align, and the options have changed.

> 6. Any other suggestions which might prove effective in implementing
> such a zip code recognizer using OCRopus, as per your experience?

Recognition rates should be very good with a language model; I've built those kinds of models for handwriting recognition. However, you probably should build the model yourself, rather than using the Python scripts in pyopenfst (see above).

Tom

--
You received this message because you are subscribed to the Google Groups "ocropus" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/ocropus?hl=en.
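P.S. To make the "compile different weights into the language model" remark in point 4 concrete: in OpenFST's text format, the last field of a weighted arc line (and of a weighted final-state line) is a tropical cost, so an ASR-style LM scale amounts to multiplying every cost before compiling. The helper name scale_fst_text below is made up; the sketch assumes transducer-style arc lines (acceptor arcs with four fields would need the same treatment).

```python
# Sketch only: apply an ASR-style LM scale to a language model given in
# OpenFST text format.  In the tropical semiring, multiplying costs by
# lm_scale corresponds to raising the underlying probabilities to that power.
def scale_fst_text(lines, lm_scale):
    out = []
    for line in lines:
        fields = line.split()
        # weighted arc lines have 5 fields (src dst ilabel olabel cost),
        # weighted final-state lines have 2 (state cost); other lines
        # carry no explicit cost and are passed through unchanged
        if len(fields) in (5, 2):
            fields[-1] = "%f" % (float(fields[-1]) * lm_scale)
        out.append(" ".join(fields))
    return out

scaled = scale_fst_text(["0 1 1 1 2.5", "1 2 2 2 0.0", "2 0.5"], 2.0)
# scaled == ["0 1 1 1 5.000000", "1 2 2 2 0.000000", "2 1.000000"]
```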
