No, normally no preprocessing is required. Whatever text you give it is simply used to estimate the probabilities. Note that line breaks matter, since the model explicitly models the start of a line. Furthermore, contexts longer than 3-4 characters may leave the model too sparse (there is no back-off right now). The trickiest part of getting the language model to work is finding the right weights for characters, language models, and whitespace (specified with command line parameters to ocropus-ngraphs during matching). These weights are a tradeoff between how well your documents match your corpus, document quality, and recognizer quality.
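To make the points about line starts and sparsity concrete, here is a minimal sketch of a character n-gram model without back-off. It is not the ocropus-ngraphs implementation; the `START` marker, the function names, and the toy corpus are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

START = "\x02"  # hypothetical start-of-line marker (assumption, not OCRopus's symbol)

def train_char_ngrams(text, context_len):
    """Count next-character frequencies for each context of context_len characters."""
    counts = defaultdict(Counter)
    for line in text.splitlines():
        # Pad each line so its first characters are conditioned on the line start.
        padded = START * context_len + line
        for i in range(context_len, len(padded)):
            counts[padded[i - context_len:i]][padded[i]] += 1
    return counts

def prob(counts, context, char):
    """Maximum-likelihood estimate with no back-off: an unseen context gives 0."""
    seen = counts.get(context)
    if not seen:
        return 0.0
    return seen[char] / sum(seen.values())

def unseen_fraction(counts, text, context_len):
    """Fraction of characters in text that the model assigns zero probability."""
    total = missing = 0
    for line in text.splitlines():
        padded = START * context_len + line
        for i in range(context_len, len(padded)):
            total += 1
            if prob(counts, padded[i - context_len:i], padded[i]) == 0.0:
                missing += 1
    return missing / total if total else 0.0

corpus = "the cat sat\nthe dog sat\n"   # toy training text (assumption)
held_out = "a cat sat on the dog\n"     # toy text to be matched (assumption)
for context_len in (1, 2, 4):
    model = train_char_ngrams(corpus, context_len)
    print(context_len, round(unseen_fraction(model, held_out, context_len), 2))
```

In this sketch the zero-probability fraction can only stay the same or grow as the context gets longer, which is the sparsity problem in miniature. Because each line is padded separately, the same text with and without line breaks also yields different counts.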
Tom

On Thursday, August 23, 2012 10:51:56 PM UTC+2, Luciano Édipo wrote:
> I am creating a language model using OCRopus-ngraph, there must be some
> pre-processing or preparation of the set of text used to generate the
> model? Some indication about it?
