On Thursday, May 2, 2013 1:05:15 PM UTC+2, Andreas Romeyke wrote:
> Hello Tom,
>
> Thanks for your answer.
>
>> OCRopus 0.7 doesn't need to be trained with individual characters, so you
>> don't really need the Tesseract training files. But you should be able to
>> easily use the scans that those files were derived from.
>
> Hmm, not really, because my Tesseract training pages are not split up
> into single lines. Or could I train OCRopus with a whole page and the
> corresponding text? The thing is, I would like to use one set of training
> pages, without tool-specific modifications, for both Tesseract and OCRopus.

The basic training input for OCRopus is text lines and their corresponding transcriptions.

>> It should support long-s, but it doesn't encode it separately in the
>> output.
>
> That is a problem. I need the correct encoding of long-s. I want to preserve
> the character 'ſ' in the output. It should not be substituted with 's'. The
> same goes for »«, „“ and so on. But that should not be a problem if I train
> my own models, right?

Yes, you can train your own models, but you need to generate ground truth containing that information. We don't usually do that because different sources treat these cases differently, so if we want to maximize training data, we just use the lowest common denominator text normalization.

Tom
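
To make the ground-truth point concrete, here is a rough sketch of the two normalization policies discussed above: one that flattens long-s and typographic quotes to a lowest common denominator, and one that keeps them so a model can learn to emit them. The mapping table and both function names are assumptions made up for illustration; this is not the normalization code OCRopus actually ships.

```python
import unicodedata

# Hypothetical "lowest common denominator" table (an assumption for
# illustration, not the mapping used by OCRopus itself).
FLATTEN = str.maketrans({
    "ſ": "s",   # long s -> round s
    "»": '"',   # guillemets -> plain double quote
    "«": '"',
    "„": '"',   # German low/high quotes -> plain double quote
    "“": '"',
})


def flatten_transcription(text: str) -> str:
    """Collapse historical glyphs and typographic quotes to plain forms."""
    return unicodedata.normalize("NFC", text).translate(FLATTEN)


def preserve_transcription(text: str) -> str:
    """Keep glyph distinctions; apply canonical composition (NFC) only."""
    return unicodedata.normalize("NFC", text)


line = "„Waſſer“ und »Geſetz«"
print(flatten_transcription(line))   # long s and quotes are folded away
print(preserve_transcription(line))  # 'ſ', „“ and »« survive unchanged
```

One thing to watch if you write your own preprocessing: Unicode compatibility normalization (NFKC or NFKD) folds 'ſ' into 's' by itself, so ground truth that should keep long-s needs to stay with NFC or NFD.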
