On Thursday, May 2, 2013 1:05:15 PM UTC+2, Andreas Romeyke wrote:

> Hello Tom,
>
> Thanks for your answer.
>
> OCRopus 0.7 doesn't need to be trained with individual characters, so you 
> don't really need the Tesseract training files. But you should be able to 
> use the scans that those files were derived from easily.
>
> Hmm, not really, because my Tesseract training pages are not split up 
> into single lines. Or could I train OCRopus with a whole page and its 
> corresponding text? The thing is, I would like to use one set of training 
> pages, without specific modifications, for both Tesseract and OCRopus.
>

The basic training input for OCRopus is text lines and their corresponding 
transcriptions. 
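
To illustrate what "text lines and corresponding transcriptions" could look like on disk, here is a minimal sketch that pairs each line image with a transcription file stored next to it. The suffixes `.bin.png` and `.gt.txt`, the directory layout, and the helper name `pair_lines_with_transcriptions` are assumptions made for this example; check them against the layout your OCRopus version expects.

```python
from pathlib import Path


def pair_lines_with_transcriptions(root, image_suffix=".bin.png", gt_suffix=".gt.txt"):
    """Collect (image_path, transcription) pairs for line images that have ground truth.

    The suffixes are assumptions for this sketch, not a confirmed OCRopus contract.
    """
    pairs = []
    for image in sorted(Path(root).rglob("*" + image_suffix)):
        # Swap the image suffix for the ground-truth suffix: 0001.bin.png -> 0001.gt.txt
        gt = image.with_name(image.name[: -len(image_suffix)] + gt_suffix)
        if gt.exists():
            text = gt.read_text(encoding="utf-8").strip()
            pairs.append((image, text))
    return pairs
```

Line images without a matching transcription file are skipped, so a partially transcribed set can still be listed.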

 

>   It should support long-s, but it doesn't encode it separately in the 
> output.
>
> That is a problem. I need the correct encoding of long-s. I want to preserve 
> the character 'ſ' in the output. It should not be substituted with 's'. Same 
> for »«, „“ and so on. But that should not be a problem if I train my own 
> models, right?
>

Yes, you can train your own models, but you need to generate ground truth 
containing that information. We don't usually do that because different 
sources treat these cases differently, so if we want to maximize training 
data, we just use the lowest common denominator text normalization.
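
As a rough illustration of the trade-off, the sketch below shows what a "lowest common denominator" normalization might do to ground truth, and how to list the characters it would collapse. The mapping table and both function names are hypothetical and chosen for this example only; this is not the normalization table OCRopus actually uses.

```python
# Hypothetical mapping for illustration; NOT OCRopus's actual normalization table.
LOWEST_COMMON_DENOMINATOR = str.maketrans({
    "ſ": "s",    # long s folded into round s
    "»": '"',    # guillemets folded into a plain double quote
    "«": '"',
    "„": '"',    # German-style quotes folded into a plain double quote
    "“": '"',
})


def normalize(text):
    """Apply the lossy normalization to a transcription."""
    return text.translate(LOWEST_COMMON_DENOMINATOR)


def collapsed_characters(text):
    """Return the distinct characters in text that normalize() would replace."""
    return sorted({ch for ch in text if ord(ch) in LOWEST_COMMON_DENOMINATOR})


print(normalize("„Waſſer“"))
print(collapsed_characters("„Waſſer“"))
```

To train a model that outputs 'ſ' and the original quotation marks, the ground truth would have to skip a step like `normalize` and keep those characters as transcribed.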

Tom 
