I prepared a few sample PNG files containing Polish-language text set in 
different TeX fonts. I processed them with OCRopus and found that the program 
ignores all diacritic characters, replacing them with similar-looking ASCII 
characters. For example, the phrase "pójdź kińże tę chmurność w głąb 
flaszy" is rendered as "pdjd2 kih2e tg ChmurnosC W glqb flaszy". 
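
To make the substitutions explicit, the comparison can be sketched in a few 
lines of Python. This is only an illustration and not part of OCRopus: the 
function names non_ascii_chars and lost_diacritics are made up for this 
example, and the two strings are the sample phrase and the OCR result quoted 
in this message.

```python
import unicodedata

# Sample phrase and the OCR result quoted in this message.
# NFC normalization keeps each accented letter as a single code point.
expected = unicodedata.normalize("NFC", "pójdź kińże tę chmurność w głąb flaszy")
ocr_output = "pdjd2 kih2e tg ChmurnosC W glqb flaszy"


def non_ascii_chars(text):
    """Return the set of non-ASCII characters in text (after NFC normalization)."""
    normalized = unicodedata.normalize("NFC", text)
    return {ch for ch in normalized if ord(ch) > 127}


def lost_diacritics(reference, recognized):
    """Return non-ASCII characters of reference that never appear in recognized."""
    return sorted(non_ascii_chars(reference) - non_ascii_chars(recognized))


lost = lost_diacritics(expected, ocr_output)
print("Missing from the OCR output:", " ".join(lost))

# Both strings have the same length here, so they can be compared
# position by position to list every character the OCR changed.
substitutions = [(e, a) for e, a in zip(expected, ocr_output) if e != a]
for e, a in substitutions:
    print(repr(e), "->", repr(a))
```

The second loop also reports plain case changes such as 'c' -> 'C', so it 
shows everything that differs, not only the lost diacritics.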


I read a little about previous OCRopus versions, which used the Tesseract 
program, and I learned that UTF-8 recognition was one of the biggest 
advantages of these applications. The new OCRopus is still poorly documented, 
so I don't know why it ignores UTF-8 encoded characters.


I use the simple 'ocropus file.png' command. What should I do to make 
OCRopus output UTF-8? Should I use some switch on the command line? Should 
I train OCRopus on the TeX fonts? Or should I install some additional 
packages? 


I have no idea what I could do. Any help will be welcome.

