I prepared a few sample PNG files containing Polish-language text set in different TeX fonts. I processed them with OCRopus and found that the program ignores all diacritic characters, replacing them with similar-looking ASCII characters. For example, the phrase "pójdź kińże tę chmurność w głąb flaszy" is rendered as "pdjd2 kih2e tg ChmurnosC W glqb flaszy".
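To make the damage concrete, here is a small Python sketch of my own. It does not call OCRopus at all; it only compares the two strings quoted above, and the names `non_ascii_chars` and `lost` are just illustrative. It lists which non-ASCII characters from the expected phrase never appear in the OCR output.

```python
import unicodedata

# The two strings quoted above: what the image contains and what OCRopus returned.
expected = unicodedata.normalize("NFC", "pójdź kińże tę chmurność w głąb flaszy")
ocr_output = unicodedata.normalize("NFC", "pdjd2 kih2e tg ChmurnosC W glqb flaszy")


def non_ascii_chars(text):
    """Return the distinct non-ASCII characters in text, sorted by code point."""
    return sorted({ch for ch in text if ord(ch) > 127})


# Characters present in the expected text but absent from the OCR result.
lost = [ch for ch in non_ascii_chars(expected) if ch not in ocr_output]

for ch in lost:
    print(ch, unicodedata.name(ch))
```

For the phrase above it reports all nine lowercase Polish diacritic letters as missing, and the OCR output contains no non-ASCII characters at all.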
I read a little about the earlier OCRopus versions, which used Tesseract, and learned that UTF-8 recognition was one of the biggest advantages of these applications. The new OCRopus is still poorly documented, so I don't know why it ignores UTF-8 encoded characters. I use the simple 'ocropus file.png' command. What should I do to make OCRopus output UTF-8? Should I pass some command-line switch? Should I train OCRopus on the TeX fonts? Or should I install some additional packages? I have no idea what I could do. Any help would be welcome.
