[ocropus] Re: OCRopus 0.5.4 and UTF-8 encoding

c.kruk Wed, 25 Jul 2012 16:08:22 -0700

 

Thank you for the information, Tom.




 While waiting for the answer, I tried Tesseract and compared it with 
OCRopus. Here is a report of my experience and conclusions.



 I prepared a LaTeX template file that contains text samples in five 
languages (English, French, German, Polish, and Russian) and uses seven 
TeX font families, series, and shapes at 10 pt: rm, sf, tt, bf, it, sl, 
and sc. I also prepared a shell script that generates a separate LaTeX 
file for each language from that template, processes them to DVI files, 
converts the DVI files to GIF images, and finally runs Tesseract on the 
GIF files. By default the script prepares the GIF files at seven 
different resolutions, from 150 dpi to 450 dpi.
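
 The pipeline described above could be sketched roughly as follows. This 
is not the original shell script: the file names, the Tesseract language 
codes, and the 50 dpi step between resolutions are assumptions made for 
illustration. The sketch only builds the command lines, so they can be 
inspected before anything is run.

```python
from pathlib import Path

# Assumed values: the message names the languages and the 150-450 dpi range,
# but not the language codes, the step between resolutions, or any file names.
LANGUAGES = {"english": "eng", "french": "fra", "german": "deu",
             "polish": "pol", "russian": "rus"}
RESOLUTIONS = list(range(150, 451, 50))  # seven values, assuming a 50 dpi step


def build_commands(language, workdir=Path(".")):
    """Return the commands for one language: LaTeX once, then one
    DVI-to-GIF conversion and one Tesseract run per resolution."""
    tex = workdir / f"sample-{language}.tex"   # hypothetical file name
    dvi = tex.with_suffix(".dvi")
    commands = [["latex", str(tex)]]
    for dpi in RESOLUTIONS:
        gif = workdir / f"sample-{language}-{dpi}.gif"
        out_base = gif.with_suffix("")         # Tesseract appends .txt itself
        commands.append(["dvigif", "-D", str(dpi), "-o", str(gif), str(dvi)])
        commands.append(["tesseract", str(gif), str(out_base),
                         "-l", LANGUAGES[language]])
    return commands


def build_all(workdir=Path(".")):
    """Commands for every language, in execution order."""
    return [cmd for language in LANGUAGES
            for cmd in build_commands(language, workdir)]
```

 Each command list can then be passed to subprocess.run(cmd, check=True) 
in the directory that holds the generated LaTeX files.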



 Then I ran the script and compared the results. Overall, 300 dpi gives 
the best results, although some characters are recognized better at other 
resolutions. It is no surprise that the English text gives the best 
results. The results for German are slightly worse than for English, and 
the results for Polish and Russian are slightly worse than for German. 
The results for French are much worse than those for Polish and Russian.



 The results for Polish and Russian are not identical, though. The 
quality of recognition is similar for both languages, but Latin-script 
text and some digits adjacent to the Russian text are recognized by 
Tesseract as Russian text. For example, the string “OHamburgefonsz” was 
recognized as “ОНатЬиг5е1соп$2” or “ОНаШЬиг3еҐопЅ2”, or something else; 
the digit “6” was sometimes recognized as the Russian letter “б”; and in 
the worst case, the tt font, the string “1234567890” was recognized as 
“12з45в7зэо”.
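
 Misreadings of this kind can be spotted mechanically, because the 
intruding characters belong to a different Unicode script. The following 
is a minimal sketch, not part of my test script; the function names are 
made up for illustration. It relies on the fact that the Unicode name of 
every Cyrillic letter starts with "CYRILLIC".

```python
import unicodedata


def cyrillic_positions(text):
    """Return (index, character) pairs for every Cyrillic letter in text."""
    return [(i, ch) for i, ch in enumerate(text)
            if unicodedata.name(ch, "").startswith("CYRILLIC")]


def looks_misread(expected, recognized):
    """True if the source had no Cyrillic letters but the OCR output does."""
    return (not cyrillic_positions(expected)
            and bool(cyrillic_positions(recognized)))
```

 For the tt example, looks_misread("1234567890", "12з45в7зэо") is True, 
and cyrillic_positions lists the five Cyrillic letters that replaced 
digits.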



 Because you have started training OCRopus on German, I’ll show you the 
results Tesseract produced for the German text.



 I used the following text:



 FALSCHES ÜBEN VON XYLOPHONMUSIK QUÄLT JEDEN GRÖSSEREN ZWERG.

falsches üben von xylophonmusik quält jeden grösseren zwerg.

OHamburgefonsz 1234567890 !@*()=+[]|;:,./? - #$%&_{} ~^ <> \ " ` ' ‘’«»“„”



 Tesseract, analyzing the 300 dpi GIF with the seven TeX fonts mentioned 
above, produced the following result:



 1 rm

FALSCHES ÜBEN VON XYLOPHONMUSIK QUALT JEDEN GRÖSSEREN ZWERG.

falsches üben von Xylophonmusik quält jeden grösseren Zwerg.

()Han1burgef0nsZ 1234567890 l@*():+[]|;:,./? - #$%&_{} M <> \ " ` ' “<<›>“„”

2 sf

FALSCHES UBEN VON XYLOPHONIVIUSIK QUALT JEDEN GROSSEREN ZWERG.

falsches üben von xylophonmusik quält jeden grösseren Zwerg.

OHamburgefonsz 1234567890 !©*()=+[]|;:,./? - #$%&_{} M <> \ " ` ' "<<>>“„"

3 tt

FALSCHES ÜBEN VON XYLOPHONMUSIK QUALT JEDEN GRÜSSEREN zwERG.

falsches üben von xylophonmusik quält jeden grösseren zwerg.

onamburgefonsz 1234567890 a@*()=+[]|;=,./? _ #$'7„&_{} "^ <> \ " ` ' “<<›>“„”

4 bf

FALSCHES ÜBEN VON XYLOPHONMUSIK QUÄLT JEDEN GRÖSSEREN ZWERG

falsches üben von xylophonmusik quält jeden grösseren Zwerg.

OHamburgefonsz 1234567890 !@*()=+[]|;:,./7 f #$%&_{} M <> \ " ` ' ”<<››“„”

5 it

FALSCHES UBEN VON XYLOPHONMUSJK QUALT JEDEN GROSSEREN ZWERG.

falsches üben von xylophonmusik quält jeden grösseren zwerg.

OHamburgef0nsz 1234567890 /@*():+[//;:,./? - #$%@§_ M <> \ ” ` ' ”<«››“„”

6 sl

FALSCHES UBEN VON XYLOPHONMUSIK QUALT JEDEN GROSSEREN ZWERG.

falsches üben von Xylophonmusík quält jeden grösseren Zwerg.

OHa1nburgef0nsz 1234567890 .'@*():+[]/;:,./? - #$%&_ M <> \ " ` ' ”<<›>“„”

7 sc

FALSCHES ÜBEN VON XYLOPHONMUSIK QUÄLT JEDEN GROSSEREN ZWERG.

FALSCHES UBEN VON XYLOPHONMUSIK QUALT JEDEN GROSSEREN ZWERG.

OHAMBURGEEONSZ 1234567890 !@*():+[]|;:,./? - #$%&:_{} “^ <> \ " ` ' “<<>>“„”



 The hardest part is of course the string of punctuation marks. It is 
partially language dependent, because Tesseract interprets that string 
differently depending on the training data used. The string 
“OHamburgefonsz” was recognized without any errors only twice; this too 
is partially language dependent (for most languages other than Russian, 
Tesseract recognized that string correctly three times). As for the 
German text, the biggest problem is uppercase letters with diacritical 
marks, including those in the small-caps font.
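
 One way to put a number on comparisons like these is the edit 
(Levenshtein) distance between the reference line and the recognized 
line. This is only a sketch of a common technique, not something my 
script does; the function names are made up for illustration.

```python
def edit_distance(reference, recognized):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning one string into
    the other."""
    previous = list(range(len(recognized) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        current = [i]
        for j, rec_ch in enumerate(recognized, start=1):
            cost = 0 if ref_ch == rec_ch else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution/match
        previous = current
    return previous[-1]


def character_error_rate(reference, recognized):
    """Edit distance divided by the reference length (0.0 means identical)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, recognized) / len(reference)
```

 For example, edit_distance("XYLOPHONMUSIK", "XYLOPHONIVIUSIK") is 3 for 
the sf result above, because the single “M” became “IVI”.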



 At present, from my point of view, Tesseract has six advantages over 
OCRopus:



 1. It understands many languages, including Polish, while OCRopus is 
still being trained on German.



 2. It handles different fonts without training, while OCRopus fails 
spectacularly on unknown fonts. Because that requires more explanation, I 
describe it after this list.



 3. It works much faster (for example, Tesseract processed the alice.png 
file in 1.8 s, while OCRopus took 1 min 42.6 s, about 57 times longer).



 4. It always works (for example, Tesseract correctly processed GIF and 
PNG files prepared with dvigif and dvipng from the same DVI file, while 
OCRopus aborted in both cases: it displayed the error message 
“ValueError: cannot convert float NaN to integer” for the GIF file and 
“AssertionError: input image is not binary” for the PNG file).



 5. It can be installed on different distributions (I use Slackware), 
while OCRopus uses an install script tailored to Debian and its 
derivatives, so in order to test it I had to switch to Linux Mint.



 6. It works slightly better than OCRopus even on simple text such as the 
alice.png sample file (Tesseract recognized the text without flaws, while 
OCRopus changed some lowercase “w” letters into uppercase “W”).



 As for font recognition, I tested both Tesseract and OCRopus on the same 
English text set in the seven TeX fonts mentioned above:



 THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG.

the quick brown fox jumps over the lazy dog.

OHamburgefonsz 1234567890 !@*()=+[]|;:,./? - #$%&_{} ~^ <> \ " ` ' ‘’«»“„”



 Tesseract recognized it as:



 1 rm

THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG.

the quick brown fox jumps over the lazy dog.

OHamburgef0nsZ 1234567890 !@*():+[]|;:,./? — #$%&_{} H <> \ " \ ' "<<>>“,,”



 OCRopus made a complete mess of it:



 lHE dflICR BHO/AZ LOX TflJIL8 OAEH lHE IvNz Doc.

lHE (SCICR BHOH./ EO2 iCyIb8 OAEH lHE FvNz Doc.

i RC

'it; a o ff 'a

o[UmpnL8ceIlRx Ii292-) i(U)m(]=-ii(:.'., .- :Rrr ---- \<r /xi/if'abf,;'



 I believe future OCRopus versions will keep getting better, so I’ll 
follow the development of the program.



 Thank you for your assistance once again.


