Package: tesseract-ocr
Version: 3.03.03-1
Control: affects -1 ocrodjvu
Tesseract sometimes produces hOCR output containing an unescaped ampersand,
which makes the whole XHTML file ill-formed:
$ tesseract -l deu-frak test.png test hocr
Tesseract Open Source OCR Engine v3.03 with Leptonica
Too few characters. Skipping this page
OSD: Weak margin (0.00) for 0 blob text block, but using orientation anyway: 0
$ xmllint --valid test.hocr
test.hocr:16: parser error : EntityRef: expecting ';'
='word_1_2' title='bbox 211 15 276 64; x_wconf 87' lang='deu-frak' dir='ltr'>&c.
^
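As a stopgap on the consumer side, the stray ampersand can be escaped before the file is parsed. The following is only a minimal sketch of that workaround, not a fix for tesseract itself: the helper name escape_bare_ampersands is hypothetical, and the span below is a simplified stand-in for the offending word element, not the actual line from test.hocr.

```python
import re
import xml.etree.ElementTree as ET

# Matches "&" that does not start a well-formed entity or character
# reference such as "&amp;", "&#38;" or "&#x26;".
_BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(markup):
    """Replace each stray "&" with "&amp;", leaving existing references alone."""
    return _BARE_AMP.sub("&amp;", markup)


# Simplified stand-in for the word element that xmllint rejects.
broken = "<span class='ocrx_word' id='word_1_2'>&c.</span>"

fixed = escape_bare_ampersands(broken)
word = ET.fromstring(fixed)  # parses only after the ampersand is escaped
print(fixed)
print(word.text)
```

The pattern is deliberately conservative: anything that already looks like a complete reference is left untouched, so running it twice does not double-escape. It cannot tell a genuine reference from recognized text that merely looks like one, so the proper fix is still for tesseract to escape the text it writes.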
-- System Information:
Debian Release: 8.0
APT prefers unstable
APT policy: (990, 'unstable'), (500, 'experimental')
Architecture: i386 (x86_64)
Foreign Architectures: amd64
Kernel: Linux 3.2.0-4-amd64 (SMP w/2 CPU cores)
Locale: LANG=C, LC_CTYPE=pl_PL.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: sysvinit (via /sbin/init)
Versions of packages tesseract-ocr depends on:
ii  libc6                2.19-13
ii  libcairo2            1.14.0-2.1
ii  libgcc1              1:4.9.2-10
ii  libglib2.0-0         2.42.1-1
ii  libicu52             52.1-6
ii  liblept4             1.71-2.1+b2
ii  libpango-1.0-0       1.36.8-3
ii  libpangocairo-1.0-0  1.36.8-3
ii  libpangoft2-1.0-0    1.36.8-3
ii  libstdc++6           4.9.2-10
ii  libtesseract3        3.03.03-1
ii  tesseract-ocr-eng    3.02-2
ii  tesseract-ocr-equ    3.02-2
ii  tesseract-ocr-osd    3.02-2
--
Jakub Wilk