Bug#774654: tesseract-ocr: unescaped ampersand in hOCR output

Jakub Wilk Mon, 05 Jan 2015 12:15:38 -0800

Package: tesseract-ocr
Version: 3.03.03-1
Control: affects -1 ocrodjvu

Tesseract sometimes produces hOCR with an unescaped ampersand (making the whole XHTML file ill-formed):


$ tesseract -l deu-frak test.png test hocr
Tesseract Open Source OCR Engine v3.03 with Leptonica
Too few characters. Skipping this page
OSD: Weak margin (0.00) for 0 blob text block, but using orientation anyway: 0

$ xmllint --valid test.hocr
test.hocr:16: parser error : EntityRef: expecting ';'
='word_1_2' title='bbox 211 15 276 64; x_wconf 87' lang='deu-frak' dir='ltr'>&c.
                                                                              ^
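
For illustration, here is a minimal Python sketch of the escaping the hOCR writer would need to apply to recognized text before embedding it in XHTML. It is not Tesseract's actual code (Tesseract is written in C++); the helper names hocr_word_span and is_well_formed are hypothetical, and the span markup is simplified from the output above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def hocr_word_span(word_id, text):
    """Build a simplified hOCR word element (hypothetical helper).

    escape() replaces '&', '<' and '>' with the corresponding
    entities, so the recognized text cannot break the markup.
    """
    return "<span class='ocrx_word' id='{}'>{}</span>".format(
        word_id, escape(text)
    )


def is_well_formed(fragment):
    """Return True if the fragment parses as well-formed XML."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True


# The recognized text from the report, embedded without escaping.
raw = "<span class='ocrx_word' id='word_1_2'>&c.</span>"

# The same text passed through the escaping helper.
fixed = hocr_word_span("word_1_2", "&c.")

print(is_well_formed(raw))
print(is_well_formed(fixed))
print(fixed)
```

The unescaped fragment is rejected by the XML parser, much as xmllint rejects test.hocr, while the escaped one carries the text as &amp;c. and parses cleanly.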


-- System Information:
Debian Release: 8.0
 APT prefers unstable
 APT policy: (990, 'unstable'), (500, 'experimental')
Architecture: i386 (x86_64)
Foreign Architectures: amd64

Kernel: Linux 3.2.0-4-amd64 (SMP w/2 CPU cores)
Locale: LANG=C, LC_CTYPE=pl_PL.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: sysvinit (via /sbin/init)

Versions of packages tesseract-ocr depends on:
ii  libc6                2.19-13
ii  libcairo2            1.14.0-2.1
ii  libgcc1              1:4.9.2-10
ii  libglib2.0-0         2.42.1-1
ii  libicu52             52.1-6
ii  liblept4             1.71-2.1+b2
ii  libpango-1.0-0       1.36.8-3
ii  libpangocairo-1.0-0  1.36.8-3
ii  libpangoft2-1.0-0    1.36.8-3
ii  libstdc++6           4.9.2-10
ii  libtesseract3        3.03.03-1
ii  tesseract-ocr-eng    3.02-2
ii  tesseract-ocr-equ    3.02-2
ii  tesseract-ocr-osd    3.02-2

--
Jakub Wilk
