--- Begin Message ---
Package: ocrodjvu
Version: 0.7.9-1
Severity: normal
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
Hi,
I already reported the same problem against gscan2pdf. Tesseract occasionally
produces bytes that are not valid UTF-8. So far I have only tried German (deu).
Even if this turns out to be a bug in Tesseract, it would be good if ocrodjvu
could continue working.
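To make "continue working" concrete, here is a minimal sketch of what I have in
mind. It is not ocrodjvu's actual code: process_all and the process_page
callback are hypothetical names, and skipping a page is only one possible
policy.

```python
import logging

logger = logging.getLogger("ocr-sketch")

def process_all(pages, process_page):
    """Run process_page on every page and skip pages whose OCR output is unusable.

    Both names are hypothetical stand-ins, not part of ocrodjvu.
    """
    results = {}
    failed = []
    for number, page in enumerate(pages, start=1):
        try:
            results[number] = process_page(page)
        except (UnicodeDecodeError, ValueError) as exc:
            # UnicodeDecodeError is a subclass of ValueError; both are
            # listed because both show up in the tracebacks of this report.
            logger.warning("page %d skipped: %s", number, exc)
            failed.append(number)
    return results, failed
```

With this, a single bad page would be reported and left without a text layer,
while the remaining pages are still processed.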
First exception, without the --html5 option:
Traceback (most recent call last):
File "/usr/share/ocrodjvu/lib/cli/ocrodjvu.py", line 363, in page_thread
result = self.process_page(page)
File "/usr/share/ocrodjvu/lib/cli/ocrodjvu.py", line 343, in process_page
page_size=size
File "/usr/share/ocrodjvu/lib/engines/tesseract.py", line 216, in extract_text
return self._hocr.extract_text(stream, **kwargs)
File "/usr/share/ocrodjvu/lib/hocr.py", line 434, in extract_text
scan_result = scan(doc.find('/body'), settings)
File "/usr/share/ocrodjvu/lib/hocr.py", line 366, in scan
for zone in _scan(node, settings, settings.page_size):
File "/usr/share/ocrodjvu/lib/hocr.py", line 223, in _scan
return get_children(node)
File "/usr/share/ocrodjvu/lib/hocr.py", line 194, in get_children
result += _scan(child, settings, page_size)
File "/usr/share/ocrodjvu/lib/hocr.py", line 250, in _scan
children = get_children(node)
File "/usr/share/ocrodjvu/lib/hocr.py", line 194, in get_children
result += _scan(child, settings, page_size)
File "/usr/share/ocrodjvu/lib/hocr.py", line 250, in _scan
children = get_children(node)
File "/usr/share/ocrodjvu/lib/hocr.py", line 194, in get_children
result += _scan(child, settings, page_size)
File "/usr/share/ocrodjvu/lib/hocr.py", line 250, in _scan
children = get_children(node)
File "/usr/share/ocrodjvu/lib/hocr.py", line 194, in get_children
result += _scan(child, settings, page_size)
File "/usr/share/ocrodjvu/lib/hocr.py", line 250, in _scan
children = get_children(node)
File "/usr/share/ocrodjvu/lib/hocr.py", line 194, in get_children
result += _scan(child, settings, page_size)
File "/usr/share/ocrodjvu/lib/hocr.py", line 250, in _scan
children = get_children(node)
File "/usr/share/ocrodjvu/lib/hocr.py", line 191, in get_children
if node.text:
File "lxml.etree.pyx", line 897, in lxml.etree._Element.text.__get__ (src/lxml/lxml.etree.c:37022)
File "apihelpers.pxi", line 691, in lxml.etree._collectText (src/lxml/lxml.etree.c:16626)
File "apihelpers.pxi", line 1344, in lxml.etree.funicode (src/lxml/lxml.etree.c:21864)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 1: invalid start byte
And with the --html5 option:
Traceback (most recent call last):
File "/usr/share/ocrodjvu/lib/cli/ocrodjvu.py", line 363, in page_thread
result = self.process_page(page)
File "/usr/share/ocrodjvu/lib/cli/ocrodjvu.py", line 343, in process_page
page_size=size
File "/usr/share/ocrodjvu/lib/engines/tesseract.py", line 216, in extract_text
return self._hocr.extract_text(stream, **kwargs)
File "/usr/share/ocrodjvu/lib/hocr.py", line 416, in extract_text
doc = html5_support.parse(stream)
File "/usr/share/ocrodjvu/lib/html5_support.py", line 24, in parse
namespaceHTMLElements=False
File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 38, in parse
return p.parse(doc, encoding=encoding)
File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 211, in parse
parseMeta=parseMeta, useChardet=useChardet)
File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 111, in _parse
self.mainLoop()
File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 174, in mainLoop
self.phase.processCharacters(token)
File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 948, in processCharacters
self.tree.insertText(token["data"])
File "/usr/lib/pymodules/python2.7/html5lib/treebuilders/_base.py", line 288, in insertText
parent.insertText(data)
File "/usr/lib/pymodules/python2.7/html5lib/treebuilders/etree_lxml.py", line 225, in insertText
builder.Element.insertText(self, data, insertBefore)
File "/usr/lib/pymodules/python2.7/html5lib/treebuilders/etree.py", line 114, in insertText
self._element.text += data
File "lxml.etree.pyx", line 904, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:37110)
File "apihelpers.pxi", line 721, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:16855)
File "apihelpers.pxi", line 1366, in lxml.etree._utf8 (src/lxml/lxml.etree.c:22060)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
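One way both failures might be avoided is to clean the raw hOCR bytes before
they reach the parser. The following is only a sketch under my own assumptions:
sanitize_hocr is a hypothetical helper, not something ocrodjvu provides, and it
is written for Python 3 rather than the Python 2.7 shown in the tracebacks. It
replaces undecodable bytes with U+FFFD and drops characters that XML 1.0 does
not allow.

```python
import re

# Everything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF
_XML_ILLEGAL = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def sanitize_hocr(raw):
    """Decode raw hOCR bytes leniently and strip XML-incompatible characters.

    Hypothetical helper for illustration only.
    """
    # Invalid UTF-8 sequences become U+FFFD instead of raising.
    text = raw.decode("utf-8", errors="replace")
    # NUL bytes and other control characters are removed.
    return _XML_ILLEGAL.sub("", text)
```

The replacement character keeps the damage visible in the resulting text layer,
which seems preferable to silently dropping the bytes.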
I checked that the corresponding HTML files did indeed contain non-UTF-8
characters.
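Such a check can be scripted. This is a small sketch, not necessarily how the
files above were inspected; find_invalid_utf8 is a name made up for this
example.

```python
def find_invalid_utf8(data):
    """Return (offset, byte) for the start of every undecodable sequence in data."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data is valid UTF-8
        except UnicodeDecodeError as exc:
            offset = pos + exc.start
            bad.append((offset, data[offset]))
            # Resume scanning after the sequence the decoder rejected.
            pos += exc.end
    return bad
```

For a file, the input would be the result of open(path, "rb").read(); an empty
list means the whole file decodes cleanly.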
Best regards,
Thomas Koch
- -- System Information:
Debian Release: wheezy/sid
APT prefers testing
APT policy: (500, 'testing')
Architecture: amd64 (x86_64)
Kernel: Linux 3.2.0-2-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Versions of packages ocrodjvu depends on:
ii djvulibre-bin 3.5.25.2-4
ii python 2.7.2-10
ii python-argparse 1.2.1-2
ii python-djvu 0.3.9-1
ii python2.7 [python-argparse] 2.7.3~rc2-2.1
Versions of packages ocrodjvu recommends:
ii ocropus <none>
ii python-lxml 2.3.2-1
ii python-pyicu 1.3-1
ii tesseract-ocr 3.02.01-4
Versions of packages ocrodjvu suggests:
pn cuneiform <none>
pn gocr <none>
pn ocrad <none>
- -- no debconf information
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAEBCAAGBQJPpslmAAoJEAf8SJEEK6Za6sgQALourAgqH2xqSPjFpvVoYeZ+
k65JAdaIC2rEYoOF4hSt5wt/fIyAWMLZlaZX3pFkroVazfcyqFK2lYbmiT0x+q/9
XDuKqoaxISmCZ7QF1YJqGNnT36s98HW5VP6aupSezETCCyZOqgLd+aXVkwFi7NDW
9ItKvzWySfIU3HzOF1xxjipJYu9/698rb+DBUUd0ilmIdLJx+x3wT+gFBWbA5xJr
24m3kdJrox86zKANBRzlznpSRUZNIGKjJS8Y5M/JDpKlK9NVYOgjRBwZoq0JAyLx
cgM4MYaUDxOw4BcTFCpwi5CjB9DDKcFhaCV2EeUtxur4pDrZs9sAGmDdqenfptap
Dt9fcaW9GFs8mNbnf9cjrOOtL1f9o2CDZBi2MQ5RdnwjIPpRH1jKxOYL0Mq44PUA
E2MqaVdRU1at009luvLVy/PQntzZrualByzcboOEkh7TUjjfQSjBc7k3piKwie1b
q+wRbNu0Ifz5jJqTzKRk2pdNOviJTWuV3LlbMdM0l0pHbpumFs0lunUaulvTS+zy
ttG52UpC2m+6ngJ81v+cbmv3uXF3N8Kp7LwlcyKxgROcKi8T+y30HfUjRUrhna2B
xxaUtaCysSmc84pS3ps29XHv7B6TwQQ9kyV7d/nf28r9CmDMevn6S+Wz8KDFAAeL
EvOww4b8vJ7IOS66394u
=WLq+
-----END PGP SIGNATURE-----
--- End Message ---