Bug#671764: marked as done (ocrodjvu crashes on non-utf8 from tesseract)

Debian Bug Tracking System Sun, 16 Feb 2014 01:25:23 -0800

Your message dated Sun, 16 Feb 2014 09:21:08 +0000
with message-id <[email protected]>
and subject line Bug#671764: fixed in ocrodjvu 0.7.17-1
has caused the Debian Bug report #671764,
regarding ocrodjvu crashes on non-utf8 from tesseract
to be marked as done.


This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact [email protected]
immediately.)


-- 
671764: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=671764
Debian Bug Tracking System
Contact [email protected] with problems

--- Begin Message ---

Package: ocrodjvu
Version: 0.7.9-1
Severity: normal

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Hi,

I already reported the same problem against gscan2pdf. Tesseract produces
non-UTF-8 characters from time to time. I have tried only German (deu) so far.
Even if this seems to be an error in Tesseract, it would be good if ocrodjvu
could continue working.

First exception, without the --html5 option:

Traceback (most recent call last):
  File "/usr/share/ocrodjvu/lib/cli/ocrodjvu.py", line 363, in page_thread
    result = self.process_page(page)
  File "/usr/share/ocrodjvu/lib/cli/ocrodjvu.py", line 343, in process_page
    page_size=size
  File "/usr/share/ocrodjvu/lib/engines/tesseract.py", line 216, in extract_text
    return self._hocr.extract_text(stream, **kwargs)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 434, in extract_text
    scan_result = scan(doc.find('/body'), settings)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 366, in scan
    for zone in _scan(node, settings, settings.page_size):
  File "/usr/share/ocrodjvu/lib/hocr.py", line 223, in _scan
    return get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 194, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 250, in _scan
    children = get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 194, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 250, in _scan
    children = get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 194, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 250, in _scan
    children = get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 194, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 250, in _scan
    children = get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 194, in get_children
    result += _scan(child, settings, page_size)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 250, in _scan
    children = get_children(node)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 191, in get_children
    if node.text:
  File "lxml.etree.pyx", line 897, in lxml.etree._Element.text.__get__ (src/lxml/lxml.etree.c:37022)
  File "apihelpers.pxi", line 691, in lxml.etree._collectText (src/lxml/lxml.etree.c:16626)
  File "apihelpers.pxi", line 1344, in lxml.etree.funicode (src/lxml/lxml.etree.c:21864)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 1: invalid start byte

And with the --html5 option:

Traceback (most recent call last):
  File "/usr/share/ocrodjvu/lib/cli/ocrodjvu.py", line 363, in page_thread
    result = self.process_page(page)
  File "/usr/share/ocrodjvu/lib/cli/ocrodjvu.py", line 343, in process_page
    page_size=size
  File "/usr/share/ocrodjvu/lib/engines/tesseract.py", line 216, in extract_text
    return self._hocr.extract_text(stream, **kwargs)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 416, in extract_text
    doc = html5_support.parse(stream)
  File "/usr/share/ocrodjvu/lib/html5_support.py", line 24, in parse
    namespaceHTMLElements=False
  File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 38, in parse
    return p.parse(doc, encoding=encoding)
  File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 211, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 111, in _parse
    self.mainLoop()
  File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 174, in mainLoop
    self.phase.processCharacters(token)
  File "/usr/lib/pymodules/python2.7/html5lib/html5parser.py", line 948, in processCharacters
    self.tree.insertText(token["data"])
  File "/usr/lib/pymodules/python2.7/html5lib/treebuilders/_base.py", line 288, in insertText
    parent.insertText(data)
  File "/usr/lib/pymodules/python2.7/html5lib/treebuilders/etree_lxml.py", line 225, in insertText
    builder.Element.insertText(self, data, insertBefore)
  File "/usr/lib/pymodules/python2.7/html5lib/treebuilders/etree.py", line 114, in insertText
    self._element.text += data
  File "lxml.etree.pyx", line 904, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:37110)
  File "apihelpers.pxi", line 721, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:16855)
  File "apihelpers.pxi", line 1366, in lxml.etree._utf8 (src/lxml/lxml.etree.c:22060)
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

I checked that the corresponding HTML files did indeed contain non-UTF-8
characters.
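
To illustrate the kind of tolerance requested here, the following is a minimal
sketch in Python 3, not the change that went into ocrodjvu 0.7.17. It assumes
the hOCR output is available as raw bytes before it reaches the parser. The
names sanitize_hocr and _XML_ILLEGAL are hypothetical and do not come from
ocrodjvu. Invalid UTF-8 sequences are replaced with U+FFFD, and the C0 control
characters that XML 1.0 forbids are dropped, which addresses the conditions
behind both tracebacks above.

```python
import re

# Hypothetical helper, not part of ocrodjvu.
# C0 control characters that XML 1.0 does not allow (tab, LF and CR are legal).
_XML_ILLEGAL = re.compile('[\x00-\x08\x0b\x0c\x0e-\x1f]')


def sanitize_hocr(raw):
    """Decode raw hOCR bytes leniently and drop XML-incompatible characters."""
    # errors='replace' turns each undecodable byte sequence into U+FFFD
    # instead of raising UnicodeDecodeError.
    text = raw.decode('utf-8', errors='replace')
    # Remove NUL and other control characters that lxml refuses to store.
    return _XML_ILLEGAL.sub('', text)


# 0xab is not a valid UTF-8 start byte; 0x00 is not allowed in XML text.
sample = b'<span>Gr\xabe\x00</span>'
clean = sanitize_hocr(sample)
```

The cleaned string could then be passed to the HTML parser in place of the raw
stream. Replacing rather than dropping the bad bytes keeps the damaged spot
visible in the OCR text.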

Best regards,

Thomas Koch

- -- System Information:
Debian Release: wheezy/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 3.2.0-2-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages ocrodjvu depends on:
ii  djvulibre-bin                3.5.25.2-4
ii  python                       2.7.2-10
ii  python-argparse              1.2.1-2
ii  python-djvu                  0.3.9-1
ii  python2.7 [python-argparse]  2.7.3~rc2-2.1

Versions of packages ocrodjvu recommends:
ii  ocropus        <none>
ii  python-lxml    2.3.2-1
ii  python-pyicu   1.3-1
ii  tesseract-ocr  3.02.01-4

Versions of packages ocrodjvu suggests:
pn  cuneiform  <none>
pn  gocr       <none>
pn  ocrad      <none>

- -- no debconf information

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iQIcBAEBCAAGBQJPpslmAAoJEAf8SJEEK6Za6sgQALourAgqH2xqSPjFpvVoYeZ+
k65JAdaIC2rEYoOF4hSt5wt/fIyAWMLZlaZX3pFkroVazfcyqFK2lYbmiT0x+q/9
XDuKqoaxISmCZ7QF1YJqGNnT36s98HW5VP6aupSezETCCyZOqgLd+aXVkwFi7NDW
9ItKvzWySfIU3HzOF1xxjipJYu9/698rb+DBUUd0ilmIdLJx+x3wT+gFBWbA5xJr
24m3kdJrox86zKANBRzlznpSRUZNIGKjJS8Y5M/JDpKlK9NVYOgjRBwZoq0JAyLx
cgM4MYaUDxOw4BcTFCpwi5CjB9DDKcFhaCV2EeUtxur4pDrZs9sAGmDdqenfptap
Dt9fcaW9GFs8mNbnf9cjrOOtL1f9o2CDZBi2MQ5RdnwjIPpRH1jKxOYL0Mq44PUA
E2MqaVdRU1at009luvLVy/PQntzZrualByzcboOEkh7TUjjfQSjBc7k3piKwie1b
q+wRbNu0Ifz5jJqTzKRk2pdNOviJTWuV3LlbMdM0l0pHbpumFs0lunUaulvTS+zy
ttG52UpC2m+6ngJ81v+cbmv3uXF3N8Kp7LwlcyKxgROcKi8T+y30HfUjRUrhna2B
xxaUtaCysSmc84pS3ps29XHv7B6TwQQ9kyV7d/nf28r9CmDMevn6S+Wz8KDFAAeL
EvOww4b8vJ7IOS66394u
=WLq+
-----END PGP SIGNATURE-----

--- End Message ---

--- Begin Message ---

Source: ocrodjvu
Source-Version: 0.7.17-1

We believe that the bug you reported is fixed in the latest version of
ocrodjvu, which is due to be installed in the Debian FTP archive.

A summary of the changes between this version and the previous one is
attached.

Thank you for reporting the bug, which will now be closed.  If you
have further comments please address them to [email protected],
and the maintainer will reopen the bug report if appropriate.

Debian distribution maintenance software
pp.
Jackson Doak <[email protected]> (supplier of updated ocrodjvu package)

(This message was generated automatically at their request; if you
believe that there is a problem with it please contact the archive
administrators by mailing [email protected])


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Format: 1.8
Date: Sun, 16 Feb 2014 07:48:51 +1100
Source: ocrodjvu
Binary: ocrodjvu
Architecture: source all
Version: 0.7.17-1
Distribution: unstable
Urgency: low
Maintainer: Daniel Stender <[email protected]>
Changed-By: Jackson Doak <[email protected]>
Description: 
 ocrodjvu   - tool to perform OCR on DjVu documents
Closes: 671764 672489 707806
Changes: 
 ocrodjvu (0.7.17-1) unstable; urgency=low
 .
   * Team upload.
 .
   [ Daniel Stender ]
   * New upstream release (closes: #671764, LP: #1108387).
   * Bumped debhelper to 9 (deb/control and deb/compat).
   * debian/changelog: extended copyrights to 2013.
   * debian/control:
     + bumped standards to 3.9.4 (no changes needed).
     + added html5lib to Recommends (closes: #672489).
     + changed X-Python-Version to 2.6.
     + dropped unnecessary version from python-djvu dep.
 .
   [ Jakub Wilk ]
   * Use canonical URIs for Vcs-* fields.
   * Use "python-all (>= 2.7.3-5)" (that is, python-all that have only 2.7 as
     supported version) as an alternative to python-argparse in Build-Depends
     (closes: #707806). Thanks to Luca Falavigna for the bug report.
 .
   [ Jackson Doak ]
   * New upstream release (0.7.17)
   * debian/control: Bump standards to 3.9.5 (no changes)
Checksums-Sha1: 
 44a7cae1c504dec48fc3024ee01d72e4b87f210b 2155 ocrodjvu_0.7.17-1.dsc
 f908206f7e4bc2e0a3fdcd17e7710b4d448150fb 785888 ocrodjvu_0.7.17.orig.tar.gz
 3772149d8346e3008e46fe1647cf9a4da3b1c0c6 4476 ocrodjvu_0.7.17-1.debian.tar.xz
 5805c8c76fbe1cf8e24eb10effb46e46f0bf0464 41834 ocrodjvu_0.7.17-1_all.deb
Checksums-Sha256: 
 7adfe9a5b8c5c11d6a5270fba163e7da7993e1f382ef9e98c834b4a486a8e8cd 2155 ocrodjvu_0.7.17-1.dsc
 f3575c662a201e50d0a9664fe9d2829b8fbf4ae970dace2e52526bc9395268e1 785888 ocrodjvu_0.7.17.orig.tar.gz
 23ae506c27ad85bc98fc890d523729ae03eddd1eccdce28482dcd8efe65c74a4 4476 ocrodjvu_0.7.17-1.debian.tar.xz
 6ff42d1276868e7334b7dbf273adc48c1a11219a484bc532fcafbd211119c187 41834 ocrodjvu_0.7.17-1_all.deb
Files: 
 f478854a3d426890058fd09a9724c4c8 2155 text optional ocrodjvu_0.7.17-1.dsc
 1e57a6e4f2b5dd2a95494cb1179a9eec 785888 text optional ocrodjvu_0.7.17.orig.tar.gz
 4d4e2a53745bb1b427d64d335e30c75f 4476 text optional ocrodjvu_0.7.17-1.debian.tar.xz
 a249fffdf3ff2488246bb8f6b8e6424f 41834 text optional ocrodjvu_0.7.17-1_all.deb

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQIcBAEBCAAGBQJTAHmUAAoJEI7tzBuqHzL/crkQAJtGXUNMrKW0U2Dl1VODkkxh
fxwauZF6Ttoqbk3m51yn3cQ1QcDKbXFwbghlcj0js8ciPAZN+OQ6yiZur6Mc+ME9
bFiSLZqO6M+6tvY7JQhdZwVcIhHeGFy23/x4BXKcUZTVx44trvBzn4S8KUqgyC8u
nOvcpO0HlaDV7O0nR5HUFKGk/mkuKLBjvOlRtb7TKS/O35ZG89NGwnX9Gtk0CneI
CZ0HyMUDwpuL0tipjCKj+hkXu3OZplXf4157Gzbzp9OkcZ4gGB2p4yJTTgBikLT7
gP2DcSg4W4E867KK0dneH2xCKsF5jkyAmnXXUYiSY64syhPUmRhdttd4pRvLTcKD
GnoHEY3zOmqEtgPTv/i329qOzCh9N3uR5pFtXeklU3fRavCIavwnI7CqzunS/CsZ
51oga+zVXj65UE6nAnhGH5o1tMmpIWVdoZyHr/DW+WdlCc4NhPuDWjU2h14hIu9Z
tnYVr1KV9dyS7loOBtCntcQtlNeZD9XLL7+Rkoq+Jhq9WvR792zYeTiblS55Vu3F
frAQ0KlwjAsEuMON7cadghgbKoTaSpCdlcyy/8gDdbHFMGLh8kSynvLzvyVUZGei
OdT0Cm964Al0LcDYb7eFPJ3jb9plEd9/5dMfY3l0VCjk/TAr8DNqVod9T6FsVC8X
6ZgYvJgQNwztR0l2psAt
=Xypk
-----END PGP SIGNATURE-----

--- End Message ---
