retitle 572522 ocrodjvu: crashes with ValueError on malformed hOCR
severity 572522 minor
thanks

ocrodjvu --render all --engine cuneiform --language pol --clear-text -o out.djvu in.djvu
Processing 'in.djvu':
- Page #1
- Page #2
Exception in thread Thread-2:
Traceback (most recent call last):
 File "/usr/lib/python2.5/threading.py", line 486, in __bootstrap_inner
   self.run()
 File "/usr/lib/python2.5/threading.py", line 446, in run
   self.__target(*self.__args, **self.__kwargs)
 File "/usr/share/ocrodjvu/lib/_ocrodjvu.py", line 443, in page_thread
   result = self.process_page(page)
 File "/usr/share/ocrodjvu/lib/_ocrodjvu.py", line 423, in process_page
   page_size=size
 File "/usr/share/ocrodjvu/lib/hocr.py", line 457, in extract_text
   scan_result = scan(doc.find('/body'), settings)
 File "/usr/share/ocrodjvu/lib/hocr.py", line 419, in scan
   _scan(node, buffer, BBox(), settings)
 File "/usr/share/ocrodjvu/lib/hocr.py", line 394, in _scan
   look_down(result, bbox)
 File "/usr/share/ocrodjvu/lib/hocr.py", line 342, in look_down
   _scan(child, buffer, parent_bbox, settings)
 File "/usr/share/ocrodjvu/lib/hocr.py", line 407, in _scan
   result[:] = _replace_cuneiform08_paragraph(result[:], settings)
 File "/usr/share/ocrodjvu/lib/hocr.py", line 234, in _replace_cuneiform08_paragraph
   raise ValueError
ValueError

ocrodjvu does indeed crash, but this is a case of garbage in, garbage out. If you run ocrodjvu with the --debug option, you'll see that the resulting hOCR files don't contain anything legible. In fact, the hOCR for page 2 also contains some control characters, which completely break HTML parsing and lead to the crash.
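
To check whether a saved hOCR file contains such control characters, something along these lines could be used. This is only a sketch and not part of ocrodjvu; the helper name find_control_chars is made up for illustration, and it assumes the hOCR has already been read into a string.

```python
import re

# C0 control characters other than tab, newline and carriage return.
# These are not allowed in XML 1.0 documents and tend to confuse HTML parsers.
_CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')

def find_control_chars(text):
    """Return (offset, code point) pairs for every control character in text.

    Hypothetical helper for inspecting hOCR output; not an ocrodjvu function.
    """
    return [(m.start(), ord(m.group())) for m in _CONTROL_CHARS.finditer(text)]

if __name__ == '__main__':
    # A made-up hOCR fragment with two stray control characters.
    sample = '<span class="ocr_line">ab\x01c\x1fd</span>'
    for offset, code in find_control_chars(sample):
        print('control character 0x%02x at offset %d' % (code, offset))
```

An empty result means the text is free of these characters; it does not mean the hOCR is otherwise well-formed or legible.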

I cannot do much about this, except make the error message more helpful.
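
One possible shape for a more helpful message is sketched below. It is an assumption about how this could look, not the actual change: MalformedHocr and extract_text_checked are hypothetical names, and `extract` stands in for whatever function parses the hOCR of a single page.

```python
class MalformedHocr(ValueError):
    """Hypothetical exception naming the page whose hOCR could not be used."""

def extract_text_checked(extract, hocr, page_number):
    """Call extract(hocr) and turn a bare ValueError into a descriptive one.

    `extract` is a placeholder for the real hOCR parsing function.
    """
    try:
        return extract(hocr)
    except ValueError as exc:
        # A bare "raise ValueError" carries no message, so supply a fallback.
        details = str(exc) or 'no details'
        raise MalformedHocr(
            'page %d: the OCR engine produced malformed hOCR (%s); '
            'rerun with --debug to inspect it' % (page_number, details)
        )
```

With this, the user would see which page failed and how to look at the offending hOCR, instead of a traceback ending in a bare ValueError.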

--
Jakub Wilk
