retitle 572522 ocrodjvu: crashes with ValueError on malformed hOCR
severity 572522 minor
thanks
ocrodjvu --render all --engine cuneiform --language pol --clear-text -o out.djvu in.djvu
Processing 'in.djvu':
- Page #1
- Page #2
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.5/threading.py", line 486, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.5/threading.py", line 446, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/share/ocrodjvu/lib/_ocrodjvu.py", line 443, in page_thread
    result = self.process_page(page)
  File "/usr/share/ocrodjvu/lib/_ocrodjvu.py", line 423, in process_page
    page_size=size
  File "/usr/share/ocrodjvu/lib/hocr.py", line 457, in extract_text
    scan_result = scan(doc.find('/body'), settings)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 419, in scan
    _scan(node, buffer, BBox(), settings)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 394, in _scan
    look_down(result, bbox)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 342, in look_down
    _scan(child, buffer, parent_bbox, settings)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 407, in _scan
    result[:] = _replace_cuneiform08_paragraph(result[:], settings)
  File "/usr/share/ocrodjvu/lib/hocr.py", line 234, in _replace_cuneiform08_paragraph
    raise ValueError
ValueError
ocrodjvu does indeed crash, but this is a case of garbage in, garbage out. If you run ocrodjvu with the --debug option, you'll see that the resulting hOCR files don't contain anything legible. In fact, the hOCR for page 2 also contains some control characters, which completely break HTML parsing and lead to the crash.
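For anyone who wants to check an hOCR file kept by --debug for such characters, here is a minimal sketch. It is not part of ocrodjvu; the function name find_control_characters is made up for illustration, and it assumes the offending bytes are C0 control characters other than tab, newline and carriage return (the ones XML 1.0 forbids). It is written for Python 3.

```python
import re

# C0 control characters other than tab (\x09), newline (\x0a) and
# carriage return (\x0d); XML 1.0 does not allow these in a document.
_CONTROL_RE = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def find_control_characters(text):
    """Return a list of (line, column, repr) tuples, both 1-based.

    Hypothetical helper for illustration only.
    """
    hits = []
    # split('\n') is used instead of splitlines(), because splitlines()
    # would itself treat some control characters as line breaks.
    for line_no, line in enumerate(text.split('\n'), 1):
        for match in _CONTROL_RE.finditer(line):
            hits.append((line_no, match.start() + 1, repr(match.group())))
    return hits
```

Reading the hOCR file into a string and passing it to find_control_characters() gives the position of every suspicious character; an empty list means none were found.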
I cannot do much about this, except make the error message more helpful.
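As a rough idea of what a more helpful message could look like, here is a sketch. It does not reflect ocrodjvu's actual code: the MalformedHocr class and the scan_page wrapper are hypothetical names, and the assumption is only that a bare ValueError escaping the hOCR scan means the OCR engine produced unusable output. It is written for Python 3.

```python
class MalformedHocr(ValueError):
    """Hypothetical exception: the OCR engine produced unusable hOCR."""


def scan_page(scan, page_number):
    """Call scan() and turn a bare ValueError into a descriptive error.

    `scan` stands in for whatever callable parses the hOCR of one page;
    it is an assumption made for this sketch.
    """
    try:
        return scan()
    except ValueError:
        raise MalformedHocr(
            'page #%d: the OCR engine returned malformed hOCR; '
            're-run with --debug to inspect it' % page_number
        )
```

Because MalformedHocr subclasses ValueError, code that already catches ValueError would keep working, while the user sees which page failed and how to investigate.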
-- Jakub Wilk