Re: [Mayan EDMS: 368] Re: OCR error on .doc files

Steve Kersley Wed, 28 Nov 2012 09:15:04 -0800

Apologies for hijacking this topic, but I'm having what looks like a very
similar issue.  Did you ever get your problem solved?
Office files (both Excel and Word) fail to OCR and, as with the original
post, just show 'error' in the OCR queue.


Following the advice in this topic, I have run the commands manually.
LibreOffice runs fine and converts the file to a PDF, which I've opened and
which displays correctly:
libreoffice --headless --convert-to pdf e2582b36-bbfd-4f00-b162-506bafa58f7e --outdir /tmp
convert /tmp/e2582b36-bbfd-4f00-b162-506bafa58f7e -> /tmp/e2582b36-bbfd-4f00-b162-506bafa58f7e.pdf using writer_pdf_Export
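
For anyone wanting to script this step rather than type it by hand, here is a minimal Python sketch. The helper names are hypothetical and are not part of Mayan; the flags are only the ones shown in the command above. The output-path rule (base name plus .pdf inside the --outdir directory) is inferred from the "convert ... ->" line above and is an assumption for files that already have an extension.

```python
from pathlib import Path


def libreoffice_pdf_command(src, outdir):
    """Build the argument list for the conversion command shown above.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    return [
        "libreoffice", "--headless",
        "--convert-to", "pdf",
        str(src),
        "--outdir", str(outdir),
    ]


def expected_pdf_path(src, outdir):
    """Guess where the converted PDF should land.

    Assumption: the output is the source base name (extension dropped, if
    any) plus ".pdf", placed in outdir, as the log line above suggests.
    """
    return Path(outdir) / (Path(src).stem + ".pdf")
```

The list from libreoffice_pdf_command() can be passed to subprocess.run(), after which expected_pdf_path() gives a file to check for.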

Unpaper however fails, and searching the Internet has given no specific 
answers.
unpaper --overwrite --no-multi-pages e2582b36-bbfd-4f00-b162-506bafa58f7e.pdf /tmp
Processing sheet: e2582b36-bbfd-4f00-b162-506bafa58f7e.pdf -> /tmp
*** error: input file format using magic '%P' is unknown.
*** error: Cannot load image e2582b36-bbfd-4f00-b162-506bafa58f7e.pdf.
*** error: sheet size unknown, use at least one input file per sheet, or force using --sheet-size.
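
The '%P' in the first error looks like the first two bytes of a PDF header ("%PDF"). As far as I know, Unpaper 0.3 reads only PNM images (PBM/PGM/PPM), whose files start with "P1" to "P6". If that is right, a PDF fed directly to unpaper would be rejected this way whatever the sheet size. Below is a minimal Python sketch for checking what a file really is from its leading bytes; the function name is hypothetical and is not part of Mayan or Unpaper.

```python
def sniff_format(header: bytes) -> str:
    """Classify a file from its first few bytes.

    Returns "pdf" for a "%PDF" header, "pnm" for the PNM magic numbers
    "P1".."P6", and "unknown" for anything else.
    """
    if header[:4] == b"%PDF":
        return "pdf"
    # PNM magic is the letter P followed by a digit from 1 to 6.
    if len(header) >= 2 and header[:1] == b"P" and header[1:2] in b"123456":
        return "pnm"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first bytes of the file at `path` and classify them."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(4))
```

Running sniff_file() on the LibreOffice output would tell you whether the file still needs converting to an image format before it is handed to unpaper.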

I've tried specifying A4 as the sheet size, but that made no difference.  I
also spotted a setting, 'COMMON_DEFAULT_PAPER_SIZE', which defaults to Letter,
so I changed that to A4 (not that it would make a difference when running the
commands manually on the command line!).  This was with a Word .docx file.
I get exactly the same error with an Excel .xlsx file.  In both cases there is
nothing clever about the files: just basic text and standard styles in the
Word doc, and a short list of names and dates in the Excel file.  I have also
uploaded a PDF and a text file, both of which parsed correctly, generated
thumbnails and had their text extracted, so is it just Office files?

Anyone have any clues?
Server is Ubuntu 12.04.  LibreOffice 3.5, Tesseract 3.x and Unpaper 0.3 are
installed from the repository.  Mayan 0.12.2 was installed in a virtualenv by
hand rather than using the fabric file, but using pip and installing the
specific versions of packages listed in requirements/production.txt.

Cheers,
Steve.
