Apologies for hijacking this topic, but I'm having what looks like a very similar issue. Did you ever get your problem solved? Office files (both Excel and Word) fail to OCR and, as with the original post, just show 'error' in the OCR queue.

Following the advice in this topic, I have run the commands manually. LibreOffice runs fine and converts the file to a PDF, which I've opened and which displays correctly:

    libreoffice --headless --convert-to pdf e2582b36-bbfd-4f00-b162-506bafa58f7e --outdir /tmp
    convert /tmp/e2582b36-bbfd-4f00-b162-506bafa58f7e -> /tmp/e2582b36-bbfd-4f00-b162-506bafa58f7e.pdf using writer_pdf_Export

Unpaper, however, fails, and searching the Internet has given no specific answers:

    unpaper --overwrite --no-multi-pages e2582b36-bbfd-4f00-b162-506bafa58f7e.pdf /tmp
    Processing sheet: e2582b36-bbfd-4f00-b162-506bafa58f7e.pdf -> /tmp
    *** error: input file format using magic '%P' is unknown.
    *** error: Cannot load image e2582b36-bbfd-4f00-b162-506bafa58f7e.pdf.
    *** error: sheet size unknown, use at least one input file per sheet, or force using --sheet-size.

I've tried specifying A4 as the sheet size but that made no difference. I spotted a setting 'COMMON_DEFAULT_PAPER_SIZE' which defaults to Letter, so I changed that to A4 (not that it would make a difference when running the commands manually on the command line!).

This was with a Word .docx. I get exactly the same error with an Excel .xlsx file. In both cases there is nothing clever about the files: just basic text and standard styles in the Word doc, and a short list of names and dates in the Excel file.

I have also uploaded a PDF and a text file, both of which parsed correctly, generated thumbnails and had their text extracted, so is it just Office files?

Anyone have any clues?

Server is Ubuntu 12.04. LibreOffice 3.5, Tesseract 3.x and Unpaper 0.3 are installed from the repository. Mayan 0.12.2 is installed in a virtualenv by hand rather than using the fabric file, but using pip and installing the specific versions of packages listed in requirements/production.txt.

Cheers,
Steve.
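
P.S. One observation on the unpaper error, offered as a guess rather than a diagnosis: the '%P' it reports matches the first two bytes of a PDF header (%PDF), and as far as I know Unpaper 0.3 only reads PNM images (PBM/PGM/PPM), whose files start with P1 to P6. If that is right, my manual test handed unpaper a file type it cannot read, so it may not reproduce whatever Mayan does internally. Below is a minimal Python sketch for checking what a file starts with. The function name sniff_format and the MAGIC table are made up for illustration and are not part of Mayan or unpaper.

```python
import sys

# Two-byte signatures. P1-P6 are the Netpbm (PNM) magic numbers;
# "%P" is the start of the "%PDF" header that PDF files begin with.
MAGIC = {
    b"P1": "PBM (plain)",
    b"P2": "PGM (plain)",
    b"P3": "PPM (plain)",
    b"P4": "PBM (raw)",
    b"P5": "PGM (raw)",
    b"P6": "PPM (raw)",
    b"%P": "PDF",
}


def sniff_format(path):
    """Return a label for the file's first two bytes, or None if unrecognised."""
    with open(path, "rb") as handle:
        return MAGIC.get(handle.read(2))


if __name__ == "__main__":
    # Hypothetical usage: python sniff.py FILE [FILE ...]
    for name in sys.argv[1:]:
        print("%s -> %s" % (name, sniff_format(name) or "unknown"))
```

Pointing this at the .pdf produced by LibreOffice should report PDF, which would at least confirm that the unpaper step above was given a PDF rather than an image.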
