Dear Samuele,

Thanks for your answer. I installed OCROpus, but it does not seem to handle PDF files. Here is the result I get when I try to OCR a PDF:

$ ocroscript recognize demo.pdf
ocroscript: /usr/share/ocropus/scripts//recognize.lua:180: demo.pdf: file has an unknown extension
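
The "unknown extension" error above suggests that `ocroscript recognize` expects image input rather than PDF. One possible workaround is to rasterize the PDF into one image per page first and then OCR each page. The following is only a minimal sketch in Python 3: it assumes poppler's `pdftoppm` is installed (that tool is not mentioned in this thread), and the helper names `needs_rasterizing`, `rasterize_command` and `recognize_command` are hypothetical. It only builds the command lines and does not run anything.

```python
import os


def needs_rasterizing(path):
    """Return True if `path` looks like a PDF, judged by its extension only."""
    return os.path.splitext(path)[1].lower() == ".pdf"


def rasterize_command(pdf_path, prefix="page", dpi=300):
    """Build a pdftoppm command that writes one PNG per page of `pdf_path`.

    Assumption: poppler's pdftoppm is available on PATH. The output files
    start with `prefix` followed by the page number.
    """
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, prefix]


def recognize_command(image_path):
    """Build the ocroscript call quoted in this message for a single image."""
    return ["ocroscript", "recognize", image_path]


if __name__ == "__main__":
    source = "demo.pdf"
    if needs_rasterizing(source):
        # Rasterize first; each resulting PNG is then recognized separately.
        print(" ".join(rasterize_command(source)))
    else:
        print(" ".join(recognize_command(source)))
```

Each list can be passed to `subprocess.run` as is. Whether the resulting PNG files are accepted depends on the installed OCROpus version, so it is worth trying a single page by hand first.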

I also tried what you suggested:

$ sudo -u www-data python \
/opt/invenio/lib/python/invenio/websubmit_file_converter.py --special-pdf2hocr2pdf=mydoc.pdf --debug --output=output.pdf

However, I received a bunch of errors like the following:

ERROR: ERROR: Error in running ['/usr/bin/convert', '/opt/invenio/var/tmp/conversionsoz5JK/image-1.ppm', '-rotate', '90', '-depth', '8', '/opt/invenio/var/tmp/conversionsoz5JK/rotated.ppm'] stdout: stderr:

I think the problem is caused by OCROpus not being able to OCR PDFs, but I'm not completely sure. Do you have any idea what could have gone wrong?

Unlike PDFs, I can OCR PNG files, though. Is there another option like --special-pdf2hocr2pdf with which I can convert PNG to PDF? Since my scanned documents can come in different formats, including PDF and some image formats, it doesn't really matter which format I convert from, as long as it works.

Regards,
Yigit Günay

-----Original Message-----
From: Samuele Kaplun [mailto:[email protected]]
Sent: Friday, 8 March 2013 16:11
To: Guenay, Yigit
Cc: [email protected]
Subject: Re: OCR of Documents

Dear Yigit,

On Thursday, 7 March 2013 at 11:16:50, Guenay, Yigit wrote:
> I'm trying to upload a scanned document into the server and perform
> OCR on it. I've employed the following command, but I guess the
> directives aren't quite correct:
>
> ../bin/bibdocfile --revise (or append) mydoc.pdf --recid=1015
> --with-flags='PDF/A,OCRED' --textify --with-ocr --with-format=pdf
>
> Does someone have any experience on OCR in Invenio? Since I wasn't
> able to find any documentation on that - besides the help output of
> the command which didn't help much either, I would like to know if OCR
> is ever possible in Invenio and how to do it if it's possible.

First of all, do you have a working OCROpus installation?
That requires installing OCROpus 0.3.1 (newer versions are not compatible with Invenio):
<http://ocropus.googlecode.com/files/ocropus-0.3.1.tar.gz>
and Tesseract 2.04:
<http://tesseract-ocr.googlecode.com/files/tesseract-2.04.tar.gz>
You will also need some Tesseract dictionaries and all the necessary dependencies.

If you install all of these from source, e.g. under /opt/ocropus, make sure that /opt/ocropus/bin is in your PATH variable when you launch the ./configure script during the Invenio installation.

To verify that OCROpus is correctly installed, you can then just run the following from a Python prompt:

[...]
from invenio.websubmit_file_converter import CFG_CAN_DO_OCR
print CFG_CAN_DO_OCR
[...]

which should print a nice True value.

I’d then suggest first trying a manual OCR via Invenio with:

$ sudo -u www-data python \
/opt/invenio/lib/python/invenio/websubmit_file_converter.py --special-pdf2hocr2pdf=mydoc.pdf --debug --output=output.pdf

and seeing whether this gives results.

Additionally, note that you cannot run bibdocfile --revise at the same time as --textify. Also, the OCRED flag is reserved for the case where you upload a PDF that originally came from a scanned document and on which OCR has already been performed, with the recognized text stored in the background.

Cheers!
Samuele

--
Samuele Kaplun
Invenio Developer ** <http://invenio-software.org/>
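
The PATH advice above can be checked before launching ./configure. The following is a minimal sketch in Python 3 (Invenio itself used Python 2 at the time, so this is meant as a standalone check, not as Invenio code). The helper name `missing_tools` is hypothetical, and the list of binaries is an assumption: `ocroscript` and `convert` appear in this thread, while `tesseract` is the usual name of the Tesseract executable.

```python
import os
import shutil


def missing_tools(tools, search_path=None):
    """Return the names in `tools` that cannot be found on `search_path`.

    If `search_path` is None, the current PATH environment variable is used.
    """
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    return [tool for tool in tools
            if shutil.which(tool, path=search_path) is None]


if __name__ == "__main__":
    # Assumed binary names; adjust them to match the local installation.
    missing = missing_tools(["ocroscript", "tesseract", "convert"])
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All OCR tools were found on PATH.")
```

An empty result only shows that the executables are reachable. It does not confirm that the installed versions are the ones Invenio expects, so the CFG_CAN_DO_OCR check is still needed.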

