Dear Yigit,

On Thursday 7 March 2013 11:16:50, Guenay, Yigit wrote:
> I'm trying to upload a scanned document into the server and perform OCR on
> it. I've employed the following command, but I guess the directives aren't
> quite correct:
>
> ../bin/bibdocfile --revise (or append) mydoc.pdf --recid=1015
> --with-flags='PDF/A,OCRED' --textify --with-ocr --with-format=pdf
>
> Does someone have any experience on OCR in Invenio? Since I wasn't able to
> find any documentation on that - besides the help output of the command
> which didn't help much either, I would like to know if OCR is ever possible
> in Invenio and how to do it if it's possible.
First of all, do you have a working OCROpus installation? That requires
installing OCROpus 0.3.1 (newer versions are not compatible with Invenio):

<http://ocropus.googlecode.com/files/ocropus-0.3.1.tar.gz>

and Tesseract 2.04:

<http://tesseract-ocr.googlecode.com/files/tesseract-2.04.tar.gz>

You will also need some Tesseract dictionaries and all the necessary
dependencies.

If you install all of these from source, e.g. under /opt/ocropus, make sure
that /opt/ocropus/bin is in your PATH variable when you launch the
./configure script during the Invenio installation.

To verify that OCROpus is correctly installed, you can then run from a
Python prompt:

[...]
from invenio.websubmit_file_converter import CFG_CAN_DO_OCR
print CFG_CAN_DO_OCR
[...]

which should print True.

I'd then suggest first trying a manual OCR via Invenio with:

$ sudo -u www-data python \
    /opt/invenio/lib/python/invenio/websubmit_file_converter.py \
    --special-pdf2hocr2pdf=mydoc.pdf --debug --output=output.pdf

and seeing whether this gives results.

Additionally, note that you cannot use bibdocfile --revise together with
--textify. Also, the OCRED flag is reserved for the case where you are
uploading a PDF that originally came from a scanned document and on which
OCR has already been performed, with the recognized text stored in the
background.

Cheers!
    Samuele

-- 
Samuele Kaplun
Invenio Developer
<http://invenio-software.org/>
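
As a complement to the CFG_CAN_DO_OCR check above, the PATH requirement can be tested before running ./configure. The following is only a minimal sketch: the helper names find_on_path and missing_tools are made up for illustration, and the executable names "ocroscript" and "tesseract" are assumptions about what the two builds install, so adjust them to your setup.

```python
import os


def find_on_path(program, search_path=None):
    """Return the full path of `program` if an executable file with that
    name exists in one of the directories of `search_path` (defaults to
    the PATH environment variable); return None otherwise."""
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def missing_tools(programs, search_path=None):
    """Return the names in `programs` that cannot be found."""
    return [p for p in programs if find_on_path(p, search_path) is None]


if __name__ == "__main__":
    # Assumed executable names; change them to match your installation.
    tools = ["ocroscript", "tesseract"]
    missing = missing_tools(tools)
    if missing:
        print("Not on PATH: " + ", ".join(missing))
    else:
        print("All OCR tools found on PATH")
```

If a tool is reported as missing, adding its directory (e.g. /opt/ocropus/bin) to PATH and re-running ./configure would be the next thing to try.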

