Dear Yigit,

On Thursday, 7 March 2013 11:16:50, Guenay, Yigit wrote:
> I'm trying to upload a scanned document into the server and perform OCR on
> it. I've employed the following command, but I guess the directives aren't
> quite correct:
> 
> ../bin/bibdocfile --revise (or append) mydoc.pdf --recid=1015
> --with-flags='PDF/A,OCRED' --textify --with-ocr --with-format=pdf
> 
> Does someone have any experience on OCR in Invenio? Since I wasn't able to
> find any documentation on that - besides the help output of the command
> which didn't help much either, I would like to know if OCR is ever possible
> in Invenio and how to do it if it's possible.

First of all, do you have a working OCROpus installation? That requires 
installing OCROpus 0.3.1 (newer versions are not compatible with Invenio):

<http://ocropus.googlecode.com/files/ocropus-0.3.1.tar.gz>

and Tesseract 2.04:

<http://tesseract-ocr.googlecode.com/files/tesseract-2.04.tar.gz>

You will also require some Tesseract dictionaries, and all the necessary 
dependencies.

If you install all of these from source, e.g. under /opt/ocropus, make sure 
that /opt/ocropus/bin is in your PATH variable when you run the 
./configure script during the Invenio installation.
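
Before running ./configure, one quick way to double-check the PATH is a 
small Python sketch like the one below. It is only an illustration: 
find_on_path is a hypothetical helper (not part of Invenio), and I am 
assuming here that the OCROpus 0.3.1 executable is called ocroscript and 
the Tesseract one tesseract; adjust the names if your build differs.

```python
import os


def find_on_path(name, path=None):
    """Return the full path of executable `name` found in PATH, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, name)
        # Accept only regular files that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


# Assumed executable names; change them if your installation differs.
for tool in ("ocroscript", "tesseract"):
    print("%s -> %s" % (tool, find_on_path(tool) or "NOT FOUND in PATH"))
```

If either tool is reported as not found, fix your PATH first and only 
then run ./configure.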

To verify that OCROpus is correctly installed, you can then run the 
following from a Python prompt:

[...]
from invenio.websubmit_file_converter import CFG_CAN_DO_OCR
print CFG_CAN_DO_OCR
[...]

which should print True.

I’d then suggest first trying a manual OCR run via Invenio, using:

$ sudo -u www-data python \
/opt/invenio/lib/python/invenio/websubmit_file_converter.py \
--special-pdf2hocr2pdf=mydoc.pdf --debug --output=output.pdf

And see if this gives results.

Additionally, note that you cannot use bibdocfile --revise and --textify in 
the same invocation. Also, the OCRED flag is reserved for the case where you 
are uploading a PDF that originally came from a scanned document and on which 
OCR has already been performed, with the recognized text stored in the 
background.

Cheers!
        Samuele

-- 
Samuele Kaplun
Invenio Developer ** <http://invenio-software.org/>
