Re: OCR of Documents

Guenay, Yigit Tue, 19 Mar 2013 02:41:40 -0700

Dear Samuele,

Thanks for your answer. I installed OCROpus, but it does not seem to be able
to handle PDF files. Here is the result I get when trying to OCR a PDF:
$ ocroscript recognize demo.pdf
ocroscript: /usr/share/ocropus/scripts//recognize.lua:180: demo.pdf: file has an unknown extension


I also tried what you suggested:
$ sudo -u www-data python \
/opt/invenio/lib/python/invenio/websubmit_file_converter.py \
--special-pdf2hocr2pdf=mydoc.pdf --debug --output=output.pdf

However, I received a bunch of errors like the following:
ERROR: ERROR: Error in running ['/usr/bin/convert', 
'/opt/invenio/var/tmp/conversionsoz5JK/image-1.ppm', '-rotate', '90', '-depth', 
'8', '/opt/invenio/var/tmp/conversionsoz5JK/rotated.ppm']
 stdout:

stderr:

I think that OCROpus not being able to OCR pdfs causes the problem, but I'm not 
completely sure. Do you have any idea on what could have gone wrong?
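
Since the log above shows empty stdout and stderr, one way to narrow this down
might be to run the failing command by hand and look at its exit code. Below is
a minimal sketch; run_and_report is a hypothetical helper written for this
message, not part of Invenio, and the temporary file named in the log has
probably been cleaned up, so an existing PPM file would have to be substituted.

```python
import subprocess


def run_and_report(cmd):
    """Run cmd (a list of arguments) and return (exit_code, stdout, stderr).

    Hypothetical diagnostic helper, not an Invenio function.
    """
    try:
        proc = subprocess.Popen(cmd,
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE,
                                universal_newlines=True)
    except OSError as err:
        # The executable could not be started at all (missing, no permission).
        return None, "", str(err)
    out, err = proc.communicate()
    return proc.returncode, out, err


# Example usage (paths are placeholders; point them at real files):
# code, out, err = run_and_report(
#     ['/usr/bin/convert', 'image-1.ppm', '-rotate', '90',
#      '-depth', '8', 'rotated.ppm'])
# print("exit code: %s" % code)
# print("stderr: %s" % err)
```

An exit code of None would mean that the executable itself could not be
started, which is a different problem from convert starting and then failing.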

I can OCR PNG files, though, unlike PDFs. Is there another option similar to
--special-pdf2hocr2pdf that I can use to convert PNG to PDF? Since my scanned
documents can come in different formats, including PDF and some image
formats, it doesn't really matter which format I'm converting from, as long as
it works.

Regards,
Yigit Günay

-----Original Message-----
From: Samuele Kaplun [mailto:[email protected]]
Sent: Friday, 8 March 2013 16:11
To: Guenay, Yigit
Cc: [email protected]
Subject: Re: OCR of Documents

Dear Yigit,

On Thursday, 7 March 2013 11:16:50, Guenay, Yigit wrote:
> I'm trying to upload a scanned document into the server and perform 
> OCR on it. I've employed the following command, but I guess the 
> directives aren't quite correct:
> 
> ../bin/bibdocfile --revise (or append) mydoc.pdf --recid=1015 
> --with-flags='PDF/A,OCRED' --textify --with-ocr --with-format=pdf
> 
> Does someone have any experience on OCR in Invenio? Since I wasn't 
> able to find any documentation on that - besides the help output of 
> the command which didn't help much either, I would like to know if OCR 
> is ever possible in Invenio and how to do it if it's possible.

First of all, do you have a working OCROpus installation? That requires
installing OCROpus 0.3.1 (newer versions are not compatible with Invenio):

<http://ocropus.googlecode.com/files/ocropus-0.3.1.tar.gz>

and Tesseract 2.04:

<http://tesseract-ocr.googlecode.com/files/tesseract-2.04.tar.gz>

You will also require some Tesseract dictionaries, and all the necessary 
dependencies.

If you install all these from sources e.g. under /opt/ocropus you should take 
care that /opt/ocropus/bin is in your PATH variable when you launch the 
./configure script upon Invenio installation.
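
A quick way to check the PATH part could look like the sketch below. The list
of tool names is my assumption based on what is mentioned in this thread
(ocroscript, tesseract, and ImageMagick's convert), and find_on_path and report
are hypothetical helpers, not Invenio functions. Run it as the same user that
runs Invenio (e.g. via sudo -u www-data), since PATH may differ between users.

```python
import os


def find_on_path(name, path=None):
    """Return the full path of executable `name`, or None if it is not found."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            # Skip empty PATH entries instead of treating them as the cwd.
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def report(names=("ocroscript", "tesseract", "convert")):
    """Map each tool name to its location on PATH (or None)."""
    return dict((name, find_on_path(name)) for name in names)


for tool, location in sorted(report().items()):
    print("%-12s %s" % (tool, location or "NOT FOUND on PATH"))
```

If one of the tools is reported as not found, the directory it was installed
into (e.g. /opt/ocropus/bin) is likely missing from PATH for that user.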

To verify that OCROpus is correctly installed, you can then just run the
following from a Python prompt:

[...]
from invenio.websubmit_file_converter import CFG_CAN_DO_OCR
print CFG_CAN_DO_OCR
[...]

which should return a nice True value.

I’d then suggest first trying a manual OCR via Invenio, using:

$ sudo -u www-data python \
/opt/invenio/lib/python/invenio/websubmit_file_converter.py \
--special-pdf2hocr2pdf=mydoc.pdf --debug --output=output.pdf

And see if this gives results.

Additionally, note that you cannot run bibdocfile --revise at the same time as
--textify. Also, the OCRED flag is reserved for the case where you are
uploading a PDF that originally came from a scanned document and on which OCR
has already been performed, with the recognized text stored in the background.

Cheers!
        Samuele

--
Samuele Kaplun
Invenio Developer ** <http://invenio-software.org/>
