Re: [opensuse] PDF OCR

Ciaran Farrell Thu, 13 Dec 2007 00:19:57 -0800

On Thursday 13 December 2007, StephenW wrote:
> --- Roger Oberholtzer <[EMAIL PROTECTED]> wrote:
> > Hello
> >
> > We have a network printer that will scan docs and send them as pdf docs
> > to an e-mail address in the company. Is there any software with OpenSUSE
> > 10.3 that can do OCR from a PDF doc? I am guessing that the doc contains
> > tiff images of the scanned documents. Any and all pointers are welcome.


I had to do much the same in the past - a quick bash script seemed like the 
best way to solve it:

1. use pdftoppm to render the pdf pages as ppm images in a new directory
2. use ppm2tiff on all the extracted ppm files
3. use tesseract (or whatever it's called these days) on the tiff files
4. append the text files to a single text file (or leave them separate, 
whatever)

There's probably a much more sensible way of doing this :-) but this worked 
consistently for me for quite a number of documents scanned and sent as pdf.
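
For reference, the four steps could be sketched roughly as below. This is not 
the original script, just a minimal reconstruction: it assumes pdftoppm, 
ppm2tiff and tesseract are all on the PATH, the function name ocr_pdf and the 
300 dpi setting are made up for illustration, and it has not been tuned for 
any particular scanner output.

```shell
#!/bin/bash
# Sketch only: assumes pdftoppm, ppm2tiff and tesseract are on the PATH.
# ocr_pdf is a hypothetical helper name, not part of any package.
ocr_pdf() (
    pdf=$1
    out=$2
    # Work in a scratch directory that is removed when the subshell exits.
    workdir=$(mktemp -d) || exit 1
    trap 'rm -rf "$workdir"' EXIT

    # 1. Render every PDF page to a PPM image (300 dpi is an assumed setting).
    pdftoppm -r 300 "$pdf" "$workdir/page" || exit 1

    # Start with an empty output file.
    : > "$out"
    for ppm in "$workdir"/page*.ppm; do
        [ -e "$ppm" ] || continue      # skip if no pages were rendered
        base=${ppm%.ppm}
        # 2. Convert the PPM to TIFF.
        ppm2tiff "$ppm" "$base.tif" || exit 1
        # 3. OCR the TIFF; tesseract writes its result to "$base.txt".
        tesseract "$base.tif" "$base" || exit 1
        # 4. Append this page's text to the combined output file.
        cat "$base.txt" >> "$out"
    done
)
```

Calling it as `ocr_pdf scan.pdf scan.txt` would leave the combined text in 
scan.txt. Dropping the `cat` line and keeping the per-page .txt files covers 
the "leave them separate" variant, though the scratch directory would then 
need to be kept as well.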

Ciaran



-- 
SUSE LINUX Products GmbH
GF: Markus Rex
HRB 16746 (AG Nuremberg)
Maxfeldstrasse 5
90409, Nuremberg
Tel: +49 911 74053 262
