Re: [opensuse] PDF OCR

Juergen Weigert Thu, 13 Dec 2007 09:07:43 -0800

On Dec 13, 07 09:19:11 +0100, Ciaran Farrell wrote:
> On Thursday 13 December 2007, StephenW wrote:
> > --- Roger Oberholtzer <[EMAIL PROTECTED]> wrote:
> > > Hello
> > >
> > > We have a network printer that will scan docs and send them as pdf docs
> > > to an e-mail address in the company. Is there any software with OpenSUSE
> > > 10.3 that can do OCR from a PDF doc? I am guessing that the doc contains
> > > tiff images of the scanned documents. Any and all pointers are welcome.
> 
> I had to do much the same in the past - a quick bash script seemed like the 
> best way to solve it:
> 
> 1. use pdftoppm to extract the images from the pdf to a new directory
> 2. use ppm2tiff on all the extracted ppm files
> 3. use tesseract or whatever it's called these days on the tiff files
> 4. append the text files to a single text file (or leave them separate, 
> whatever)
> 
> There's probably a much more sensible way of doing this :-) but this worked 
> consistently for me for quite a number of documents scanned and sent as pdf.
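
For anyone who wants to try it, the four quoted steps could be sketched
roughly as below. This is a minimal sketch, not Ciaran's actual script:
it assumes pdftoppm (xpdf/poppler), ppm2tiff (libtiff) and tesseract are
installed, and the function name ocr_pdf, the 300 dpi resolution and the
"page" file prefix are made up for illustration.

```shell
#!/bin/bash
# Sketch of the quoted four-step recipe (hypothetical helper, not the
# original poster's script). Assumes pdftoppm, ppm2tiff and tesseract
# are on PATH.

ocr_pdf() {
    local pdf=$1 out=$2 workdir ppm base
    workdir=$(mktemp -d) || return 1

    # 1. render every page of the PDF to a PPM image in a new directory
    #    (300 dpi is an assumed value, adjust to the scan quality)
    pdftoppm -r 300 "$pdf" "$workdir/page" || return 1

    : > "$out"    # start with an empty result file
    for ppm in "$workdir"/page-*.ppm; do
        [ -e "$ppm" ] || continue    # no pages rendered
        base=${ppm%.ppm}

        # 2. convert the PPM file to TIFF
        ppm2tiff "$ppm" "$base.tif" || return 1

        # 3. run OCR; tesseract writes its result to "$base.txt"
        tesseract "$base.tif" "$base" || return 1

        # 4. append the page text to one combined text file
        cat "$base.txt" >> "$out"
    done

    rm -rf "$workdir"
}

# Run the pipeline only when called as: script.sh input.pdf output.txt
if [ "$#" -eq 2 ]; then
    ocr_pdf "$1" "$2"
fi
```

Called as "sh ocr.sh scan.pdf scan.txt" (file names are examples), this
leaves the text of all pages in scan.txt. To keep the pages separate, as
the quoted step 4 allows, copy the per-page .txt files out of the work
directory instead of appending them.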


This is already the best approach, afaik.
I assume ocropus helps with layout issues such as multi-column text.

Any volunteers who want to try out ocropus?
I see rpm packages in
http://download.opensuse.org/repositories/home:/StefanBruens

        cheers,
                Jw.

-- 
 o \  Juergen Weigert  paint it green! __/ _=======.=======_
<V> | [EMAIL PROTECTED]       wide open suse_/        _---|____________\/
 \  | 0911 74053-508         (tm)__/          (____/            /\
(/) | __________________________/             _/ \_ vim:set sw=2 wm=8
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nuernberg)
"Novell is committed to creating a work environment that embraces clarity."

