On Dec 13, 07 09:19:11 +0100, Ciaran Farrell wrote:
> On Thursday 13 December 2007, StephenW wrote:
> > --- Roger Oberholtzer <[EMAIL PROTECTED]> wrote:
> > > Hello
> > >
> > > We have a network printer that will scan docs and send them as pdf docs
> > > to an e-mail address in the company. Is there any software with OpenSUSE
> > > 10.3 that can do OCR from a PDF doc? I am guessing that the doc contains
> > > tiff images of the scanned documents. Any and all pointers are welcome.
>
> I had to do much the same in the past - a quick bash script seemed like the
> best way to solve it:
>
> 1. use pdf2ppm to extract the images from the pdf to a new directory
> 2. use ppm2tiff on all the extracted ppm files
> 3. use tesseract or whatever it's called these days on the tiff files
> 4. append the text files to a single text file (or leave them separate,
>    whatever)
>
> There's probably a much more sensible way of doing this :-) but this worked
> consistently for me for quite a number of documents scanned and sent as pdf.

This is already the best approach, afaik. I assume ocropus helps with
layout issues such as multi-column text. Any volunteers who want to try
out ocropus? I see rpm packages in
http://download.opensuse.org/repositories/home:/StefanBruens

cheers, Jw.

-- 
Juergen Weigert
SUSE LINUX Products GmbH
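
For anyone who wants to try the four quoted steps, here is a rough sketch in Python. It is not Ciaran's script, just one way the steps might be wired together. Assumptions: "pdf2ppm" is taken to mean pdftoppm (from xpdf/poppler), ppm2tiff is the libtiff tool, and tesseract is on the PATH; the function names, the "page" file prefix, the 300 dpi default and the "all.txt" output name are all made up for illustration.

```python
import subprocess
from pathlib import Path


def extract_cmd(pdf, prefix, dpi=300):
    # Step 1: pdftoppm renders each page to <prefix>-<page number>.ppm
    # (assumed to be the tool meant by "pdf2ppm" in the quoted mail).
    return ["pdftoppm", "-r", str(dpi), str(pdf), str(prefix)]


def tiff_cmd(ppm):
    # Step 2: ppm2tiff <input.ppm> <output.tif>
    return ["ppm2tiff", str(ppm), str(ppm.with_suffix(".tif"))]


def ocr_cmd(tif):
    # Step 3: tesseract <image> <outputbase> writes <outputbase>.txt
    return ["tesseract", str(tif), str(tif.with_suffix(""))]


def ocr_pdf(pdf, workdir, combined="all.txt"):
    """Run steps 1-4 and return the path of the combined text file."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(extract_cmd(pdf, workdir / "page"), check=True)

    texts = []
    # pdftoppm zero-pads the page numbers, so a plain sort keeps page order.
    for ppm in sorted(workdir.glob("page-*.ppm")):
        subprocess.run(tiff_cmd(ppm), check=True)
        subprocess.run(ocr_cmd(ppm.with_suffix(".tif")), check=True)
        texts.append(ppm.with_suffix(".txt").read_text(encoding="utf-8"))

    # Step 4: append the per-page text files into a single file.
    out = workdir / combined
    out.write_text("\n".join(texts), encoding="utf-8")
    return out
```

Calling ocr_pdf("scan.pdf", "scan_pages") would leave the per-page .ppm, .tif and .txt files in scan_pages/ next to the combined all.txt, so the separate text files stay available as well. Whether the TIFF conversion is still needed depends on the tesseract version installed.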
