On Tue, Jan 22, 2013 at 12:34:05PM -0500, David H. Durgee wrote:
> I am trying to determine how best to scan and save these documents.

I have found the following process to be useful:

Scanner input, jpg (or pdf)
  |
  v
tidy up image using 'unpaper'
  |
  v
convert to grayscale via ppmtopgm -> pamtotiff
  |
  v
OCR using tesseract

Tesseract can embed the OCR in the PDF, too (search for "tesseract hocr").

This is a makefile I use to automate that process, starting from an
image-only PDF generated by my scanner:

http://www.martindengler.com/proj/scan-post-process-Makefile

...like so:

  make -f scan-post-process-Makefile $(basename input.pdf .pdf)-processed

Tesseract isn't perfect, but it's pretty good.

> Dave

Martin
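
For reference, the steps in the diagram above could be scripted roughly as follows. This is only a sketch and is not the contents of the linked Makefile. It assumes unpaper, netpbm (ppmtopgm, pamtotiff) and tesseract are installed, and that each page is already a PPM file (for example, extracted from the scanner's PDF beforehand). The out_base helper and the "-clean" intermediate name are made up for illustration; the "-processed" suffix just mirrors the make target above.

```shell
#!/bin/sh
# Rough sketch of the per-page pipeline: unpaper -> grayscale TIFF -> tesseract.
# Assumption: every argument is a PPM page image.

# Hypothetical helper: page.ppm -> page-processed (directory part is dropped).
out_base() {
    printf '%s-processed\n' "$(basename "$1" .ppm)"
}

process_page() {
    page=$1
    base=$(out_base "$page")
    # 1. Tidy up the scan (deskew, remove borders and noise).
    unpaper "$page" "$base-clean.ppm" &&
    # 2. Convert to grayscale, then to TIFF. pamtotiff writes to standard
    #    output, which is redirected to a regular file here.
    ppmtopgm "$base-clean.ppm" | pamtotiff > "$base.tif" &&
    # 3. OCR; the "hocr" config asks tesseract for hOCR output
    #    (text plus position information) instead of plain text.
    tesseract "$base.tif" "$base" hocr
}

# Process every page given on the command line; with no arguments, do nothing.
for page in "$@"; do
    process_page "$page" || exit 1
done
```

Saved as, say, scan-pages.sh (a hypothetical name), it would be run as "sh scan-pages.sh page-01.ppm page-02.ppm". Note that unpaper will not overwrite an existing output file unless told to, so intermediate files need to be removed before a re-run.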
