On Jul 18, 2019, at 14:50, Warner Losh via cctalk <cctalk@classiccmp.org> wrote:
> So, I have a bunch of old DEC Rainbow docs that aren't online. I also have
> a snapscan scanner that I use for bills and such.

I do this kind of thing, with a ScanSnap S1500M (M means Mac), but mostly don't mind that the process is destructive to the books. Really, the only things that survive this process are the things that start out looseleaf, and as I’m trying to get some silverfish food out of my life, most of them get recycled too.

Really there are two scanners involved: a ScanSnap for the text block and a flatbed for the covers. I scan covers and/or dust jackets first, on the flatbed, usually 300dpi color. Sometimes front cover with spine if I think the design is interesting.

Books come apart. Yes, glue bound books get crumbly bindings after 30 years or so and come apart easily. Newer glue bound books come apart less easily because the binding is still gummy and gooey. You will still want a paper cutter or shear to trim about 1/8 inch off the gutter (would perhaps use 4mm if my paper cutter had a metric scale) and make its edge less ragged and less gooey.

The ScanSnap wants to scan a bunch of sheets/pages and make a PDF for you. It can do an automatic post-scan OCR if you let it, and that works well for account statements and other short documents. Its OCR (which I think is a version of ABBYY that Fujitsu/PFUCA licensed for use with the ScanSnap software) is not real good at recognizing multiple columns or tables; it gets the characters but not the layout.

The ScanSnap can also try to figure out whether a page image should be scanned as black-and-white, as grayscale, or as color. If you’re not happy with its choices, you can control this by defining scanning profiles that influence and limit them. So I scan black-and-white text at 400dpi or 600dpi (judgment call). You may find you want to scan one book piecewise so you can use black-and-white for the text-only parts and grayscale or color for the photo plates.

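To put rough numbers on that judgment call, here is a back-of-envelope sketch in Python. It is only an illustration: the US Letter page size, the function name raw_page_bytes, and the byte-padded rows are assumptions, and it computes uncompressed image sizes, not what the ScanSnap software actually writes into a PDF.

```python
def raw_page_bytes(dpi, bits_per_pixel, width_in=8.5, height_in=11.0):
    """Uncompressed size in bytes of one scanned page image.

    Assumes a US Letter page by default and rows padded to whole bytes.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Round each row up to a whole number of bytes.
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


if __name__ == "__main__":
    modes = (("black-and-white", 1), ("grayscale", 8), ("color", 24))
    for dpi in (300, 400, 600):
        for name, bpp in modes:
            megabytes = raw_page_bytes(dpi, bpp) / 1e6
            print(f"{dpi} dpi  {name:16s} {megabytes:7.1f} MB")
```

By this estimate a black-and-white page at 400dpi is about 1.9 MB raw against about 45 MB in color, before any compression, which suggests why it can pay to keep grayscale and color for the plates only.
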
http://bitsavers.org/pdf/hp/portablePlus/45559-90001_Portable_PLUS_Technical_Reference_Manual_Aug1985.pdf is an example of one (a looseleaf manual) that I did with a ScanSnap, and I think I did it all in black-and-white at 400dpi. You can see the holes punched for the three-ring binder. Al would put white over them to hide them, but that's because his scanner yields per-page TIFFs where he can get at that. I have some shell/Perl/netpbm code that does that sort of thing with the file piles his sort of scanner produces, but haven’t got round to writing something to turn a ScanSnap-produced PDF into a bunch of per-page TIFFs.

You can use Adobe Acrobat Pro to gather a bunch of PDFs (and PNGs and TIFFs) into a single PDF, put down page numbers, and put down bookmarks that mirror the table of contents. Eric Smith's tumble can do some of these things, but I also use Acrobat Pro 8 (which was bundled with the S1500M) for OCR. Its OCR is based on something other than ABBYY (I.R.I.S. I think) and does better at multiple columns of text.

I do not expect OCR to be perfect, ever. I hope it will be good enough for me to find things I remember reading, and thus far it has worked reasonably well at that. (This via macOS Spotlight.) What is presented for view in Preview is the page image as scanned, and there is the possibility of re-OCRing the PDF with newer software.

ScanSnap software looked much the same on Windows and macOS, and may yet; I haven’t seen recent versions of the Windows software. There are differences in how they encode page images in PDFs, e.g. on macOS the software will encode a black-and-white scanned page image using a compression that is lossless but doesn't actually compress very well, and I think this is because macOS code is used to construct the PDF. I use an Acrobat Pro “preflight” configuration to convert these to what is basically TIFF G4 encoding, a lossless run-length-based compression that does better at reducing PDF file size.
On Windows, the generated PDF also uses the run-length compression.

-Frank McConnell