An issue dear to my heart - I have quite a quantity of documentation here to scan, so I did quite a bit of homework and testing on this. It may not be exactly your issue, but I hope it helps.

For me, the issue was the quality of the documentation - print quality back in those days was quite variable, of course, and documents may have deteriorated over time.

I find Adobe Acrobat the best tool, using its ClearScan method for OCR. Most OCR tools working with PDF keep the scanned page image and place a searchable text overlay in the document. This can bloat file size and is not the same as ClearScan - retaining reasonable file sizes was one of my criteria. I specifically tested this and found ClearScan documents to be smaller than those produced by other OCR methods. In my experience ClearScan also tended to improve the quality of the type while faithfully preserving it.

For documents that I obtained as PDFs and that ran into trouble being processed like this, I found that exporting the file to TIFF and then creating a new PDF from the TIFFs worked best (doing this is like dry cleaning for PDF).

Downside - Acrobat is probably the most expensive of the PDF tools out there.

This might help explain it a bit better:
https://acrobatusers.com/tutorials/better-pdf-ocr-clearscan-smaller-looks-better/

There are some open-source alternatives that use a similar approach to ClearScan, but I have not specifically tested or evaluated them, viz.:
https://github.com/ncraun/smoothscan

Hope this helps!

Kevin Parker

-----Original Message-----
From: Marc Howard via cctalk <[email protected]>
Sent: Friday, May 12, 2023 2:13 AM
To: General Discussion: On-Topic Posts Only <[email protected]>
Cc: Marc Howard <[email protected]>
Subject: [cctalk] Are there any useful OCR programs for scanning old listings and producing text with proper formatting

Marc Howard <[email protected]> May 10, 2023, 8:58 PM (15 hours ago) to cctalk-owner

I have some listings I want to convert to ASCII. They're line printer output from a computer that existed from the mid-sixties to the early 70's (Agage AGT series).

I can't find any OCR package that can take scanner output (either PDF or JPEG) and convert it to text with roughly the same number of spaces between words as was there originally. Seems like it would be an easy task. The input is non-proportional text from line printer output (actually it might have been printed on a Diablo HyType). And yet all I get is most of the characters, with either no spacing or single spacing between words. And it misses quite a few of the scanned characters at that.

Anyone have any good experiences trying to do this? I've attached a PDF scan if you have a way to do a test run.

Thanks,

Marc Howard
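
On the spacing problem in the question above: because line printer output is fixed pitch, the original column of each word can in principle be recovered from its horizontal position on the page. The following is a rough sketch of that idea only, not something tested on these listings. It assumes your OCR tool can report a pixel position for each recognised word (Tesseract's TSV output, for example, includes `left` and `top` columns), and that you have measured the character width and line height from the scan yourself. The function name `layout_fixed_pitch` and the `(x, y, text)` input format are hypothetical, chosen for illustration.

```python
from typing import Dict, List, Tuple

# One OCR'd word: (left x in pixels, top y in pixels, recognised text).
# This input shape is an assumption; adapt it to whatever your OCR tool emits.
Word = Tuple[float, float, str]


def layout_fixed_pitch(
    words: List[Word],
    char_width: float,
    line_height: float,
    left_margin: float = 0.0,
) -> str:
    """Rebuild monospaced text by snapping each word to a character column."""
    if not words:
        return ""

    # Group words by text row, converting pixel positions to row/column indices.
    rows: Dict[int, List[Tuple[int, str]]] = {}
    for x, y, text in words:
        row = round(y / line_height)
        col = max(0, round((x - left_margin) / char_width))
        rows.setdefault(row, []).append((col, text))

    lines = []
    # Walk every row between the first and last so blank lines are preserved.
    for row in range(min(rows), max(rows) + 1):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            # If a word would overlap the previous one, keep a single space.
            if line and col <= len(line):
                col = len(line) + 1
            line = line.ljust(col) + text
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [(0, 0, "LDA"), (80, 0, "X"), (0, 20, "STA"), (160, 20, "Y")]
    print(layout_fixed_pitch(sample, char_width=10, line_height=20))
```

This is deliberately naive: a skewed scan, or words whose vertical position sits near a row boundary, will land on the wrong line, so deskewing the image first would matter. It also does nothing about the characters the OCR misses outright.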
