An issue dear to my heart - I have quite a quantity of documentation here to 
scan so I did quite a bit of homework and testing on this. It may not exactly 
be your issue, but I hope it helps. 

For me, the issue was the quality of the documentation - print back in those 
days was quite variable of course and over time documents may have deteriorated.

I find Adobe Acrobat the best tool, using its ClearScan method for OCR.  Most 
OCR tools working with PDF place a searchable image or overlay in the document. 
This can bloat file size and is not the same as ClearScan - retaining 
reasonable file sizes was one of my criteria. I specifically tested this and 
found ClearScan documents to have a smaller file size than those OCR processed 
using other methods. In my experience ClearScan also tended to improve the 
quality of the type while faithfully preserving it. 

For documents that I obtained as PDFs that ran into trouble being processed 
like this, I found that exporting the file to TIFF and then creating a new PDF 
from the TIFFs worked best (doing this is like dry cleaning for PDF).
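
For anyone without Acrobat, the same export-to-TIFF-and-rebuild round trip can 
be sketched with command-line tools. This is only a sketch under assumptions: 
it uses Poppler's pdftoppm and the img2pdf tool, neither of which is mentioned 
above, and the function names, the 300 dpi default and the "page" prefix are 
made up for illustration.

```python
import subprocess
from pathlib import Path


def rasterize_command(src_pdf, workdir, dpi=300):
    """Build the pdftoppm call that writes one TIFF per page into workdir."""
    prefix = Path(workdir) / "page"  # hypothetical prefix; files become page-<n>.tif
    return ["pdftoppm", "-tiff", "-r", str(dpi), str(src_pdf), str(prefix)]


def rebuild_command(tiffs, out_pdf):
    """Build the img2pdf call that wraps the TIFFs, in name order, into a new PDF."""
    return ["img2pdf", *(str(t) for t in sorted(tiffs)), "-o", str(out_pdf)]


def dry_clean(src_pdf, out_pdf, workdir, dpi=300):
    """Run both steps; assumes pdftoppm and img2pdf are installed and on the PATH."""
    subprocess.run(rasterize_command(src_pdf, workdir, dpi), check=True)
    tiffs = Path(workdir).glob("page-*.tif")
    subprocess.run(rebuild_command(tiffs, out_pdf), check=True)
```

The rebuilt PDF contains page images only, so OCR has to be run on it again 
afterwards.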

Downside - Acrobat is probably the most expensive of the PDF tools out there.

This might help explain it a bit better:

https://acrobatusers.com/tutorials/better-pdf-ocr-clearscan-smaller-looks-better/

There are some open-source alternatives that use a similar approach to 
ClearScan but I have not specifically tested or evaluated them viz:

https://github.com/ncraun/smoothscan

Hope this helps!



Kevin Parker

-----Original Message-----
From: Marc Howard via cctalk <[email protected]> 
Sent: Friday, May 12, 2023 2:13 AM
To: General Discussion: On-Topic Posts Only <[email protected]>
Cc: Marc Howard <[email protected]>
Subject: [cctalk] Are there any useful OCR programs for scanning old listings 
and producing text with proper formatting

I have some listings I want to convert to ASCII.  They're line printer output 
from a 
computer that existed from the mid-sixties to the early 70's (Adage AGT series).

I can't find any OCR package that can take scanner output (either PDF or
JPEG) and convert it to text with roughly the same number of spaces between 
words as was there originally.

Seems like it would be an easy task.  The input is non-proportional text from 
line printer output (actually it might have been printed on a Diablo HyType).  
And yet all I get is most of the characters with either no or single spacing 
between words.  And it misses quite a few of the scanned characters at that.
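
One possible approach to the spacing problem, assuming the OCR engine can 
report a horizontal position for each word (for example hOCR-style bounding 
boxes), is to place each word by that position rather than trusting the 
engine's own spacing. Because the print is fixed-pitch, the left edge of a word 
divided by the character width gives its column. The function names and input 
format below are hypothetical; this is a sketch, not a tested pipeline.

```python
from statistics import median


def estimate_char_width(boxes):
    """boxes: (x_left, x_right, text) tuples; returns the median width per character."""
    return median((x1 - x0) / len(text) for x0, x1, text in boxes if text)


def layout_line(words, char_width, left_margin=0.0):
    """words: (x_left, text) tuples for one printed line.

    Returns the line with each word placed at the column implied by its x position.
    """
    out = ""
    for x, text in sorted(words):
        col = round((x - left_margin) / char_width)
        # Never overlap or abut the previous word: keep at least one space.
        if out and col <= len(out):
            col = len(out) + 1
        out = out.ljust(col) + text
    return out
```

For example, with a character width of 10 units, words starting at x = 0, 80 
and 160 land in columns 0, 8 and 16.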

Anyone have any good experiences trying to do this?  I've attached a PDF scan 
if you have a way to do a test run.

Thanks,

Marc Howard
