I've been trying for years to get usable scans of old computer listings. Look at the attached file. It was printed with a Diablo HyType and a new ribbon. There are only 64 unique characters (a 6-bit character set) in the entire listing. Courier (non-proportional) font. And yet the OCR results are miserable, even ChatGPT's feeble effort.
Marc

On Wed, Dec 3, 2025 at 2:55 PM Paul Koning via cctalk <[email protected]> wrote:

> > On Dec 3, 2025, at 10:55 AM, Adrian Godwin via cctalk <[email protected]> wrote:
> >
> > I don't think it's the general quality of the patent print that's poor,
> > it's the line-printer listing section from
> > https://www.hp9845.net/9845/downloads/patents/US4089059.pdf starting at
> > about page 213 of the pdf, possibly section 26 of the patent.
> >
> > The print in that section is much paler than the rest - typical of a worn
> > line-printer ribbon. I doubt the printed copy is any better. I'm only
> > trying to OCR the listing, not the rest of the patent.
>
> That's quite a clean listing, actually, cleaner than most I have worked
> with and dramatically better than some. The sort of slightly-damaged
> characters that appear should be no problem at all for the "training"
> feature of ABBYY Fine Reader to deal with. What you'd have to do is run a
> number of pages through it in training mode, so it sees a number of
> variations of the individual characters. And as I mentioned, you'd do all
> the scanning in the mode where it only accepts what it was trained with, no
> "builtin" patterns. That way it won't make up stuff that isn't part of the
> character set but happens to match something built-in, like a
> pound-sterling sign.
>
> It may be that scanning the listing as a table (with the various columns
> as table columns) will work well, and give you the layout explicitly. Or
> it can be scanned as plain text, but in that case the spacing will mostly
> turn into individual spaces and you'd need post-processing to insert tabs
> etc. to make it look right again. Given the simple assembler syntax
> involved that sort of post-processing would not be hard.
>
> paul
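
For what it's worth, the post-processing Paul describes could look roughly like the Python sketch below. It is only an illustration under assumptions not stated in the thread: that the 64-character set is the ASCII range 0x20 to 0x5F, and that each listing line is "label opcode operand comment", with unlabeled lines starting with a space. The names ALLOWED, out_of_set and realign are made up for this example; a real listing would need its own character set and column rules.

```python
# Assumption: the 6-bit character set is ASCII 0x20..0x5F (space, digits,
# punctuation, uppercase letters). Substitute the listing's real set.
ALLOWED = {chr(c) for c in range(0x20, 0x60)}


def out_of_set(line):
    """Return the sorted characters in `line` that fall outside ALLOWED.

    Anything reported here is something the OCR invented (for example a
    pound-sterling sign or a lowercase letter) and needs a manual look.
    """
    return sorted({ch for ch in line if ch not in ALLOWED})


def realign(line):
    """Rebuild tab-separated columns from a line whose spacing was collapsed.

    Hypothetical layout: label, opcode, operand, comment. A line that
    starts with whitespace is treated as having an empty label field.
    """
    if not line.strip():
        return ""
    if line[0].isspace():
        # No label: opcode, operand, then the rest as the comment.
        fields = [""] + line.split(None, 2)
    else:
        # Label present: label, opcode, operand, then the comment.
        fields = line.split(None, 3)
    return "\t".join(fields).rstrip()


if __name__ == "__main__":
    for raw in ["LOOP LDA X,I  LOAD NEXT", " JMP LOOP", "LDA £5"]:
        print(repr(realign(raw)), out_of_set(raw))
```

Running out_of_set over the whole OCR output first gives a quick list of lines worth checking by hand before any realignment is attempted.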
