> On Jan 23, 2022, at 12:09 PM, Gavin Scott <[email protected]> wrote:
>
> On Sun, Jan 23, 2022 at 9:11 AM Paul Koning via cctalk
> <[email protected]> wrote:
>> One consideration is the effort required to repair transcription errors.
>> Those that produce syntax errors aren't such an issue;
>> those that pass the assembler or compiler but result in bugs (say, a
>> mistyped register number) are harder to find.
>
> You can always have it "turked" twice and compare the results.
>
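
That compare step can be sketched in a few lines of Python (the inputs
and names here are hypothetical, and real transcriptions would need
handling for mismatched line counts):

```python
def diff_transcriptions(text_a, text_b):
    """Yield (line_no, col, char_a, char_b) wherever two independently
    keyed transcriptions disagree, so a human only rechecks those spots."""
    for line_no, (a, b) in enumerate(
            zip(text_a.splitlines(), text_b.splitlines()), start=1):
        # Pad the shorter line so trailing differences are caught too.
        width = max(len(a), len(b))
        a, b = a.ljust(width), b.ljust(width)
        for col, (ca, cb) in enumerate(zip(a, b), start=1):
            if ca != cb:
                yield line_no, col, ca, cb

pass_1 = "LDA  0,X\nSTA  5,Y"
pass_2 = "LDA  O,X\nSTA  5,Y"   # '0' misread as 'O' in one pass
for hit in diff_transcriptions(pass_1, pass_2):
    print("line %d, col %d: %r vs %r" % hit)
```

Characters where both passes agree are far less likely to be wrong, so
only the flagged positions need a human eyeball.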
> This is also the sort of problem that modern Deep Machine Learning
> will just crush. Identifying individual characters should be trivial,
> you just have to figure out where the characters are first which could
> also be done with ML or you could try to do it some other way (with a
> really well registered scan maybe if it's all fixed-width characters).

Maybe. But OCR programs have had learning features for decades. I've
spent quite a lot of time in FineReader's learning mode. Material
produced on a moderate-quality typewriter, like the CDC 6600 wire lists
on Bitsavers, can be handled tolerably well, especially with
post-processing that knows what the text patterns should be and maps
common misreadings back to the intended text.
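That post-processing can be sketched like this (the column range and
confusion pairs below are illustrative assumptions, not the actual CDC
listing format; a real table would come from the misreadings this
particular typeface actually produces):

```python
def fix_octal_field(line):
    """Map letter lookalikes back to digits in a field known to hold
    only octal digits. Assumes (hypothetically) the field occupies
    columns 10-15, 1-based."""
    lookalikes = str.maketrans({"O": "0", "l": "1", "S": "5", "B": "8"})
    return line[:9] + line[9:15].translate(lookalikes) + line[15:]

print(fix_octal_field("WIRE 042 OOl7S3 PANEL"))  # -> WIRE 042 001753 PANEL
```

Because the substitution is confined to a field where letters cannot
legitimately occur, it can't damage correct text elsewhere on the line.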
But the listings I mentioned before were entirely unmanageable even after a
lot of "learning mode" effort. An annoying wrinkle was that I wasn't dealing
with greenbar but rather with Dutch line printer paper that has every other
line marked with 5 thin horizontal lines, almost like music score paper. Faded
printout with a worn ribbon on a substrate like that is a challenge even for
human eyeballs, and all the "machine learning" hype can't conceal the fact that
no machine can come anywhere close to a human for dealing with image
recognition under tough conditions.

That said, if you have access to a particularly good OCR, it can't hurt to
spend a few hours trying to make it cope with the source material in question.
But be prepared for disappointment.
paul