> On Jan 23, 2022, at 12:09 PM, Gavin Scott <[email protected]> wrote:
> 
> On Sun, Jan 23, 2022 at 9:11 AM Paul Koning via cctalk
> <[email protected]> wrote:
>> One consideration is the effort required to repair transcription errors.  
>> Those that produce syntax errors aren't such an issue;
>> those that pass the assembler or compiler but result in bugs (say, a 
>> mistyped register number) are harder to find.
> 
> You can always have it "turked" twice and compare the results.
> 
> This is also the sort of problem that modern Deep Machine Learning
> will just crush.  Identifying individual characters should be trivial;
> you just have to figure out where the characters are first, which could
> also be done with ML, or you could try some other way (with a really
> well-registered scan, maybe, if it's all fixed-width characters).

Maybe.  But OCR programs have had learning features for decades.  I've spent 
quite a lot of time in FineReader learning mode.  Material produced on a 
moderate-quality typewriter, like the CDC 6600 wire lists on Bitsavers, can be 
handled tolerably well, especially with post-processing that knows what the 
text patterns should look like and corrects common misreadings accordingly.  
But the listings I mentioned before were entirely unmanageable even after a 
lot of "learning mode" effort.  An annoying wrinkle was that I wasn't dealing 
with greenbar but rather with Dutch line printer paper that has every other 
line marked with 5 thin horizontal lines, almost like music score paper.  Faded 
printout with a worn ribbon on a substrate like that is a challenge even for 
human eyeballs, and all the "machine learning" hype can't conceal the fact 
that no machine comes anywhere close to a human at image recognition under 
conditions that tough.
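
That kind of pattern-aware post-processing can be as simple as a table of common OCR confusions applied only where the expected format rules the misreading out; in a numeric field of an assembly listing, a letter O or l is almost certainly a misread digit.  A rough sketch (the confusion table and the "token containing a digit" rule are illustrative assumptions, not taken from any particular OCR package):

```python
import re

# Common OCR character confusions when context demands a digit.
DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1",
                             "S": "5", "B": "8"})

def fix_numeric_fields(line):
    """Apply digit-only corrections inside tokens that already contain
    at least one digit, leaving ordinary text (mnemonics, comments)
    untouched."""
    def repair(match):
        return match.group(0).translate(DIGIT_FIXES)
    # A "numeric field" here: any word token containing a digit.
    return re.sub(r"\b\w*\d\w*\b", repair, line)
```

For example, `fix_numeric_fields("LDA R1O")` yields `"LDA R10"`, while a digit-free token like `BOOT` passes through unchanged.  Real listings would need rules tuned to the column layout and the machine's register/opcode syntax.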

That said, if you have access to a particularly good OCR, it can't hurt to 
spend a few hours trying to make it cope with the source material in question.  
But be prepared for disappointment.

        paul

