On Sun, Jan 23, 2022 at 9:11 AM Paul Koning via cctalk <[email protected]> wrote:
> One consideration is the effort required to repair transcription errors.
> Those that produce syntax errors aren't such an issue;
> those that pass the assembler or compiler but result in bugs (say, a mistyped
> register number) are harder to find.
You can always have it "turked" twice and compare the results. This is also the sort of problem that modern deep machine learning will just crush. Identifying individual characters should be trivial; you just have to figure out where the characters are first, which could also be done with ML, or you could try to do it some other way (a really well-registered scan might be enough if it's all fixed-width characters).

I think if I had a whole lot of old faded greenbar etc., I would consider manually converting a few pages, then setting up a Kaggle competition for it, and maybe investing a bit of money as a prize. Someone may even have done this already (there have certainly been a number of "OCR historical documents" competitions), but I didn't spend much time searching. I'm sure you're not the only one who has had this problem to solve.
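The double-entry ("turked twice") idea is easy to mechanize: diff the two independent transcriptions and only hand-check the lines where they disagree, which is exactly where a mistyped register number would show up. A minimal sketch using Python's standard difflib (the filenames "pass1"/"pass2" and the sample listing are just illustrative):

```python
import difflib

def transcription_diff(a: str, b: str) -> list[str]:
    """Return unified-diff lines showing where two transcriptions disagree."""
    return list(difflib.unified_diff(
        a.splitlines(), b.splitlines(),
        fromfile="pass1", tofile="pass2", lineterm=""))

# Two independent transcriptions of the same listing; the second
# transcriber mistyped a register number on the ADD line.
pass1 = "MOV R1, R2\nADD R3, R4\n"
pass2 = "MOV R1, R2\nADD R3, R5\n"

for line in transcription_diff(pass1, pass2):
    print(line)
```

Any line the two passes agree on is very unlikely to contain the same typo twice, so the review effort collapses to the diff output.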
