Re: Typing in lost code
I've tried to OCR old Fortran code from DTIC PDF documents. There were two big problems:

1. The copies are very poor to start with, and all OCR attempts produced about a 75% error rate.
2. Old Fortran limited variable names to 6 characters, so they were generally not descriptive of what they represented.

Some characters in the Fortran variables were sometimes missing in the printout, which made recovery nearly impossible. I hope that gov't contracts now require code to be archived electronically for posterity. Probably will never happen.

Doug

On 1/22/2022 8:06 PM, Ethan O'Toole via cctalk wrote:
> Can the listings be OCR'ed?
>
> - Ethan
>
>> Has anyone ever used Amazon Mechanical Turk to employ typists to type in
>> old listings of lost code?
>>
>> Asking for a friend.
>
> --
> : Ethan O'Toole
Re: Typing in lost code
> That's true generally. Anything other than actual photographs
> (continuous tone images) should NOT be run through JPEG because JPEG
> is not intended for, and unfit for, anything else. Printouts, line
> drawings, and anything else with crisp edges between dark and light
> will be messed up by JPEG. PNG and TIFF are examples of appropriate
> compression schemes.

TIFF actually isn't a compression scheme; it's a tagged file format, and one _can_ specify JPEG compression of images in a TIFF file. Perhaps it would be better to say one should avoid _lossy_ compression schemes on scans with crisp edges or large areas of solid color. These are areas where JPEG will add visible noise.

De
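The lossless-versus-lossy distinction above can be demonstrated with DEFLATE, the compression scheme PNG uses internally: a lossless round trip returns the data bit-for-bit, so a crisp black/white edge survives untouched. A minimal sketch; the synthetic "scan line" is an illustrative stand-in for real image data:

```python
import zlib

# A synthetic one-bit "scan line": solid white, a hard black edge, solid
# white again. This stands in for the crisp-edged material under discussion.
scan_line = bytes([0] * 40 + [255] * 8 + [0] * 40)

# Lossless compression (DEFLATE, the scheme inside PNG) round-trips exactly,
# so no noise is ever added around the edge. JPEG makes no such guarantee.
packed = zlib.compress(scan_line, level=9)
restored = zlib.decompress(packed)

print(restored == scan_line)          # → True: bit-for-bit identical
print(len(packed) < len(scan_line))   # → True: solid runs compress well
```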
Re: Typing in lost code
> On Jan 24, 2022, at 5:57 PM, ben via cctalk wrote:
>
>> ...
>
> Document source is also a problem. You would want to scan it at the
> best data format, not something in a lossy format.

That's true generally. Anything other than actual photographs (continuous tone images) should NOT be run through JPEG because JPEG is not intended for, and unfit for, anything else. Printouts, line drawings, and anything else with crisp edges between dark and light will be messed up by JPEG. PNG and TIFF are examples of appropriate compression schemes.

paul
Re: Typing in lost code
On 2022-01-23 12:47 p.m., Chuck Guzis via cctalk wrote:
> On 1/23/22 10:16, Paul Koning via cctalk wrote:
>> Maybe. But OCR programs have had learning features for decades. I've
>> spent quite a lot of time in FineReader learning mode. Material produced
>> on a moderate-quality typewriter, like the CDC 6600 wire lists on
>> Bitsavers, can be handled tolerably well, especially with post-processing
>> that knows what the text patterns should be and converts common
>> misreadings to what they should be. But the listings I mentioned before
>> were entirely unmanageable even after a lot of "learning mode" effort. An
>> annoying wrinkle was that I wasn't dealing with greenbar but rather with
>> Dutch line printer paper that has every other line marked with 5 thin
>> horizontal lines, almost like music score paper. Faded printout with a
>> worn ribbon on a substrate like that is a challenge even for human
>> eyeballs, and all the "machine learning" hype can't conceal the fact that
>> no machine can come anywhere close to a human for dealing with image
>> recognition under tough conditions.
>
> The problem is that OCR needs to be 100% accurate for many purposes.
> Anything short of that requires that the result be inspected by hand,
> line by line, with knowledge of what makes sense. Mistaking a single
> fuzzy 8 for a 6 or a 3, for example, can render code inoperative with a
> very difficult-to-locate bug. Perhaps an AI might be programmed to
> separate out the nonsense typos.
>
> Old high-speed line printers weren't always wonderful with timing the
> hammer strikes. I recall some nearly impossible-to-read Univac 1108
> engineering documents, printed on a drum printer. Gave me headaches.
>
> At least that's my take.
>
> --Chuck

Document source is also a problem. You would want to scan it at the best data format, not something in a lossy format.

Ben.
Re: Typing in lost code
Sorry about the double negative. I was in a hurry, as I was supposed to drive over the hill to Santa Cruz for a couple of hours.

"It is unlikely that no current day OCR will produce an error free listing."

Should have read:

"It is unlikely that any current day OCR will produce an error free listing."

I agree with Chuck. A computer code listing cannot tolerate a single mistake in a number.

I recall recovering data from cassette tapes where the tape stuck to the capstan and got folds. Most of the code was in BASIC, so it had quite a bit of redundancy in the program flow. Luckily, there were few damaged segments with numeric values. The tapes had checksums that helped quite a bit. It is not so with typed listings.

As I stated, the 4004 code I recovered would have been lost if I'd not understood its purpose and run the simulation of the code, stopping to see what alternate values did to the execution. It was over 3K of code, quite a bit for a 4004, intended to be loaded into 13 1702A EPROMs. There were over 30 points in the code that needed to be resolved.

Dwight

________________________________
From: cctalk on behalf of dwight via cctalk
Sent: Sunday, January 23, 2022 10:06 AM
To: cctalk@classiccmp.org
Subject: Re: Typing in lost code

It is unlikely that any current day OCR will produce an error free listing. It is possible to train an AI to do this, but it requires specific training: it must be on the specific machine code and on the same format. Any generic OCR will have many errors if the text is hard to read. The final product must include notes as to things it is not sure about, or it would be useless.

I recovered a listing for the 4004 processor that was printed on an ASR33 with ruts on the platen. The right-hand 1/4 of letters was missing at several locations across the page. Letters such as F and P, as well as 0 and C, were often not printed well enough to distinguish. Luckily, F and P were often relatively easy to determine from context, but 0 and C were often used in a hex number. Unlike the text on this page, the differences were not always obvious.

The final result in working code required noting which things were possibly one or the other. The only way to determine most of these was by using a simulation of the code. In most cases of 0 vs C it was a 0, as these were for initializing a pointer base number (context of usage). In one case, it was only through the simulation that I was able to determine that it was really CC and not 00. Marking locations of uncertainty was essential to determine where to check the program code context. Any OCR that doesn't include possible options, and that isn't trained on that particular code, is worthless.

Dwight

________________________________
From: cctalk on behalf of Noel Chiappa via cctalk
Sent: Sunday, January 23, 2022 9:31 AM
To: cctalk@classiccmp.org
Cc: j...@mercury.lcs.mit.edu
Subject: Re: Typing in lost code

> From: Gavin Scott
>
> I think if I had a whole lot of old faded greenbar etc. ... Someone may
> even have done this already

See: https://walden-family.com/impcode/imp-code.pdf

Someone's already done the specialist OCR to deal with faded program listings.

Noel
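Dwight's bookkeeping, marking each uncertain character with its possible readings and then letting the simulator rule candidates out, can be sketched mechanically. The fragment below is invented for illustration (it is not his actual 4004 listing), and the simulation step is left out:

```python
from itertools import product

# Each position in the transcription holds the set of characters it might
# be; an unambiguous position is a one-character set. Expanding the sets
# yields every full reading that must be tried (in Dwight's case, by
# running each through a 4004 simulator and watching what it does).

def expand(ambiguous):
    """Yield every full reading of a list of per-position candidate sets."""
    for combo in product(*ambiguous):
        yield "".join(combo)

# A hex byte whose last two digits each read as 0-or-C on the faded page:
fragment = ["2", "0C", "0C"]
readings = list(expand(fragment))
print(readings)   # → ['200', '20C', '2C0', '2CC']
```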
Re: Typing in lost code
On 1/23/22 10:16, Paul Koning via cctalk wrote:
> Maybe. But OCR programs have had learning features for decades. I've spent
> quite a lot of time in FineReader learning mode. Material produced on a
> moderate-quality typewriter, like the CDC 6600 wire lists on Bitsavers, can
> be handled tolerably well, especially with post-processing that knows what
> the text patterns should be and converts common misreadings to what they
> should be. But the listings I mentioned before were entirely unmanageable
> even after a lot of "learning mode" effort. An annoying wrinkle was that I
> wasn't dealing with greenbar but rather with Dutch line printer paper that
> has every other line marked with 5 thin horizontal lines, almost like music
> score paper. Faded printout with a worn ribbon on a substrate like that is a
> challenge even for human eyeballs, and all the "machine learning" hype can't
> conceal the fact that no machine can come anywhere close to a human for
> dealing with image recognition under tough conditions.

The problem is that OCR needs to be 100% accurate for many purposes. Anything short of that requires that the result be inspected by hand, line by line, with knowledge of what makes sense. Mistaking a single fuzzy 8 for a 6 or a 3, for example, can render code inoperative with a very difficult-to-locate bug. Perhaps an AI might be programmed to separate out the nonsense typos.

Old high-speed line printers weren't always wonderful with timing the hammer strikes. I recall some nearly impossible-to-read Univac 1108 engineering documents, printed on a drum printer. Gave me headaches.

At least that's my take.

--Chuck
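The post-processing Paul describes, converting common misreadings when you know what the text pattern should be, can be quite simple for listings with fixed columns. A sketch, assuming a hypothetical format whose first six columns are an octal address (the column layout and the confusion set are illustrative, not from any particular listing):

```python
# Common OCR confusions inside a digits-only field can be repaired
# unconditionally: in an octal address column, "O" and "o" can only be
# "0", "l" and "I" can only be "1", and "S" can only be "5".
CONFUSIONS = str.maketrans("OolIS", "00115")

def fix_address_field(line: str) -> str:
    """Repair OCR confusions in the (assumed) 6-column octal address field."""
    addr, rest = line[:6], line[6:]
    return addr.translate(CONFUSIONS) + rest

print(fix_address_field("0O17l4  LDA BUFFER"))   # → "001714  LDA BUFFER"
```

The same idea extends to any field whose legal alphabet is known: opcode mnemonics, register names, or line numbers each get their own confusion map.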
Re: Typing in lost code
Noel Chiappa wrote:
> https://walden-family.com/impcode/imp-code.pdf
>
> Someone's already done the specialist OCR to deal with faded program
> listings.

I tried to contact the author about converting some of the other IMP listings, but got no reply.
Re: Typing in lost code
> On Jan 23, 2022, at 12:09 PM, Gavin Scott wrote:
>
> On Sun, Jan 23, 2022 at 9:11 AM Paul Koning via cctalk wrote:
>> One consideration is the effort required to repair transcription errors.
>> Those that produce syntax errors aren't such an issue;
>> those that pass the assembler or compiler but result in bugs (say, a
>> mistyped register number) are harder to find.
>
> You can always have it "turked" twice and compare the results.
>
> This is also the sort of problem that modern Deep Machine Learning
> will just crush. Identifying individual characters should be trivial,
> you just have to figure out where the characters are first, which could
> also be done with ML, or you could try to do it some other way (with a
> really well registered scan, maybe, if it's all fixed-width characters).

Maybe. But OCR programs have had learning features for decades. I've spent quite a lot of time in FineReader learning mode. Material produced on a moderate-quality typewriter, like the CDC 6600 wire lists on Bitsavers, can be handled tolerably well, especially with post-processing that knows what the text patterns should be and converts common misreadings to what they should be. But the listings I mentioned before were entirely unmanageable even after a lot of "learning mode" effort. An annoying wrinkle was that I wasn't dealing with greenbar but rather with Dutch line printer paper that has every other line marked with 5 thin horizontal lines, almost like music score paper. Faded printout with a worn ribbon on a substrate like that is a challenge even for human eyeballs, and all the "machine learning" hype can't conceal the fact that no machine can come anywhere close to a human for dealing with image recognition under tough conditions.

That said, if you have access to a particularly good OCR, it can't hurt to spend a few hours trying to make it cope with the source material in question. But be prepared for disappointment.

paul
Re: Typing in lost code
It is unlikely that any current day OCR will produce an error free listing. It is possible to train an AI to do this, but it requires specific training: it must be on the specific machine code and on the same format. Any generic OCR will have many errors if the text is hard to read. The final product must include notes as to things it is not sure about, or it would be useless.

I recovered a listing for the 4004 processor that was printed on an ASR33 with ruts on the platen. The right-hand 1/4 of letters was missing at several locations across the page. Letters such as F and P, as well as 0 and C, were often not printed well enough to distinguish. Luckily, F and P were often relatively easy to determine from context, but 0 and C were often used in a hex number. Unlike the text on this page, the differences were not always obvious.

The final result in working code required noting which things were possibly one or the other. The only way to determine most of these was by using a simulation of the code. In most cases of 0 vs C it was a 0, as these were for initializing a pointer base number (context of usage). In one case, it was only through the simulation that I was able to determine that it was really CC and not 00. Marking locations of uncertainty was essential to determine where to check the program code context. Any OCR that doesn't include possible options, and that isn't trained on that particular code, is worthless.

Dwight

________________________________
From: cctalk on behalf of Noel Chiappa via cctalk
Sent: Sunday, January 23, 2022 9:31 AM
To: cctalk@classiccmp.org
Cc: j...@mercury.lcs.mit.edu
Subject: Re: Typing in lost code

> From: Gavin Scott
>
> I think if I had a whole lot of old faded greenbar etc. ... Someone may
> even have done this already

See: https://walden-family.com/impcode/imp-code.pdf

Someone's already done the specialist OCR to deal with faded program listings.

Noel
Re: Typing in lost code
On Sun, Jan 23, 2022 at 11:31 AM Noel Chiappa via cctalk wrote:
> See:
>
> https://walden-family.com/impcode/imp-code.pdf
>
> Someone's already done the specialist OCR to deal with faded program listings.

Neat. Though the complex character recognition part of that work is now like 15-20 lines of Python code (using either Keras or PyTorch).
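For readers without a deep learning framework handy, the per-character classification step can be illustrated at toy scale: once a well-registered scan is cut into fixed-width cells, classifying a cell amounts to finding the nearest known glyph. A real CNN in Keras or PyTorch does the same job far more robustly on noisy input; the 3x3 glyphs below are invented purely for illustration:

```python
# Hypothetical 3x3 binarized glyph templates, one per known character.
TEMPLATES = {
    "0": (1, 1, 1,
          1, 0, 1,
          1, 1, 1),
    "1": (0, 1, 0,
          0, 1, 0,
          0, 1, 0),
}

def classify(cell):
    """Return the template label with the fewest differing pixels."""
    def distance(label):
        return sum(a != b for a, b in zip(cell, TEMPLATES[label]))
    return min(TEMPLATES, key=distance)

noisy_one = (0, 1, 0,
             0, 1, 1,   # a "1" with one flipped pixel
             0, 1, 0)
print(classify(noisy_one))   # → "1"
```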
Re: Typing in lost code
> From: Gavin Scott
>
> I think if I had a whole lot of old faded greenbar etc. ... Someone may
> even have done this already

See: https://walden-family.com/impcode/imp-code.pdf

Someone's already done the specialist OCR to deal with faded program listings.

Noel
Re: Typing in lost code
On Sun, Jan 23, 2022 at 9:11 AM Paul Koning via cctalk wrote:
> One consideration is the effort required to repair transcription errors.
> Those that produce syntax errors aren't such an issue;
> those that pass the assembler or compiler but result in bugs (say, a
> mistyped register number) are harder to find.

You can always have it "turked" twice and compare the results.

This is also the sort of problem that modern Deep Machine Learning will just crush. Identifying individual characters should be trivial; you just have to figure out where the characters are first, which could also be done with ML, or you could try to do it some other way (with a really well registered scan, maybe, if it's all fixed-width characters).

I think if I had a whole lot of old faded greenbar etc., I would consider manually converting a few pages, then setting up a Kaggle competition for it, and maybe investing a bit of money as a prize. Someone may even have done this already (there have certainly been a number of "OCR historical documents" competitions), but I didn't spend too much time searching. I'm sure you're not the only one who has had this problem to solve.
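The "turked twice" idea works because two independent typists rarely make the same mistake at the same spot, so a character-level diff pinpoints exactly where a human must consult the original page. A minimal sketch; the two sample transcriptions are invented:

```python
# Compare two independent transcriptions of the same listing and report
# every position where they disagree, so only those spots need a second
# look at the original page.

def diff_transcriptions(a: str, b: str):
    """Return (line, column, char_a, char_b) for each disagreement."""
    disagreements = []
    for ln, (la, lb) in enumerate(zip(a.splitlines(), b.splitlines()), 1):
        # Pad the shorter line so trailing characters are also compared.
        width = max(len(la), len(lb))
        la, lb = la.ljust(width), lb.ljust(width)
        for col, (ca, cb) in enumerate(zip(la, lb), 1):
            if ca != cb:
                disagreements.append((ln, col, ca, cb))
    return disagreements

typist_1 = "MOV A,B\nJMP 0C00H"
typist_2 = "MOV A,8\nJMP OC00H"
print(diff_transcriptions(typist_1, typist_2))
# → [(1, 7, 'B', '8'), (2, 5, '0', 'O')]
```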
Re: Typing in lost code
I recently dealt with this with the DaJen SCI monitor listing out of the manual. The copy is pretty bad, and either their printer was having issues or the slashing of "zero" vs "O" was somehow inconsistent. OCRing it produced more of a mess than just sitting with the original and a text editor open side by side. I can't imagine it would've worked out well to have someone who wasn't familiar with 8080 assembly language transcribe it; I had a rough enough time on my own, and ended up having to compare the assembly output to a known-good ROM dump to get the last of the discrepancies out.

Thanks,
Jonathan

‐‐‐ Original Message ‐‐‐
On Sunday, January 23rd, 2022 at 10:11, Paul Koning via cctalk wrote:

> I've run into that situation too, with listings so difficult that even a
> commercial OCR program (FineReader) couldn't handle it. At the time
> Tesseract was far less capable, though I haven't tried it recently to see
> if that has changed.
>
> Anyway, my experience was that the task was hard enough that it needed
> someone with knowledge of the material. It may be that a contract typist
> could do a tolerable job, but I have my doubts. Typing, say, an obsolete
> assembly language program, if you see it merely as a random collection of
> characters, is going to produce more errors than if the person doing the
> typing actually understands what the material means.
>
> One consideration is the effort required to repair transcription errors.
> Those that produce syntax errors aren't such an issue; those that pass the
> assembler or compiler but result in bugs (say, a mistyped register number)
> are harder to find.
>
> paul
>
>> On Jan 22, 2022, at 8:57 PM, Mark Kahrs via cctalk cctalk@classiccmp.org
>> wrote:
>>
>> No, OCR totally fails on olde line printer listings. At least the ones
>> I've tried (tesseract, online, ...)
>>
>> On Sat, Jan 22, 2022 at 8:06 PM Ethan O'Toole et...@757.org wrote:
>>
>>> Can the listings be OCR'ed?
>>>
>>> - Ethan
>>>
>>>> Has anyone ever used Amazon Mechanical Turk to employ typists to type in
>>>> old listings of lost code?
>>>>
>>>> Asking for a friend.
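Jonathan's last-resort check, comparing the assembled output against a known-good ROM dump, amounts to a byte-level diff of two binary images. A sketch; the sample bytes are arbitrary 8080-style opcodes, not the actual SCI monitor:

```python
# Diff an assembled binary against a known-good ROM dump, reporting the
# offset of every mismatching byte so the listing can be re-checked there.

def rom_diff(assembled: bytes, rom: bytes):
    """Return (offset, assembled_byte, rom_byte) for every mismatch."""
    if len(assembled) != len(rom):
        raise ValueError("images differ in length; check ORG and padding first")
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(assembled, rom))
            if a != b]

# MVI A,01h / JMP... vs a dump where the immediate byte is really 00h:
print(rom_diff(b"\x3e\x01\xc3", b"\x3e\x00\xc3"))   # → [(1, 1, 0)]
```

Each reported offset sends you back to the corresponding line of the listing, which is far faster than re-proofreading the whole transcription.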
Re: Typing in lost code
I've run into that situation too, with listings so difficult that even a commercial OCR program (FineReader) couldn't handle it. At the time Tesseract was far less capable, though I haven't tried it recently to see if that has changed.

Anyway, my experience was that the task was hard enough that it needed someone with knowledge of the material. It may be that a contract typist could do a tolerable job, but I have my doubts. Typing, say, an obsolete assembly language program, if you see it merely as a random collection of characters, is going to produce more errors than if the person doing the typing actually understands what the material means.

One consideration is the effort required to repair transcription errors. Those that produce syntax errors aren't such an issue; those that pass the assembler or compiler but result in bugs (say, a mistyped register number) are harder to find.

paul

> On Jan 22, 2022, at 8:57 PM, Mark Kahrs via cctalk wrote:
>
> No, OCR totally fails on olde line printer listings. At least the ones
> I've tried (tesseract, online, ...)
>
> On Sat, Jan 22, 2022 at 8:06 PM Ethan O'Toole wrote:
>
>> Can the listings be OCR'ed?
>>
>> - Ethan
>>
>>> Has anyone ever used Amazon Mechanical Turk to employ typists to type in
>>> old listings of lost code?
>>>
>>> Asking for a friend.
Re: Typing in lost code
No, OCR totally fails on olde line printer listings. At least the ones I've tried (tesseract, online, ...)

On Sat, Jan 22, 2022 at 8:06 PM Ethan O'Toole wrote:

> Can the listings be OCR'ed?
>
> - Ethan
>
>> Has anyone ever used Amazon Mechanical Turk to employ typists to type in
>> old listings of lost code?
>>
>> Asking for a friend.
>>
>> --
>> : Ethan O'Toole
Re: Typing in lost code
Can the listings be OCR'ed?

- Ethan

> Has anyone ever used Amazon Mechanical Turk to employ typists to type in
> old listings of lost code?
>
> Asking for a friend.

--
: Ethan O'Toole
Typing in lost code
Has anyone ever used Amazon Mechanical Turk to employ typists to type in old listings of lost code? Asking for a friend.