Re: Typing in lost code

2022-01-24 Thread Douglas Taylor via cctalk
I've tried to OCR old Fortran code from DTIC PDF documents.  There were
two big problems:


1. The copies are very poor to start with, and all OCR attempts produced
roughly a 75% error rate.
2. Old Fortran limited variable names to 6 characters, so they were
generally not descriptive of what they represented.  When characters in
those variable names were missing from the printout, recovery was nearly
impossible.


I hope that gov't contracts now require code to be archived
electronically for posterity.  It will probably never happen.


Doug

On 1/22/2022 8:06 PM, Ethan O'Toole via cctalk wrote:


Can the listings be OCR'ed?

    - Ethan



Has anyone ever used Amazon Mechanical Turk to employ typists to type in
old listings of lost code?

Asking for a friend.



--
: Ethan O'Toole






Re: Typing in lost code

2022-01-24 Thread Dennis Boone via cctalk
 > That's true generally.  Anything other than actual photographs
 > (continuous tone images) should NOT be run through JPEG because JPEG
 > is not intended for, and unfit for, anything else.  Printouts, line
 > drawings, and anything else with crisp edges between dark and light
 > will be messed up by JPEG.  PNG and TIFF are examples of appropriate
 > compression schemes.

TIFF actually isn't a compression scheme; it's a tagged file format, and
one _can_ specify JPEG compression for the images inside a TIFF file.

Perhaps it would be better to say one should avoid _lossy_ compression
schemes on scans with crisp edges or large areas of solid color.  Those
are exactly the places where JPEG will add visible noise.
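
To make the point concrete, here is a minimal sketch using Pillow; the
filenames and "a scanned listing page" are assumptions, and this is an
illustration of the container-vs-codec distinction, not anyone's actual
archival workflow:

    from PIL import Image

    scan = Image.open("listing_page_042.tif")   # hypothetical scanned page

    # Lossless LZW inside a TIFF container -- safe for text and line art
    scan.save("page_042_lzw.tif", compression="tiff_lzw")

    # Same TIFF container, but JPEG-compressed -- lossy, smears crisp edges
    scan.save("page_042_jpeg.tif", compression="jpeg")

    # PNG is always lossless
    scan.save("page_042.png")

The container format and the compression choice are separate decisions; it
is the compression choice that matters for scans of printouts.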

De


Re: Typing in lost code

2022-01-24 Thread Paul Koning via cctalk



> On Jan 24, 2022, at 5:57 PM, ben via cctalk  wrote:
> 
>> ...
> Document source is also a problem.
> You would want to scan it in the best data format,
> not something in a lossy format.

That's true generally.  Anything other than actual photographs (continuous tone 
images) should NOT be run through JPEG because JPEG is not intended for, and 
unfit for, anything else.  Printouts, line drawings, and anything else with 
crisp edges between dark and light will be messed up by JPEG.  PNG and TIFF are 
examples of appropriate compression schemes.

paul




Re: Typing in lost code

2022-01-24 Thread ben via cctalk

On 2022-01-23 12:47 p.m., Chuck Guzis via cctalk wrote:

On 1/23/22 10:16, Paul Koning via cctalk wrote:



>> Maybe.  But OCR programs have had learning features for decades.  I've
>> spent quite a lot of time in FineReader learning mode.  Material produced
>> on a moderate-quality typewriter, like the CDC 6600 wire lists on
>> Bitsavers, can be handled tolerably well.  Especially with
>> post-processing that knows what the text patterns should be and converts
>> common misreadings to what they should be.  But the listings I mentioned
>> before were entirely unmanageable even after a lot of "learning mode"
>> effort.  An annoying wrinkle was that I wasn't dealing with greenbar but
>> rather with Dutch line printer paper that has every other line marked
>> with 5 thin horizontal lines, almost like music score paper.  Faded
>> printout with a worn ribbon on a substrate like that is a challenge even
>> for human eyeballs, and all the "machine learning" hype can't conceal the
>> fact that no machine can come anywhere close to a human for dealing with
>> image recognition under tough conditions.

> The problem is that OCR needs to be 100% accuracy for many purposes.
> Much short of that requires that the result be inspected by hand
> line-by-line with the knowledge of what makes sense.  Mistaking a
> single fuzzy 8 for a 6 or a 3, for example can render code inoperative
> with a very difficult to locate bug.  Perhaps an AI might be programmed
> to separate out the nonsense typos.
>
> Old high-speed line printers weren't always wonderful with timing the
> hammer strikes.  I recall some nearly impossible to read Univac 1108
> engineering documents, printed on a drum printer.  Gave me headaches.
>
> At least that's my take.
>
> --Chuck


Document source is also a problem.
You would want to scan it in the best data format,
not something in a lossy format.
Ben.





Re: Typing in lost code

2022-01-23 Thread dwight via cctalk
Sorry about the double negative.
I was in a hurry as I was supposed to drive over the hill to Santa Cruz for a 
couple hours.

"It is unlikely that no current day OCR will produce an error free listing."
Should have read:
"It is unlikely that any current day OCR will produce an error free listing."

I agree with Chuck. A computer code listing cannot tolerate a single mistake in
a number. I recall recovering data from cassette tapes where the tape stuck to
the capstan and got folded. Most of the code was in BASIC, so it had quite a bit
of redundancy in the program flow. Luckily, there were few damaged segments with
numeric values. The tapes had checksums that helped quite a bit. That is not so
with typed listings.
As I stated, the 4004 code I recovered would have been lost if I'd not
understood its purpose and run a simulation of the code, stopping to see what
alternate values did to its execution. It was over 3K of code, quite a bit for a
4004, intended to be loaded into 13 1702A EPROMs. There were over 30 points in
the code that needed to be resolved.
Dwight


From: cctalk  on behalf of dwight via cctalk 

Sent: Sunday, January 23, 2022 10:06 AM
To: cctalk@classiccmp.org 
Subject: Re: Typing in lost code

It is unlikely that no current day OCR will produce an error free listing.
It is possible to train an AI to do this, but it requires specific training: it
must be trained on the specific machine code and on the same listing format. Any
generic OCR will have many errors if the text is hard to read.
The final product must include notes about the things it is not sure of, or it
would be useless. I recovered a listing for the 4004 processor that was printed
on an ASR33 with ruts in the platen. The right-hand quarter of the letters was
missing at several locations across the page. Letters such as F and P, as well
as 0 and C, were often not printed well enough to distinguish.
Luckily, F and P were often relatively easy to determine from context, but 0 and
C were often used in a hex number. Unlike the text on this page, the differences
were not always obvious. Getting to working code required noting which
characters were possibly one or the other. The only way to determine most of
these was by running a simulation of the code. Almost all of the 0 vs. C cases
turned out to be 0, as these were used for initializing a pointer base number
(context of usage). In one case it was only through the simulation that I was
able to determine it was really CC and not 00.
Marking the locations of uncertainty was essential for determining where to
check the program code in context.
Any OCR that doesn't include the possible options, and that isn't trained on
that particular code, is worthless.
Dwight


From: cctalk  on behalf of Noel Chiappa via 
cctalk 
Sent: Sunday, January 23, 2022 9:31 AM
To: cctalk@classiccmp.org 
Cc: j...@mercury.lcs.mit.edu 
Subject: Re: Typing in lost code

> From: Gavin Scott

> I think if I had a whole lot of old faded greenbar etc. ... Someone may
> even have done this already

See:

  https://walden-family.com/impcode/imp-code.pdf

Someone's already done the specialist OCR to deal with faded program listings.

Noel


Re: Typing in lost code

2022-01-23 Thread Chuck Guzis via cctalk
On 1/23/22 10:16, Paul Koning via cctalk wrote:


> Maybe.  But OCR programs have had learning features for decades.  I've spent 
> quite a lot of time in FineReader learning mode.  Material produced on a 
> moderate-quality typewriter, like the CDC 6600 wire lists on Bitsavers, can 
> be handled tolerably well.  Especially with post-processing that knows what 
> the text patterns should be and converts common misreadings to what they 
> should be.  But the listings I mentioned before were entirely unmanageable 
> even after a lot of "learning mode" effort.  An annoying wrinkle was that I 
> wasn't dealing with greenbar but rather with Dutch line printer paper that 
> has every other line marked with 5 thin horizontal lines, almost like music 
> score paper.  Faded printout with a worn ribbon on a substrate like that is a 
> challenge even for human eyeballs, and all the "machine learning" hype can't 
> conceal the fact that no machine can come anywhere close to a human for 
> dealing with image recognition under tough conditions.

The problem is that OCR needs to be 100% accurate for many purposes.
Anything much short of that requires that the result be inspected by hand,
line by line, with knowledge of what makes sense.  Mistaking a single
fuzzy 8 for a 6 or a 3, for example, can render code inoperative with a
very difficult-to-locate bug.  Perhaps an AI might be programmed to
separate out the nonsense typos.

Old high-speed line printers weren't always wonderful with timing the
hammer strikes.  I recall some nearly impossible to read Univac 1108
engineering documents, printed on a drum printer.  Gave me headaches.

At least that's my take.

--Chuck




Re: Typing in lost code

2022-01-23 Thread Lars Brinkhoff via cctalk
Noel Chiappa wrote:
> https://walden-family.com/impcode/imp-code.pdf
> Someone's already done the specialist OCR to deal with faded program
> listings.

I tried to contact the author about converting some of the other IMP
listings, but got no reply.


Re: Typing in lost code

2022-01-23 Thread Paul Koning via cctalk



> On Jan 23, 2022, at 12:09 PM, Gavin Scott  wrote:
> 
> On Sun, Jan 23, 2022 at 9:11 AM Paul Koning via cctalk
>  wrote:
>> One consideration is the effort required to repair transcription errors.  
>> Those that produce syntax errors aren't such an issue;
>> those that pass the assembler or compiler but result in bugs (say, a 
>> mistyped register number) are harder to find.
> 
> You can always have it "turked" twice and compare the results.
> 
> This is also the sort of problem that modern Deep Machine Learning
> will just crush. Identifying individual characters should be trivial,
> you just have to figure out where the characters are first which could
> also be done with ML or you could try to do it some other way (with a
> really well registered scan maybe if it's all fixed-width characters).

Maybe.  But OCR programs have had learning features for decades.  I've spent 
quite a lot of time in FineReader learning mode.  Material produced on a 
moderate-quality typewriter, like the CDC 6600 wire lists on Bitsavers, can be 
handled tolerably well.  Especially with post-processing that knows what the 
text patterns should be and converts common misreadings to what they should be. 
 But the listings I mentioned before were entirely unmanageable even after a 
lot of "learning mode" effort.  An annoying wrinkle was that I wasn't dealing 
with greenbar but rather with Dutch line printer paper that has every other 
line marked with 5 thin horizontal lines, almost like music score paper.  Faded 
printout with a worn ribbon on a substrate like that is a challenge even for 
human eyeballs, and all the "machine learning" hype can't conceal the fact that 
no machine can come anywhere close to a human for dealing with image 
recognition under tough conditions.

That said, if you have access to a particularly good OCR, it can't hurt to 
spend a few hours trying to make it cope with the source material in question.  
But be prepared for disappointment.
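
To make concrete what "post-processing that knows what the text patterns
should be" can look like, here is a minimal Python sketch; the fixed-column
address field and the confusion table are hypothetical illustrations, not
FineReader's mechanism or any particular listing format:

    import re

    # Confusions typical of worn-ribbon listings, applied only where a
    # digit is expected
    DIGIT_FIXUPS = str.maketrans({"O": "0", "l": "1", "I": "1", "S": "5", "B": "8"})

    # Hypothetical layout: the first six columns of each line hold an address
    ADDR_FIELD = re.compile(r"^(.{6})")

    def post_process(line: str) -> str:
        m = ADDR_FIELD.match(line)
        if not m:
            return line
        return m.group(1).translate(DIGIT_FIXUPS) + line[6:]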

paul




Re: Typing in lost code

2022-01-23 Thread dwight via cctalk
It is unlikely that no current day OCR will produce an error free listing.
It is possible to train an AI to do this, but it requires specific training: it
must be trained on the specific machine code and on the same listing format. Any
generic OCR will have many errors if the text is hard to read.
The final product must include notes about the things it is not sure of, or it
would be useless. I recovered a listing for the 4004 processor that was printed
on an ASR33 with ruts in the platen. The right-hand quarter of the letters was
missing at several locations across the page. Letters such as F and P, as well
as 0 and C, were often not printed well enough to distinguish.
Luckily, F and P were often relatively easy to determine from context, but 0 and
C were often used in a hex number. Unlike the text on this page, the differences
were not always obvious. Getting to working code required noting which
characters were possibly one or the other. The only way to determine most of
these was by running a simulation of the code. Almost all of the 0 vs. C cases
turned out to be 0, as these were used for initializing a pointer base number
(context of usage). In one case it was only through the simulation that I was
able to determine it was really CC and not 00.
Marking the locations of uncertainty was essential for determining where to
check the program code in context.
Any OCR that doesn't include the possible options, and that isn't trained on
that particular code, is worthless.
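
A minimal sketch of what "including the possible options" might look like in
practice; the data structure, addresses, and candidate pairs here are
illustrative only, not Dwight's actual tooling:

    from dataclasses import dataclass

    @dataclass
    class UncertainByte:
        address: int            # location in the recovered 1702A image
        candidates: list[str]   # e.g. ["0", "C"] when the glyph is ambiguous
        resolved: str = ""      # filled in once simulation settles it

    # Record each ambiguous glyph instead of silently guessing; the list is
    # then worked through by simulating the code with each candidate and
    # seeing which value lets the program behave sensibly.
    open_questions = [
        UncertainByte(0x1F3, ["0", "C"]),
        UncertainByte(0x2A7, ["F", "P"]),
    ]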
Dwight


From: cctalk  on behalf of Noel Chiappa via 
cctalk 
Sent: Sunday, January 23, 2022 9:31 AM
To: cctalk@classiccmp.org 
Cc: j...@mercury.lcs.mit.edu 
Subject: Re: Typing in lost code

> From: Gavin Scott

> I think if I had a whole lot of old faded greenbar etc. ... Someone may
> even have done this already

See:

  https://walden-family.com/impcode/imp-code.pdf

Someone's already done the specialist OCR to deal with faded program listings.

Noel


Re: Typing in lost code

2022-01-23 Thread Gavin Scott via cctalk
On Sun, Jan 23, 2022 at 11:31 AM Noel Chiappa via cctalk
 wrote:
> See:
>
>   https://walden-family.com/impcode/imp-code.pdf
>
> Someone's already done the specialist OCR to deal with faded program listings.

Neat. Though the complex character-recognition part of that work is
now like 15-20 lines of Python code (using either Keras or PyTorch).
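
Roughly what those lines look like, as a hedged sketch in PyTorch: this
assumes the page has already been segmented into 32x32 single-character
crops and that a labelled training set exists, which, as the rest of the
thread suggests, is the harder part.

    import torch
    import torch.nn as nn

    N_CLASSES = 40  # assumption: digits, upper-case letters, a few symbols

    # A small convolutional classifier for 32x32 single-character crops
    model = nn.Sequential(
        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(32 * 8 * 8, 128), nn.ReLU(),
        nn.Linear(128, N_CLASSES),
    )

    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    def train_step(images, labels):
        # images: float tensor (B, 1, 32, 32); labels: int tensor (B,)
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
        return loss.item()

Segmenting a faded, skewed printout into those crops is where the real
difficulty lives, as the earlier messages point out.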


Re: Typing in lost code

2022-01-23 Thread Noel Chiappa via cctalk
> From: Gavin Scott

> I think if I had a whole lot of old faded greenbar etc. ... Someone may
> even have done this already 

See:

  https://walden-family.com/impcode/imp-code.pdf

Someone's already done the specialist OCR to deal with faded program listings.

Noel


Re: Typing in lost code

2022-01-23 Thread Gavin Scott via cctalk
On Sun, Jan 23, 2022 at 9:11 AM Paul Koning via cctalk
 wrote:
> One consideration is the effort required to repair transcription errors.  
> Those that produce syntax errors aren't such an issue;
> those that pass the assembler or compiler but result in bugs (say, a mistyped 
> register number) are harder to find.

You can always have it "turked" twice and compare the results.
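
In practice "compare the results" can be as simple as a unified diff of the
two independent keyings; a small sketch, with made-up filenames:

    import difflib

    with open("keying_a.txt") as a, open("keying_b.txt") as b:
        lines_a, lines_b = a.readlines(), b.readlines()

    # Any line the two typists disagree on is a line a human should re-check
    for line in difflib.unified_diff(lines_a, lines_b, "keying_a.txt", "keying_b.txt"):
        print(line, end="")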

This is also the sort of problem that modern Deep Machine Learning
will just crush. Identifying individual characters should be trivial;
you just have to figure out where the characters are first, which could
also be done with ML, or you could try to do it some other way (with a
really well-registered scan, maybe, if it's all fixed-width characters).

I think if I had a whole lot of old faded greenbar etc. I would
consider manually converting a few pages, then setting up a Kaggle
competition for it and maybe putting up a bit of money as a prize.
Someone may even have done this already (there have certainly been a
number of "OCR historical documents" competitions), but I didn't spend
too much time searching. I'm sure you're not the only one who has had
this problem to solve.


Re: Typing in lost code

2022-01-23 Thread Jonathan Chapman via cctalk
I recently dealt with this for the DaJen SCI monitor listing out of the
manual. The copy is pretty bad, and either their printer was having issues or
the slashing of "zero" vs "O" was inconsistent somehow. OCRing it produced
more of a mess than just sitting with the original and a text editor open
side by side.

I can't imagine it would've worked out well to have someone who wasn't
familiar with 8080 assembly language transcribe it; I had a rough enough time
on my own, and ended up having to compare the assembler output to a known-good
ROM dump to get the last of the discrepancies out.
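
That final check is easy to mechanise; a sketch assuming the assembler's
binary output and the ROM image are available as flat files (the filenames
are made up):

    # Compare the re-assembled binary against a known-good ROM dump
    with open("monitor_assembled.bin", "rb") as f:
        assembled = f.read()
    with open("monitor_rom_dump.bin", "rb") as f:
        rom = f.read()

    if len(assembled) != len(rom):
        print(f"length mismatch: {len(assembled)} vs {len(rom)} bytes")

    for offset, (a, r) in enumerate(zip(assembled, rom)):
        if a != r:
            print(f"{offset:04X}: assembled {a:02X} != ROM {r:02X}")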

Thanks,
Jonathan

‐‐‐ Original Message ‐‐‐

On Sunday, January 23rd, 2022 at 10:11, Paul Koning via cctalk 
 wrote:

> I've run into that situation too, with listings so difficult that even a 
> commercial OCR program (FineReader) couldn't handle it. At the time Tesseract 
> was far less capable, though I haven't tried it recently to see if that has 
> changed.
>
> Anyway, my experience was that the task was hard enough that it needed 
> someone with knowledge of the material. It may be a contract typist could do 
> a tolerable job but I have my doubts. Typing, say, an obsolete assembly 
> language program if you see it merely as a random collection of characters is 
> going to produce more errors than if the person doing the typing actually 
> understands what the material means.
>
> One consideration is the effort required to repair transcription errors. 
> Those that produce syntax errors aren't such an issue; those that pass the 
> assembler or compiler but result in bugs (say, a mistyped register number) 
> are harder to find.
>
> paul
>
> > On Jan 22, 2022, at 8:57 PM, Mark Kahrs via cctalk cctalk@classiccmp.org 
> > wrote:
> >
> > No, OCR totally fails on olde line printer listing. At least the ones I've
> >
> > tried (tesseract, online, ...)
> >
> > On Sat, Jan 22, 2022 at 8:06 PM Ethan O'Toole et...@757.org wrote:
> >
> > > Can the listings be OCR'ed?
> > >
> > >- Ethan
> > >
> > >
> > > > Has anyone ever used Amazon Mechanical Turk to employ typists to type in
> > > >
> > > > old listings of lost code?
> > > >
> > > > Asking for a friend.


Re: Typing in lost code

2022-01-23 Thread Paul Koning via cctalk
I've run into that situation too, with listings so difficult that even a 
commercial OCR program (FineReader) couldn't handle it.  At the time Tesseract 
was far less capable, though I haven't tried it recently to see if that has 
changed.

Anyway, my experience was that the task was hard enough that it needed someone
with knowledge of the material.  It may be that a contract typist could do a
tolerable job, but I have my doubts.  Typing, say, an obsolete assembly
language program that you see merely as a random collection of characters is
going to produce more errors than if the person doing the typing actually
understands what the material means.

One consideration is the effort required to repair transcription errors.  Those 
that produce syntax errors aren't such an issue; those that pass the assembler 
or compiler but result in bugs (say, a mistyped register number) are harder to 
find.

paul

> On Jan 22, 2022, at 8:57 PM, Mark Kahrs via cctalk  
> wrote:
> 
> No, OCR totally fails on olde line printer listing.  At least the ones I've
> tried (tesseract, online, ...)
> 
> 
> 
> On Sat, Jan 22, 2022 at 8:06 PM Ethan O'Toole  wrote:
> 
>> 
>> Can the listings be OCR'ed?
>> 
>>- Ethan
>> 
>> 
>>> Has anyone ever used Amazon Mechanical Turk to employ typists to type in
>>> old listings of lost code?
>>> 
>>> Asking for a friend.



Re: Typing in lost code

2022-01-23 Thread Mark Kahrs via cctalk
No, OCR totally fails on olde line printer listing.  At least the ones I've
tried (tesseract, online, ...)



On Sat, Jan 22, 2022 at 8:06 PM Ethan O'Toole  wrote:

>
> Can the listings be OCR'ed?
>
> - Ethan
>
>
> > Has anyone ever used Amazon Mechanical Turk to employ typists to type in
> > old listings of lost code?
> >
> > Asking for a friend.
> >
>
> --
> : Ethan O'Toole
>
>
>


Re: Typing in lost code

2022-01-22 Thread Ethan O'Toole via cctalk



Can the listings be OCR'ed?

- Ethan



Has anyone ever used Amazon Mechanical Turk to employ typists to type in
old listings of lost code?

Asking for a friend.



--
: Ethan O'Toole




Typing in lost code

2022-01-22 Thread Mark Kahrs via cctalk
Has anyone ever used Amazon Mechanical Turk to employ typists to type in
old listings of lost code?

Asking for a friend.