On Sun, 27 Sep 2015, Johnny Billquist wrote:
That would be possible, I guess. But I would so like to remember, or find again, what I used back then. The results it produced were pretty much identical to the original. Manuals, in comparison, would be pretty straightforward. (Fewer fonts, and fewer strange layouts than books, in my eye. Figures still need to be bitmaps, though.)

While I have no way of knowing what you were using back in the day, something else to keep in mind about books of text vs. manuals: with a book of text, there is significantly more context for every item. As a trivial example, in English-language text, a 'Q' is virtually never followed by anything other than 'u', space, or punctuation. Therefore, if there is a letter following a 'Q', it can be assumed to be 'u' unless proven otherwise. Not so for part numbers, variable names, etc. Applying "spell-checking" to a document gives a very high initial set of probabilities for letters that might otherwise be unclear. If a given font has a very stylized 'e', then its identity can be checked just with letter frequency: if the glyph is extremely common and surrounded by other letters, it is quite unlikely to be a slashed '0'. In general, '0', 'O', '1', 'l', and 'I' can be differentiated by context, such as whether they are surrounded by numbers or letters, but NOT as reliably by shape, or even by pixel matching.
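To make the neighbour-context idea concrete, here is a rough sketch in Python. The look-alike tables and the function name resolve_by_context are made up for illustration and are not taken from any real OCR package; it only looks at the immediately adjacent characters, and it ignores neighbours that are themselves ambiguous.

```python
# Hypothetical look-alike tables (an assumption for this sketch).
TO_DIGIT = {"O": "0", "o": "0", "l": "1", "I": "1"}
TO_LETTER = {"0": "O", "1": "l"}


def resolve_by_context(token):
    """Re-guess ambiguous glyphs from their unambiguous neighbours."""
    chars = list(token)
    for i, ch in enumerate(chars):
        if ch not in TO_DIGIT and ch not in TO_LETTER:
            continue  # not one of the confusable glyphs
        neighbours = [token[j] for j in (i - 1, i + 1) if 0 <= j < len(token)]
        # Count only neighbours that are not themselves confusable.
        digits = sum(1 for n in neighbours if n.isdigit() and n not in TO_LETTER)
        letters = sum(1 for n in neighbours if n.isalpha() and n not in TO_DIGIT)
        if digits > letters and ch in TO_DIGIT:
            chars[i] = TO_DIGIT[ch]   # letter shape among digits
        elif letters > digits and ch in TO_LETTER:
            chars[i] = TO_LETTER[ch]  # digit shape among letters
    return "".join(chars)


print(resolve_by_context("2O15"))   # the 'O' sits next to a '2'
print(resolve_by_context("P0WER"))  # the '0' sits between letters
```

The same rule that repairs "P0WER" will happily mangle a part number that legitimately mixes letters and digits, which is the whole problem with tech documents.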

Therefore, OCR programs that rely on those kinds of techniques might do great on ordinary text, but border on unusable for tech documents.


My idea was to make human-assisted OCR, by displaying the OCR in progress, with color coding of characters based on their probability of accuracy. Then, cheap labor could manually enter characters, starting with those that had the lowest probability of accuracy. Minor heuristic algorithms could then use the incoming data of additional character/pixel-pattern pairs to improve the guesses of subsequent characters. From the accumulated pairs, the system would learn additional fonts.
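A minimal sketch of that loop, in Python, under some loud assumptions: each glyph is a dict with hypothetical "pattern", "char", and "conf" fields, ask_human stands in for whatever data-entry screen the worker sees, and "learning" is reduced to an exact lookup of a pixel pattern that a person has already keyed in. A real system would need nearest-match comparison of patterns rather than exact equality.

```python
import heapq


def review_order(glyphs):
    """Yield glyph indices from least to most confident."""
    heap = [(g["conf"], i) for i, g in enumerate(glyphs)]
    heapq.heapify(heap)
    while heap:
        _, i = heapq.heappop(heap)
        yield i


def assisted_pass(glyphs, ask_human, threshold=0.9):
    """Send doubtful glyphs to a person, worst first; return how many were asked."""
    learned = {}  # pixel pattern -> character confirmed by a person
    asked = 0
    for i in review_order(glyphs):
        g = glyphs[i]
        if g["conf"] >= threshold:
            break  # everything remaining is at least this confident
        if g["pattern"] in learned:
            # Reuse an earlier human answer for an identical pattern.
            g["char"] = learned[g["pattern"]]
        else:
            g["char"] = ask_human(g)
            learned[g["pattern"]] = g["char"]
            asked += 1
        g["conf"] = 1.0
    return asked


page = [
    {"pattern": "p1", "char": "O", "conf": 0.40},
    {"pattern": "p2", "char": "e", "conf": 0.95},
    {"pattern": "p1", "char": "0", "conf": 0.50},
]
questions = assisted_pass(page, ask_human=lambda g: "0")
print(questions, [g["char"] for g in page])
```

Here the two glyphs with the same pattern cost only one question to the worker; the second is filled in from the first answer.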

The cheap labor could be neighborhood kids, off-shore out-sourcing, or even grad students, depending on how much you care about their quality of life and cost of living. For premium quality, use workers who even have some knowledge of the material.

BUT, one must never pick a worker who was brought up to interchange '0' and 'O', '1' and 'l', etc. (Remember when some typewriters didn't HAVE both characters?)
