Disclaimer
----------

I know basically nothing about OCR, handwriting recognition, speech
recognition, or automatic translation.


Tolerance for Ambiguity Makes Things Easier
-------------------------------------------

Traditionally, the goal of OCR, handwriting recognition, speech
recognition, or automatic translation has been to find the single
correct interpretation of marks on paper or sounds in a signal stream.
Often, though, it's much easier to produce a description of a set of
possible interpretations (perhaps as a finite-state machine or
infinite-order Markov model) which reliably includes the correct
interpretation than it is to reliably choose the correct
interpretation.  When is this ambiguous representation good enough to
be useful?

Downstream disambiguation
-------------------------

Traditional (e.g.) OCR software proceeds as a pipeline of stages:
convert an image into marks; convert the marks into symbols; convert
the symbols into letters; turn the letters into lines of words,
columns, paragraphs, and so forth.  In its simplest form, this works
very badly; one way to remedy this is to pass sets of candidates from
one stage to the next.

Ocrad, for example, has a -x option to list all the candidate
transliterations for each symbols, including certainty levels, so that
perhaps a downstream program such as a dictionary-driven spell-checker
could disambiguate symbols by context.  (Unfortunately, it isn't
actually useful in Ocrad 0.11, because it almost never comes up with
more than one candidate, and the candidate weights aren't useful
either.)

This is not the only possible way to do downstream disambiguation; for
example, you could resume the algorithm upstream (with Icon or Python
generators, or Haskell lazy lists, or Prolog backtracking, or
simulating these by hand) to get it to generate more possibilities, or
you could run through the entire pipeline several times, each time
feeding the output of the next stage into the previous stage as a set
of weights.

T9 text input on cell phones is an example of this technique: the
keyboard produces a finite-state-machine specification of a word
(e.g. the regexp '[tuv][ghi][def]') which is then disambiguated to a
list of the most probable candidates (perhaps 'the', 'tie', 'vid', in
this case) by consulting a dictionary; then the person who typed the
word selects the desired word from the list of words.

In an OCR system, this approach could simplify the proofreading
process

Search
------

Suppose you have an NFA representation of a text generated by an OCR
program.  You can construct an NFA representation of the suffixes of
the text by adding a new state with epsilon-transitions to all the
existing states.  Now, if you have an NFA representation of some other
word (generated by speech or handwriting recognition, for example)
that you want to find in this document, you can simulate the two NFAs
in parallel to find what points in the document might contain it.

I suspect this could be very useful.

Given infinite-order Markov models of the text --- that is, NFAs with
probabilities on the transitions --- you could even include the
likelihood of a match in the inputs to a TF/IDF-like ranking
algorithm, so that less certain matches ranked lower.

Related work
------------

Presumably most work in OCR, neural networks, speech recognition,
handwriting recognition, etc., deals with tolerance of ambiguity, and
I wouldn't be surprised if there were a bunch of work dealing with
uses for ambiguous final products.

The Soundex scheme, now nearly a century old, reduces similar-sounding
words to small equivalence classes by omitting all vowels, lumping
consonants into six consonant classes of similar-sounding letters, and
collapsing runs of the same consonant class down into one; the idea is
to support search by removing most of the random information
introduced into surnames by spelling variation, as well as some of the
signal.  You could view a Soundex heading such as "S420" as an NFA
(/^s[aehiouwy]*(l[aehiouwy]*)+([cgjkqsxz][aehiouwy]*)+$/ I think,
producing words like "slice"), so it's pretty closely related in some
sense.

Reply via email to