Re: [ol-discuss] Recording the quality of a book's OCR

Lee Passey Fri, 30 Dec 2011 13:27:39 -0800

On Fri, December 30, 2011 10:35 am, Edward Betts wrote:

> I'll explain scanned image page coordinates for words.
>
> In the book reader we show the scanned images. This is the image of the
> page captured by the book scanner. When you search with the book reader
> it highlights matched words. It does this by drawing a box around the
> word in the image. To draw the box we need to know the location of the
> word on the page, that is the x and y coordinates and the height and
> width of the word in pixels.
>
> This is the reason we're still looking for software to provide
> corrections for OCR. You're right, that one solution is to let people
> download the text, correct it and upload Word files, but it doesn't
> match our requirements.


I suspect that this is an unnecessary requirement.

The purpose of word coordinates is to support text searching in the various
"photo album" formats: FlipBook, DjVu, and PDF. In these formats the user is
not presented with digitized text, she is presented with the picture of a
page. Of course, one cannot search for text in a blob of pixels, so these
formats have a hidden layer containing text; when a user searches for a word,
the program searches the hidden text, gets the perceived coordinates of that
word on a certain picture, and then presents the picture with that rectangle
highlighted.
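
To make that lookup concrete, here is a minimal sketch in Python. It is only
an illustration of the idea, not the book reader's actual code: the WordBox
record, the find_word function, and the sample words and pixel values are all
made up for the example.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    """One OCR word plus where it sits on the scanned page (hypothetical record)."""
    text: str
    page: int
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels
    width: int
    height: int


def find_word(hidden_layer, query):
    """Return the box of every hidden-layer word that equals the query, ignoring case."""
    needle = query.casefold()
    return [w for w in hidden_layer if w.text.casefold() == needle]


# Invented sample data standing in for the hidden text layer of a book.
layer = [
    WordBox("Call", 1, 120, 300, 64, 22),
    WordBox("me", 1, 190, 300, 40, 22),
    WordBox("Ishmael", 1, 236, 300, 110, 22),
    WordBox("Ishmacl", 2, 80, 512, 108, 22),  # mis-recognized by the OCR
]

# The viewer would draw a rectangle at each hit; the page 2 word is not matched.
hits = find_word(layer, "ishmael")
```

The reader never sees the strings in the layer, only the rectangle drawn from
the coordinates of each hit.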

For this purpose, crappy OCR is probably good enough. Garbage "words" are
irrelevant, as the end user never sees them. If a word is mis-recognized, and
it just happens to be a word that a user is searching for, the search
algorithm will not find that particular instance of the word, but in the big
picture that failure just doesn't matter much. So for the purpose of backing
"photo album" formats, improved text is probably unnecessary.

This next point is the important one.

<strong>The people who are asking for a method to improve text don't want to
use a "photo album," they want to use the OCRed text directly.</strong>

For the purpose of improving text files, maintaining word coordinates is
unnecessary. There is no expectation that this improved text become the
backing text for a "photo album" format, nor is there any need for it to do
so. Indeed, the ".txt" files available at archive.org do not maintain word
coordinates, so I can see no reason why anyone would object to those files
being updated by proofreaders.

As I understand it, ePub files are created by dynamically generating HTML from
the ABBYY XML format
(http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml) then packaging
that HTML into the ePub container. I also understand that if an improved HTML
file already exists in the output directory then /that/ file will be used and
a new HTML file will not be generated. If this understanding is correct, then
there is no reason why the HTML file could not also be incrementally improved
without maintaining word coordinates.
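
If my understanding is right, the behavior amounts to something like the
following Python sketch. The file name "book.html", the html_for_epub
function, and the generate callback are assumptions of mine for illustration;
I have not seen the actual derivation code.

```python
import tempfile
from pathlib import Path


def html_for_epub(book_dir, generate):
    """Prefer an existing (possibly hand-improved) HTML file over regenerating it.

    "book.html" is a hypothetical file name; generate stands in for whatever
    builds HTML from the OCR XML.
    """
    existing = book_dir / "book.html"
    if existing.exists():
        return existing.read_text(encoding="utf-8")
    return generate(book_dir)


def from_ocr(book_dir):
    """Stand-in for the real OCR-to-HTML generator (invented output)."""
    return "<p>raw OCR tcxt</p>"


with tempfile.TemporaryDirectory() as tmp:
    book = Path(tmp)
    # No HTML file yet, so the text is generated from the OCR.
    generated = html_for_epub(book, from_ocr)
    # A proofreader uploads a corrected file; from now on it wins.
    (book / "book.html").write_text("<p>proofread text</p>", encoding="utf-8")
    improved = html_for_epub(book, from_ocr)
```

Nothing in this path touches word coordinates, which is the point.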

My suggestion is to move forward with some sort of incremental improvement
process for .txt and HTML files. There is no need to try to maintain word
coordinates for these files. The current method of developing the text backing
for the "photo album" formats can continue unchanged, unaffected by
improvements in other formats.
