Re: [ol-discuss] Recording the quality of a book's OCR

Lee Passey Tue, 03 Jan 2012 11:59:52 -0800

On Tue, January 3, 2012 11:46 am, Edward Betts wrote:

> I agree we need a way to correct our books. I find it difficult to read
> the ePub or Kindle versions of scanned books because of the OCR errors.
> I'm sure the OCR errors are irritating when using text-to-speech.


As a first step to move this process along I have created a servlet which
parses the Abbyy output file and converts it to HTML.

The resultant HTML identifies words which Abbyy could not find in its internal
dictionary, and words which contain uncertain characters.

Some attempt is made to identify blocks of text which are centered, and to set
the relative font size for those spans of text which are significantly larger
or smaller than the norm.

Existing line breaks are preserved by the addition of a <br class="ocr"/>
element. Line-ending soft hyphens (as identified by Abbyy) are replaced by
"&shy;~".

Word coordinates are maintained (or perhaps more accurately, recomputed) and
are attached to each word as a "title" attribute, e.g. <span class="word"
title="(760,843),(791,996)">, where the first pair is the upper-left
coordinate and the second pair is the lower-right coordinate. (Technically,
the coordinates are presented as (y,x). Should I change this to (x,y)?)
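
For anyone consuming the HTML, here is a minimal sketch of reading one of
these "title" values back into a bounding box. It assumes the format is
exactly "(y,x),(y,x)" as in the example above; the names Box and
parse_word_box are made up for illustration and are not part of the servlet.

```python
import re
from typing import NamedTuple


class Box(NamedTuple):
    """Bounding box in conventional (x, y) terms; hypothetical helper type."""
    left: int
    top: int
    right: int
    bottom: int


# Assumed format: "(y1,x1),(y2,x2)" with non-negative integers and no spaces.
_TITLE_RE = re.compile(r"^\((\d+),(\d+)\),\((\d+),(\d+)\)$")


def parse_word_box(title: str) -> Box:
    """Convert a title such as "(760,843),(791,996)" into a Box.

    The pairs are read as (y,x): upper-left first, then lower-right.
    """
    match = _TITLE_RE.match(title.strip())
    if match is None:
        raise ValueError(f"unrecognized coordinate string: {title!r}")
    top, left, bottom, right = (int(group) for group in match.groups())
    return Box(left=left, top=top, right=right, bottom=bottom)


box = parse_word_box("(760,843),(791,996)")
print(box.right - box.left, box.bottom - box.top)  # word width and height
```

If the order is later switched to (x,y), only the unpacking line in
parse_word_box would need to change.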

A link to "archive.css" is added to the beginning of the file. This allows an
end user to, among other things, highlight or otherwise mark words which are
uncertain or not in the dictionary. To view the document without the original
line breaks, add "br.ocr { display:none }" to the .css file.

My next step will be to allow the document to be gzipped before downloading.
After that, I will add an option to break the HTML file into multiple files,
each of which matches a single page image, and return the collection as a zip
archive.

To use this service, in a browser navigate to
"http://www.ebookcoop.net/ebookcoop/FromIA?[iaid]", where [iaid] is the
Internet Archive IDentifier for a specific work.  For example,
"http://www.ebookcoop.net/ebookcoop/FromIA?cu31924097556546" will return the
_Writings of Henry David Thoreau_, and
"http://www.ebookcoop.net/ebookcoop/FromIA?tarzanofapes00burruoft" will return
_Tarzan of the Apes_.
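
For scripted use, the request URL can be built as in the sketch below. It
only constructs the string and makes no network call; the function name
from_ia_url is an invention for this example, and percent-encoding the
identifier is a precaution I am assuming is harmless, not something the
servlet is known to require.

```python
from urllib.parse import quote

# Base address of the service as given above.
SERVICE_URL = "http://www.ebookcoop.net/ebookcoop/FromIA"


def from_ia_url(iaid: str) -> str:
    """Return the service URL for an Internet Archive identifier.

    The identifier is passed as the bare query string, matching the
    "FromIA?[iaid]" pattern.
    """
    if not iaid:
        raise ValueError("an Internet Archive identifier is required")
    return f"{SERVICE_URL}?{quote(iaid, safe='')}"


print(from_ia_url("tarzanofapes00burruoft"))
```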

This servlet makes an HTTP connection to the Internet Archive to download the
*_abbyy.gz file, builds a DOM in memory, performs an overall evaluation and
some transformations, then serializes the result to the servlet output. It is
running on an old Pentium /// in my basement at the end of a DSL line, so
expect it to be quite slow (it could take several minutes to construct the
file). Also, be kind; try not to overload or monopolize it. If the Internet
Archive would like to give me access to a servlet engine on a fast server with
a fat pipe, I'm sure performance would be vastly improved.
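
To make the decompress / build-a-tree / evaluate steps concrete, here is a
rough Python sketch. It is not the servlet's code (which is Java) and it skips
the download entirely, working on gzipped bytes already in memory. The sample
document is made up, and the element name "charParams" and the "suspicious"
attribute are my assumptions about the Abbyy XML layout; check them against a
real *_abbyy.gz file before relying on this.

```python
import gzip
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for an Abbyy output file.
SAMPLE_XML = (
    b"<document><page>"
    b'<charParams l="1" t="2" r="3" b="4" suspicious="true">a</charParams>'
    b'<charParams l="5" t="2" r="7" b="4">b</charParams>'
    b"</page></document>"
)


def summarize_abbyy(gz_bytes: bytes) -> tuple[int, int]:
    """Return (total characters, suspicious characters) for gzipped XML."""
    xml_bytes = gzip.decompress(gz_bytes)   # undo the .gz wrapper
    root = ET.fromstring(xml_bytes)         # build the tree in memory
    total = suspicious = 0
    for element in root.iter():
        # Strip any "{namespace}" prefix so the check works either way.
        if element.tag.rsplit("}", 1)[-1] != "charParams":
            continue
        total += 1
        if element.get("suspicious") in ("true", "1"):
            suspicious += 1
    return total, suspicious


total, suspicious = summarize_abbyy(gzip.compress(SAMPLE_XML))
print(f"{suspicious} of {total} characters flagged as uncertain")
```

Holding the whole tree in memory, as here, is simple but costly for a long
book, which is part of why a slow machine needs minutes per file.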

Feedback is encouraged.

Cheers,
Lee
