Re: [ol-discuss] Recording the quality of a book's OCR

Lee Passey Tue, 10 Jan 2012 13:27:15 -0800

On 1/3/2012 12:52 PM, Lee Passey wrote:

> On Tue, January 3, 2012 11:46 am, Edward Betts wrote:
>
>> I agree we need a way to correct our books. I find it difficult to
>> read the ePub or Kindle versions of scanned books because of the
>> OCR errors. I'm sure the OCR errors are irritating when using
>> text-to-speech.
>
> As a first step to move this process along I have created a servlet
> which parses the Abbyy output file and converts it to HTML.


[snip]

> My next step will be to allow the document to be gzipped before
> downloading. After that, I will add an option to break the HTML file
> into multiple files, each of which matches a single page image, and
> return the collection as a zip archive.

This step is now complete. Additionally, there is now an option to omit
(or include, depending on your perspective) word coordinates.

> To use this service, in a browser navigate to
> "http://www.ebookcoop.net/ebookcoop/FromIA?[iaid]", where [iaid] is
> the Internet Archive IDentifier for a specific work.

By default, Abbyy output is returned as a single HTML file without word
coordinates. To add coordinates, add "&coords" to the query string, e.g.
"http://www.ebookcoop.net/ebookcoop/FromIA?cu31924097556546&coords". To
return the file as a gzipped HTML file, add "&gzip" to the query string.
To return the file as a zip archive of HTML files where each file
represents a single page (and should have the same naming convention as
the image files) add "&zip" to the query string.
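For anyone scripting against the service, here is a rough sketch of
how a client might unpack the three kinds of response described above
(plain HTML, gzipped HTML, or a zip of per-page HTML files). It works
on the raw response bytes and makes no network call. The function name
"decode_payload", the "book.html" key used for single-file responses,
and the UTF-8 encoding are assumptions made for illustration, not
something the service documents. If the gzipped variant is delivered
via an HTTP Content-Encoding header, your HTTP client may already have
decompressed it, in which case the plain-HTML branch applies.

```python
import gzip
import io
import zipfile


def decode_payload(data, encoding="utf-8"):
    """Return {name: html_text} for a plain, gzipped, or zipped body.

    The encoding and the "book.html" key are assumptions.
    """
    if data[:2] == b"\x1f\x8b":
        # gzip magic number: a single compressed HTML file.
        return {"book.html": gzip.decompress(data).decode(encoding)}
    if zipfile.is_zipfile(io.BytesIO(data)):
        # Zip archive: one HTML file per page image.
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return {
                name: archive.read(name).decode(encoding)
                for name in archive.namelist()
                if not name.endswith("/")  # skip directory entries
            }
    # Otherwise treat the body as a single uncompressed HTML file.
    return {"book.html": data.decode(encoding)}
```

An editing tool could then match each key of the returned mapping to
the page image with the same base name.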

Note that "&zip" and "&gzip" are incompatible, with "&zip" taking
precedence; if you use both options, "&gzip" will be ignored. If you
were building an online editing tool you would probably want to use a
query string like this:

"http://www.ebookcoop.net/ebookcoop/FromIA?tarzanofapes00burruoft&coords&zip"
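
The query-string rules above can be sketched as a small helper that
assembles the URL. The helper name "build_url" and its keyword
arguments are made up for illustration; only the base URL and the
"coords", "gzip" and "zip" flags come from this message. Since "&zip"
takes precedence, the sketch simply leaves "gzip" out when both are
requested.

```python
from urllib.parse import quote

BASE_URL = "http://www.ebookcoop.net/ebookcoop/FromIA"


def build_url(iaid, coords=False, gzip=False, zip_pages=False):
    """Return the service URL for one Internet Archive identifier."""
    flags = []
    if coords:
        flags.append("coords")
    if zip_pages:
        flags.append("zip")
    elif gzip:
        # "&gzip" is ignored when "&zip" is present, so only add it
        # when a zip archive was not requested.
        flags.append("gzip")
    # The identifier is the first, bare element of the query string.
    return BASE_URL + "?" + "&".join([quote(iaid, safe="")] + flags)


print(build_url("tarzanofapes00burruoft", coords=True, zip_pages=True))
```

Run as is, this prints the same URL as the example above.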

Again, let me remind you that this service is running on an old Pentium
III in my basement at the end of a DSL line, so expect it to be quite
slow (it may take several minutes to construct the file).

As always, feedback is encouraged.

Cheers,
Lee