On 1/3/2012 12:52 PM, Lee Passey wrote:
> On Tue, January 3, 2012 11:46 am, Edward Betts wrote:
>
>> I agree we need a way to correct our books. I find it difficult to
>> read the ePub or Kindle versions of scanned books because of the
>> OCR errors. I'm sure the OCR errors are irritating when using
>> text-to-speech.
>
> As a first step to move this process along I have created a servlet
> which parses the Abbyy output file and converts it to HTML.

[snip]

> My next step will be to allow the document to be gzipped before
> downloading. After that, I will add an option to break the HTML file
> into multiple files, each of which matches a single page image, and
> return the collection as a zip archive.

This step is now complete. Additionally, there is now an option to
omit (or include, depending on your perspective) word coordinates.

> To use this service, in a browser navigate to
> "http://www.ebookcoop.net/ebookcoop/FromIA?[iaid]", where [iaid] is
> the Internet Archive IDentifier for a specific work.

By default, Abbyy output is returned as a single HTML file without
word coordinates. To add coordinates, append "&coords" to the query
string; e.g.
"http://www.ebookcoop.net/ebookcoop/FromIA?cu31924097556546&coords".

To return the file as a gzipped HTML file, append "&gzip" to the query
string.

To return the file as a zip archive of HTML files, where each file
represents a single page (and should follow the same naming convention
as the image files), append "&zip" to the query string.

Note that "&zip" and "&gzip" are incompatible, with "&zip" taking
precedence; if you use both options, "&gzip" will be ignored.

If you were building an online editing tool you would probably want to
use a query string like this:

"http://www.ebookcoop.net/ebookcoop/FromIA?tarzanofapes00burruoft&coords&zip"

Again, let me remind you that this service is running on an old
Pentium /// in my basement at the end of a DSL line, so expect it to be
quite slow (it could take several minutes to construct the file).

As always, feedback is encouraged.

Cheers,
Lee
_______________________________________________
Ol-discuss mailing list
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
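
P.S. For anyone scripting against the service, the query-string rules
described above could be assembled along these lines. This is only a
minimal sketch: the function name and its parameters are hypothetical
(nothing like them ships with the servlet), and it assumes the options
are bare flags joined with "&" after the identifier, exactly as in the
example URLs.

```python
# Base address of the servlet, as given in the message above.
BASE_URL = "http://www.ebookcoop.net/ebookcoop/FromIA"


def build_from_ia_url(iaid, coords=False, gzip=False, zip_pages=False):
    """Build a FromIA request URL (hypothetical helper, not part of the service).

    iaid      -- Internet Archive identifier of the work
    coords    -- ask for word coordinates ("&coords")
    gzip      -- ask for a single gzipped HTML file ("&gzip")
    zip_pages -- ask for a zip archive with one HTML file per page ("&zip")
    """
    if not iaid:
        raise ValueError("an Internet Archive identifier is required")

    flags = []
    if coords:
        flags.append("coords")
    if zip_pages:
        # "&zip" takes precedence and the service ignores "&gzip" when both
        # are present, so this sketch simply leaves "gzip" out in that case.
        flags.append("zip")
    elif gzip:
        flags.append("gzip")

    # The identifier is the first, unnamed item in the query string.
    return BASE_URL + "?" + "&".join([iaid] + flags)


if __name__ == "__main__":
    # The request suggested above for an online editing tool.
    print(build_from_ia_url("tarzanofapes00burruoft", coords=True, zip_pages=True))
```

Dropping "gzip" on the client side when "zip" is requested is a choice
made here for clarity; sending both should behave the same, since the
service ignores "&gzip" in that case.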
