On Fri, Nov 6, 2015 at 4:01 AM, Jon Leech <[email protected]> wrote:

> On Thu, Nov 05, 2015 at 10:16:04PM -0500, Tom Morris wrote:
> > I've got a fix in hand and will generate a pull request as soon as I have
> > some test data to test with.
>
> It looks like the 'epub' project requires 'abbyy' OCR output as a
> starting point. Is the toolchain for going from raw scans to abbyy also
> available, so we might be able to generate our own individual test
> datasets from our own books? I skimmed over all the other github
> internetarchive projects, but it wasn't apparent which, if any of them
> handles the scan->abbyy steps of the pipeline.

OCR is done by the commercial software package ABBYY FineReader
http://www.abbyy.com/finereader/

This happens automatically on the IA server farm as part of the "Derive"
process after a scanned file is uploaded.

Tom

_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
Archives: http://www.mail-archive.com/[email protected]/
To unsubscribe from this mailing list, send email to [email protected]
