Re: [ol-tech] How are 'abbyy' files generated? (was Re: Epubs with missing pages)

Tom Morris Fri, 06 Nov 2015 09:20:06 -0800

On Fri, Nov 6, 2015 at 4:01 AM, Jon Leech <[email protected]> wrote:

> On Thu, Nov 05, 2015 at 10:16:04PM -0500, Tom Morris wrote:
> > I've got a fix in hand and will generate a pull request as soon as I have
> > some test data to test with.
>
>     It looks like the 'epub' project requires 'abbyy' OCR output as a
> starting point. Is the toolchain for going from raw scans to abbyy also
> available, so we might be able to generate our own individual test
> datasets from our own books? I skimmed over all the other GitHub
> internetarchive projects, but it wasn't apparent which, if any of them,
> handles the scan->abbyy steps of the pipeline.



OCR is done by the commercial software package ABBYY FineReader:
http://www.abbyy.com/finereader/

This happens automatically on the IA server farm as part of the "Derive"
process after a scanned file is uploaded.
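
For test data, one option may be to reuse the ABBYY output that the Derive
process has already produced for an existing item, rather than running the
OCR step yourself. A minimal sketch, assuming derived ABBYY files are exposed
as <identifier>_abbyy.gz under archive.org/download (a naming convention
assumed here, so verify it against a real item first). The helper names and
the identifier below are hypothetical:

```python
import gzip
import urllib.request
from urllib.parse import quote

# Assumed public download endpoint; not stated in this thread.
DOWNLOAD_BASE = "https://archive.org/download"


def abbyy_url(identifier: str) -> str:
    """Build the assumed URL of the derived ABBYY file for an item."""
    if not identifier:
        raise ValueError("identifier must be non-empty")
    safe = quote(identifier, safe="")
    # Assumption: the derived file is named <identifier>_abbyy.gz.
    return f"{DOWNLOAD_BASE}/{safe}/{safe}_abbyy.gz"


def fetch_abbyy_xml(identifier: str, timeout: float = 60.0) -> bytes:
    """Download and gunzip the ABBYY XML for an item (needs network access)."""
    with urllib.request.urlopen(abbyy_url(identifier), timeout=timeout) as resp:
        return gzip.decompress(resp.read())


if __name__ == "__main__":
    # "example_item" is a placeholder, not a real item identifier.
    print(abbyy_url("example_item"))
```

Only abbyy_url runs without network access. fetch_abbyy_xml fails with an
HTTP error if the item has no derived ABBYY file at the assumed location.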

Tom

