On Fri, Sep 25, 2015 at 03:58:40PM -0400, Tom Morris wrote:
> Someone asked me off-list what types of OpenLibrary data cleanups I'd
> suggest.  Below is the list that I came up with off the top of my head.
> What others would folks suggest?  What do you think is more important?
>
> Possible data cleanup targets:
...
> - clean OCR'd texts (actually an IA task, not OL)

    There seem to be quite a lot of ebooks in OL which are simply
missing pages and pages. It appears to be a systematic problem in the
scan -> ebook step, since the PDFs have all been OK in the cases I've
looked at.

    A library that offers books without all their pages is not actually
providing a useful service, so IMO this is the most important item on
your list. If the ebooks can't be fixed, there should at least be some
automated way to tag the broken ones and remove them. Since the broken
ebooks I came across were all broken in structurally similar ways
(missing pages at the start of the first chapter, and I think often at
the start of other chapters as well), perhaps that's amenable to
automated detection by comparing ebook and PDF.
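
    A rough sketch of what such a comparison might look like, assuming
the text of each PDF page and the full text of the ebook have already
been extracted (that step is not shown). The function names, the
5-word shingle size and the 0.5 threshold are all made up for
illustration and are not anything OL or IA actually uses:

```python
import re


def shingles(text, n=5):
    """Return the set of n-word sequences in text (lowercased, punctuation ignored)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def missing_pages(pdf_pages, ebook_text, n=5, threshold=0.5):
    """Return 1-based numbers of PDF pages whose text is mostly absent from the ebook.

    pdf_pages  -- list of strings, one per PDF page (hypothetical input format)
    ebook_text -- the whole ebook as a single string
    threshold  -- a page counts as missing if less than this fraction of its
                  shingles is found in the ebook (an arbitrary starting value)
    """
    ebook = shingles(ebook_text, n)
    missing = []
    for number, page in enumerate(pdf_pages, start=1):
        page_shingles = shingles(page, n)
        if not page_shingles:
            continue  # blank or near-blank page: nothing to compare
        found = sum(1 for s in page_shingles if s in ebook)
        if found / len(page_shingles) < threshold:
            missing.append(number)
    return missing


# Tiny made-up example: the ebook lacks the second page.
pages = [
    "one two three four five six seven",
    "alpha beta gamma delta epsilon zeta eta",
    "red orange yellow green blue indigo violet",
]
ebook = pages[0] + " " + pages[2]
print(missing_pages(pages, ebook))
```

    Matching on word sequences rather than exact strings is meant to
tolerate small OCR and formatting differences between the two
versions; the threshold would need tuning against real scans.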

    Jon
_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
Archives: http://www.mail-archive.com/[email protected]/
To unsubscribe from this mailing list, send email to 
[email protected]