On Thu, Nov 5, 2015 at 9:33 PM, Jon Leech <[email protected]> wrote:

> On Thu, Nov 05, 2015 at 08:45:13PM -0500, Tom Morris wrote:
>
> > Also, any patterns that you've noticed about when
> > pages are missing and when they're not would be helpful.
>
> It seems formulaic. If the problem occurs, the text from the first
> page of each chapter in the scan is missing in the epub. Maybe there are
> others missing too, haven't noticed that if so. I haven't tried to
> establish relationships between metadata of the books and bad epubs or
> anything fancy like that - I just gave up on epubs after noticing this
> happening over and over, and went to PDFs instead.
Interesting that you should say that, because it exactly matches a
hypothesis that I developed by inspecting the sole example. The _scandata.xml
<https://ia700802.us.archive.org/32/items/billgalactichero00harr/billgalactichero00harr_scandata.xml>
file contains these page types:

    Normal      221
    Chapter      19
    Color Card    2
    Cover         2
    Copyright     1
    Title         1

The code
<https://github.com/internetarchive/epub/blob/master/iabook_to_daisy.py#L154>
handles cover, title, title page, copyright, contents, and normal. Can you
see what page type is missing?

Every scan where the page layout analysis has identified a chapter start
page will be missing that page from its epub. No idea what proportion of
the IA books this affects.

I've added the bug to the issue tracker:
https://github.com/internetarchive/epub/issues/31

I've got a fix in hand and will generate a pull request as soon as I have
some data to test with.

Tom
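For anyone who wants to repeat the check described above on another book, here is a rough sketch in Python. It is not the code from iabook_to_daisy.py. It assumes that a scandata file stores each page's type in a <pageType> element with no XML namespace, and that type names can be compared case-insensitively against the handled list quoted in this message. The SAMPLE document and both function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, trimmed-down stand-in for a *_scandata.xml file.
# Assumption: each page's type lives in a <pageType> element.
SAMPLE = """<book>
  <pageData>
    <page leafNum="0"><pageType>Cover</pageType></page>
    <page leafNum="1"><pageType>Title</pageType></page>
    <page leafNum="2"><pageType>Copyright</pageType></page>
    <page leafNum="3"><pageType>Chapter</pageType></page>
    <page leafNum="4"><pageType>Normal</pageType></page>
    <page leafNum="5"><pageType>Normal</pageType></page>
    <page leafNum="6"><pageType>Color Card</pageType></page>
  </pageData>
</book>"""

# Page types this message says the converter handles,
# lower-cased here so the comparison ignores case (an assumption).
HANDLED = {"cover", "title", "title page", "copyright", "contents", "normal"}


def count_page_types(xml_text):
    """Return a Counter mapping each pageType value to its page count."""
    root = ET.fromstring(xml_text)
    return Counter((el.text or "").strip() for el in root.iter("pageType"))


def unhandled_page_types(counts, handled=HANDLED):
    """Return only the page types (with counts) not in the handled set."""
    return {t: n for t, n in counts.items() if t.lower() not in handled}


if __name__ == "__main__":
    counts = count_page_types(SAMPLE)
    for page_type, n in counts.most_common():
        print(page_type, n)
    print("unhandled:", unhandled_page_types(counts))
```

On the sample above, both Chapter and Color Card come back as unhandled. Only Chapter pages should carry book text, so that is the type whose absence would show up as missing pages in the epub.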
_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
Archives: http://www.mail-archive.com/[email protected]/
To unsubscribe from this mailing list, send email to [email protected]
