On Thu, Nov 5, 2015 at 9:33 PM, Jon Leech <[email protected]> wrote:

> On Thu, Nov 05, 2015 at 08:45:13PM -0500, Tom Morris wrote:
>
> > Also, any patterns that you've noticed about when
> > pages are missing and when they're not would be helpful.
>
>     It seems formulaic. If the problem occurs, the text from the first
> page of each chapter in the scan is missing in the epub. Maybe there are
> others missing too, haven't noticed that if so. I haven't tried to
> establish relationships between metadata of the books and bad epubs or
> anything fancy like that - I just gave up on epubs after noticing this
> happening over and over, and went to PDFs instead.


Interesting that you should say that, because it matches exactly with a
hypothesis that I developed by inspection of the sole example.

The _scandata.xml
<https://ia700802.us.archive.org/32/items/billgalactichero00harr/billgalactichero00harr_scandata.xml>
file contains these page types:

Normal 221
Chapter 19
Color Card 2
Cover 2
Copyright 1
Title 1

The code
<https://github.com/internetarchive/epub/blob/master/iabook_to_daisy.py#L154>
handles cover, title, title page, copyright, contents, and normal.  Can you
see what page type is missing?  Every scan where the page layout analysis
has identified a chapter start page will be missing that page from its epub.
No idea what proportion of the IA books this affects.
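
For anyone who wants to check another book the same way, here is a rough
sketch of the tally. It is not code from the epub repo. It assumes that
scandata.xml records each page's type in a <pageType> element, and the
HANDLED set is simply the list from this message, not read out of
iabook_to_daisy.py. The function names and the SAMPLE data are made up for
illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Page types listed in this message as handled by the converter.
# Assumption: transcribed from the email, not extracted from the code.
HANDLED = {"cover", "title", "title page", "copyright", "contents", "normal"}


def count_page_types(scandata_xml):
    """Count <pageType> values in a scandata.xml document given as a string."""
    root = ET.fromstring(scandata_xml)
    counts = Counter()
    for elem in root.iter():
        # Drop any XML namespace prefix so the match works either way.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "pageType" and elem.text:
            counts[elem.text.strip()] += 1
    return counts


def unhandled_page_types(counts, handled=HANDLED):
    """Return the page types (with counts) that are not in the handled set."""
    return {t: n for t, n in counts.items() if t.lower() not in handled}


# Tiny hypothetical sample, shaped like the assumed scandata layout.
SAMPLE = """<book><pageData>
  <page leafNum="0"><pageType>Color Card</pageType></page>
  <page leafNum="1"><pageType>Cover</pageType></page>
  <page leafNum="2"><pageType>Title</pageType></page>
  <page leafNum="3"><pageType>Copyright</pageType></page>
  <page leafNum="4"><pageType>Chapter</pageType></page>
  <page leafNum="5"><pageType>Normal</pageType></page>
  <page leafNum="6"><pageType>Normal</pageType></page>
  <page leafNum="7"><pageType>Chapter</pageType></page>
  <page leafNum="8"><pageType>Normal</pageType></page>
</pageData></book>"""

if __name__ == "__main__":
    counts = count_page_types(SAMPLE)
    for page_type, n in sorted(unhandled_page_types(counts).items()):
        print(page_type, n)
```

Note that a literal comparison against that list flags Color Card as well
as Chapter. Color Card pages are presumably not text pages, so Chapter is
the one that matters here.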

I've added the bug to the issue tracker:
https://github.com/internetarchive/epub/issues/31

I've got a fix in hand and will generate a pull request as soon as I have
some test data to verify it with.

Tom
_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
Archives: http://www.mail-archive.com/[email protected]/
To unsubscribe from this mailing list, send email to 
[email protected]
