Re: [ol-tech] Epubs with missing pages (was Re: What data cleanups would you like to see?)

Tom Morris Thu, 05 Nov 2015 17:45:49 -0800

Thanks for the quick reply, Jon.  It really needs to be something which
doesn't require borrowing, because, as far as I know, the source for those
will always be unavailable.

For this example, the directory of files is:
https://archive.org/download/billgalactichero00harr
and the OCR text (what I call the "source") is:
https://archive.org/download/billgalactichero00harr/billgalactichero00harr_abbyy.gz
which gives a permission violation when you attempt to download it.
Ditto with the epub:
https://archive.org/download/billgalactichero00harr/billgalactichero00harr.epub
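
For anyone hunting for a public test case, a small script can check whether
an item's OCR source and epub download without a permission error.  This is
only a sketch: the URL pattern is copied from the example above, the helper
names candidate_urls and is_public are made up for illustration, and it
assumes that restricted files answer with an HTTP error status (such as 401
or 403) rather than some other kind of response.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Download prefix taken from the example URLs in this message.
BASE = "https://archive.org/download"


def candidate_urls(identifier):
    """Build the OCR-source and epub URLs for one item identifier."""
    return {
        "abbyy": f"{BASE}/{identifier}/{identifier}_abbyy.gz",
        "epub": f"{BASE}/{identifier}/{identifier}.epub",
    }


def is_public(url, opener=urlopen, timeout=30):
    """Return True if the URL answers with a 2xx status, False on an HTTP error.

    Assumption: a restricted file produces an HTTP error status, which
    urllib raises as HTTPError.  Network failures are not caught here.
    """
    request = Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except HTTPError:
        return False


if __name__ == "__main__":
    for name, url in candidate_urls("billgalactichero00harr").items():
        print(name, "public" if is_public(url) else "restricted", url)
```

An item for which both URLs come back public would be a candidate test case
for the issue tracker.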

I really need a test case that is publicly accessible.  Of course, the IA
staff has access to everything, so they could fix the bug themselves, but
for a community contributed fix we'll need a public test case to have much
hope of finding it.  Also, any patterns that you've noticed about when
pages are missing and when they're not would be helpful.

Tom

On Thu, Nov 5, 2015 at 8:26 PM, Jon Leech <oddh...@sonic.net> wrote:

> On Thu, Nov 05, 2015 at 08:07:02PM -0500, Tom Morris wrote:
> > On Mon, Sep 28, 2015 at 2:28 AM, Greg Lindahl <lind...@pbm.com> wrote:
> > > On Fri, Sep 25, 2015 at 05:06:13PM -0700, Jon Leech wrote:
> > >
> > > >     Just for comparison, though, here's an example:
> > > >
> > > >     https://openlibrary.org/books/OL7433769M/Bill_the_Galactic_Hero
> > >
> > > The code which does the conversion is open, and hasn't been updated
> > > since 2011. See https://github.com/internetarchive/epub
> > >
> > > Perhaps we can dig up a Python-literate volunteer who'd like to look
> > > into it?
> > >
> >
> > If it gets added to the repo's issue tracker (
> > https://github.com/internetarchive/epub/issues), preferably with examples
> > where both the source and generated epub are available, I'll take a look
> > when I have time.  This example requires the epub to be borrowed and the
> > source XML isn't available at all.
>
>     I can give a few more examples of epubs I previously borrowed that I
> noted were missing pages - but I don't think the public catalog says
> what the source XML is, or whether it's available for a title, does it?
>
>     Would there be any point to my filing a github issue with more
> sample titles, in the hope one of them had the source available for
> someone who has the internal OL access to make use of it? That seems to
> be the blocking point. I'm actually willing to take a look at the epub
> generation code myself, but without the input documents to refer to,
> would have no way of diagnosing the problem.
>
>     Jon
> _______________________________________________
> Ol-tech mailing list
> ol-tech@archive.org
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
> Archives: http://www.mail-archive.com/ol-tech@archive.org/
> To unsubscribe from this mailing list, send email to
> ol-tech-unsubscr...@archive.org
>

