Re: Unexpected end of ZLIB input stream when parsing pdf files

Alexander Aristov Wed, 29 Oct 2008 04:49:23 -0700

Hi

As far as I know there are no other such parameters, and your configuration
should work. If you took the standard Nutch version from the repository, you
may be hitting problems in the PDF parser itself. I had to update to
pdfbox 0.7.4-dev because 0.7.3 contains some bugs.
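
For illustration only, here is a minimal Java sketch of why a truncated
download produces this exact exception. It is not Nutch or PDFBox code; the
class and method names (TruncatedZlibDemo, compress, decompress, describe) are
made up for the example. It assumes the failing PDF stream is zlib-compressed,
which is what the error message suggests. Inflating a complete stream works,
while inflating only the first half of it should fail with
java.io.EOFException.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical demo class, not part of Nutch or PDFBox.
class TruncatedZlibDemo {

    // Compress bytes into a zlib stream held in memory.
    static byte[] compress(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Inflate a zlib stream to the end; throws if the stream stops early.
    static byte[] decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        }
        return buffer.toByteArray();
    }

    // Report either the inflated size or the exception that was raised.
    static String describe(byte[] compressed) {
        try {
            return "ok, " + decompress(compressed).length + " bytes";
        } catch (IOException e) {
            return e.getClass().getName() + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Stand-in for the content of a compressed PDF stream.
        byte[] raw = new byte[8192];
        for (int i = 0; i < raw.length; i++) {
            raw[i] = (byte) (i % 251);
        }

        byte[] full = compress(raw);
        // Simulate a fetch that was cut off by a content limit.
        byte[] truncated = Arrays.copyOf(full, full.length / 2);

        System.out.println("complete stream:  " + describe(full));
        System.out.println("truncated stream: " + describe(truncated));
    }
}
```

If the fetched file size on disk is smaller than the size the server reports,
truncation is the likely cause; if the sizes match, the parser version is the
more likely suspect.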


Alex

2008/10/29 olivier_coface <[EMAIL PROTECTED]>

>
> Hello Alexander,
> In my nutch-site.xml, both file.content.limit and http.content.limit are
> set to -1 (meaning no limit).
> This works fine for MS Word documents.
> Do you think another parameter has to be set for PDF files?
>
>
>
> Alexander Aristov wrote:
> >
> > I suspect that Nutch has not downloaded the full PDF. There is a setting
> > in the Nutch config file that truncates large files. It's efficient for
> > HTML but might cause such errors for other formats.
> >
> > Check this setting and adjust accordingly.
> >
> > Alexander
> >
> > 2008/10/29 olivier_coface <[EMAIL PROTECTED]>
> >
> >>
> >> I had the following error when crawling PDF files (it happened on 2
> >> PDF files):
> >>
> >> http://lyra:85/ExternalDocumentation/BusinessComponentApproach_Chapter2.pdf:
> >> failed(2,0): Can't be handled as pdf document. java.io.EOFException:
> >> Unexpected end of ZLIB input stream
> >>
> >> Any idea?
> >> --
> >> View this message in context:
> >>
> http://www.nabble.com/Unexpected-end-of-ZLIB-input-stream-when-parsing-pdf-files-tp20223893p20223893.html
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> >>
> >>
> >
> >
> > --
> > Best Regards
> > Alexander Aristov
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Unexpected-end-of-ZLIB-input-stream-when-parsing-pdf-files-tp20223893p20224816.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>


-- 
Best Regards
Alexander Aristov
