Hi,

As far as I know there are no other such parameters, and your configuration should work well. If you took the standard Nutch version from the repository, you might be running into problems with the PDF parser itself. I had to update to pdfbox 0.7.4-dev, as 0.7.3 contains some bugs.
Alex

2008/10/29 olivier_coface <[EMAIL PROTECTED]>

> Hello Alexander,
> In my nutch-site.xml, both file.content.limit and http.content.limit are set
> to -1 (meaning no limit).
> This works fine for msword documents.
> Do you think that another parameter has to be set for pdf files?
>
> Alexander Aristov wrote:
> >
> > I suspect that Nutch has not downloaded full pdf. There is a setting in the
> > nutch config file to truncate large files. It's efficient for html but might
> > cause such errors for other formats.
> >
> > Check this setting and adjust accordingly.
> >
> > Alexander
> >
> > 2008/10/29 olivier_coface <[EMAIL PROTECTED]>
> >
> >> I had the following error when crawling on pdf files (it happened on 2 pdf
> >> files):
> >>
> >> http://lyra:85/ExternalDocumentation/BusinessComponentApproach_Chapter2.pdf:
> >> failed(2,0): Can't be handled as pdf document. java.io.EOFException:
> >> Unexpected end of ZLIB input stream
> >>
> >> Any idea?
> >> --
> >> View this message in context:
> >> http://www.nabble.com/Unexpected-end-of-ZLIB-input-stream-when-parsing-pdf-files-tp20223893p20223893.html
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> >
> > --
> > Best Regards
> > Alexander Aristov
>
> --
> View this message in context:
> http://www.nabble.com/Unexpected-end-of-ZLIB-input-stream-when-parsing-pdf-files-tp20223893p20224816.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
Best Regards
Alexander Aristov
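
For anyone who hits the same message: the exception text in the original report is what the JDK's InflaterInputStream raises when a zlib stream ends before it is complete. That is the kind of failure a truncated download could cause, since PDF streams are commonly Flate (zlib) compressed. The sketch below only reproduces that behaviour with the plain JDK. It is not Nutch or pdfbox code, and the class and method names (TruncatedZlibDemo, deflate, inflate) are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical demo class: shows how cutting a zlib stream short
// leads to java.io.EOFException when the data is decompressed.
public class TruncatedZlibDemo {

    /** Compresses data into a zlib stream. */
    static byte[] deflate(byte[] data) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(buffer)) {
            out.write(data);
        }
        return buffer.toByteArray();
    }

    /** Decompresses a whole zlib stream; throws EOFException if the input ends early. */
    static byte[] inflate(byte[] compressed) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[1024];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Build some sample content (stands in for a compressed stream inside a PDF).
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            text.append("sample page content, line ").append(i).append('\n');
        }
        byte[] original = text.toString().getBytes(StandardCharsets.UTF_8);
        byte[] compressed = deflate(original);

        // The complete stream round-trips without error.
        System.out.println("round trip ok: " + Arrays.equals(original, inflate(compressed)));

        // Keep only the first half, as a content limit on the fetcher would.
        byte[] truncated = Arrays.copyOf(compressed, compressed.length / 2);
        try {
            inflate(truncated);
            System.out.println("truncated stream inflated without error");
        } catch (EOFException e) {
            System.out.println("truncated stream failed: " + e.getMessage());
        }
    }
}
```

If the content limits are already -1, as in this thread, the same exception can still come from the parser itself, which is why trying a newer pdfbox build is worth doing.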
